[jira] [Updated] (SPARK-8561) Drop table can only drop the tables under database default
[ https://issues.apache.org/jira/browse/SPARK-8561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8561: - Component/s: SQL Drop table can only drop the tables under database default Key: SPARK-8561 URL: https://issues.apache.org/jira/browse/SPARK-8561 Project: Spark Issue Type: Bug Components: SQL Reporter: baishuo -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8588) Could not use concat with UDF in where clause
[ https://issues.apache.org/jira/browse/SPARK-8588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai updated SPARK-8588: Priority: Critical (was: Blocker) Could not use concat with UDF in where clause - Key: SPARK-8588 URL: https://issues.apache.org/jira/browse/SPARK-8588 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Environment: CentOS 7, Java 1.7.0_67, Scala 2.10.5, run in a Spark standalone cluster (or local). Reporter: StanZhai Priority: Critical After upgrading the cluster from Spark 1.3.1 to 1.4.0 (rc4), I encountered the following exception when using concat with a UDF in the where clause:
{code}
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'concat(HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(date#1776),年)
  at org.apache.spark.sql.catalyst.analysis.UnresolvedFunction.dataType(unresolved.scala:82)
  at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5$$anonfun$applyOrElse$15.apply(HiveTypeCoercion.scala:299)
  at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5$$anonfun$applyOrElse$15.apply(HiveTypeCoercion.scala:299)
  at scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:80)
  at scala.collection.immutable.List.exists(List.scala:84)
  at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5.applyOrElse(HiveTypeCoercion.scala:299)
  at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5.applyOrElse(HiveTypeCoercion.scala:298)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionDown$1(QueryPlan.scala:75)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:85)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
  at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
  at scala.collection.AbstractIterator.to(Iterator.scala:1157)
  at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
  at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
  at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
  at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:94)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:64)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformAllExpressions$1.applyOrElse(QueryPlan.scala:136)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformAllExpressions$1.applyOrElse(QueryPlan.scala:135)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:242)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
{code}
[jira] [Updated] (SPARK-8587) Return cost and cluster index KMeansModel.predict
[ https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8587: - Component/s: MLlib Return cost and cluster index KMeansModel.predict - Key: SPARK-8587 URL: https://issues.apache.org/jira/browse/SPARK-8587 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Sam Stoelinga Priority: Minor Looking at the PySpark implementation of KMeansModel.predict (https://github.com/apache/spark/blob/master/python/pyspark/mllib/clustering.py#L102): currently it calculates the cost of the closest cluster but returns only the index. My expectation: an easy way for the same function, or a new one, to also return the cost along with the index.
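The behavior requested above can be sketched in plain Python. This is a hypothetical helper, not the actual MLlib API: given the cluster centers and a point, return both the index of the closest center and its cost (the squared Euclidean distance, which is what k-means minimizes).

```python
def predict_with_cost(centers, point):
    """Return (index, cost) of the closest cluster center.

    `centers` is a list of centers, each a list of floats; `point` is a
    list of floats. The cost is the squared Euclidean distance.
    (Hypothetical helper illustrating the requested behavior.)
    """
    best_index, best_cost = -1, float("inf")
    for i, center in enumerate(centers):
        cost = sum((c - p) ** 2 for c, p in zip(center, point))
        if cost < best_cost:
            best_index, best_cost = i, cost
    return best_index, best_cost
```

The point is that both values are computed in the same pass anyway, so returning them together costs nothing extra.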
[jira] [Updated] (SPARK-8551) Python example code for elastic net
[ https://issues.apache.org/jira/browse/SPARK-8551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8551: - Component/s: PySpark Priority: Minor (was: Major) Python example code for elastic net --- Key: SPARK-8551 URL: https://issues.apache.org/jira/browse/SPARK-8551 Project: Spark Issue Type: New Feature Components: PySpark Reporter: Shuo Xiang Priority: Minor
[jira] [Updated] (SPARK-8585) Support LATERAL VIEW in Spark SQL parser
[ https://issues.apache.org/jira/browse/SPARK-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8585: - Component/s: SQL Priority: Minor (was: Major) (Components et al please: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark) Support LATERAL VIEW in Spark SQL parser Key: SPARK-8585 URL: https://issues.apache.org/jira/browse/SPARK-8585 Project: Spark Issue Type: Improvement Components: SQL Reporter: Konstantin Shaposhnikov Priority: Minor It would be good to support the LATERAL VIEW SQL syntax without the need to create a HiveContext. Docs: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView
[jira] [Resolved] (SPARK-8565) TF-IDF drops records
[ https://issues.apache.org/jira/browse/SPARK-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8565. -- Resolution: Not A Problem I can't find that bit of the docs, but I assume it refers to something done by the TF-IDF process. If you count the source (or some other transformation of the source) and then later apply TF-IDF, even if that caches something, it's already caching a different view. TF-IDF drops records Key: SPARK-8565 URL: https://issues.apache.org/jira/browse/SPARK-8565 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.1 Reporter: PJ Van Aeken When applying TF-IDF to an RDD[Seq[String]] with 1213 records, I get an RDD[Vector] back with only 1204 records. This prevents me from zipping it with the original so I can reattach the document ids.
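The reporter's expectation is that TF-IDF emits exactly one vector per input document, so the output can be zipped back with document ids. That invariant can be illustrated with a minimal plain-Python TF-IDF sketch (not Spark's implementation; the helper name is made up):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Minimal TF-IDF sketch: one output dict per input document,
    so len(output) == len(docs) and zipping with ids is safe.

    `docs` is a list of token lists. IDF here is log(N / df), with no
    smoothing (a simplification of what real implementations do).
    """
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    out = []
    for doc in docs:
        tf = Counter(doc)               # term frequency in this doc
        out.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return out
```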
[jira] [Assigned] (SPARK-8589) cleanup DateTimeUtils
[ https://issues.apache.org/jira/browse/SPARK-8589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8589: --- Assignee: (was: Apache Spark) cleanup DateTimeUtils - Key: SPARK-8589 URL: https://issues.apache.org/jira/browse/SPARK-8589 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan
[jira] [Comment Edited] (SPARK-8072) Better AnalysisException for writing DataFrame with identically named columns
[ https://issues.apache.org/jira/browse/SPARK-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14598890#comment-14598890 ] Animesh Baranawal edited comment on SPARK-8072 at 6/24/15 6:54 AM: --- [~rxin] If we want the rule to apply only on some save/output action, wouldn't it be much better to check the rule before calling the write function instead of adding the rule in CheckAnalysis.scala was (Author: animeshbaranawal): [~rxin] If we want the rule to apply only on some save/output action, wouldn't it be much more intuitive to check the rule before calling the write function instead of adding the rule in CheckAnalysis.scala Better AnalysisException for writing DataFrame with identically named columns - Key: SPARK-8072 URL: https://issues.apache.org/jira/browse/SPARK-8072 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Blocker We should check if there are duplicate columns and, if yes, throw an explicit error message saying there are duplicate columns. See the current error message below.
{code}
In [3]: df.withColumn('age', df.age)
Out[3]: DataFrame[age: bigint, name: string, age: bigint]

In [4]: df.withColumn('age', df.age).write.parquet('test-parquet.out')
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-4-eecb85256898> in <module>()
----> 1 df.withColumn('age', df.age).write.parquet('test-parquet.out')

/scratch/rxin/spark/python/pyspark/sql/readwriter.py in parquet(self, path, mode)
    350         df.write.parquet(os.path.join(tempfile.mkdtemp(), 'data'))
    351
--> 352         self._jwrite.mode(mode).parquet(path)
    353
    354     @since(1.4)

/Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/java_gateway.pyc in __call__(self, *args)
    535         answer = self.gateway_client.send_command(command)
    536         return_value = get_return_value(answer, self.gateway_client,
--> 537                 self.target_id, self.name)
    538
    539         for temp_arg in temp_args:

/Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
    298                 raise Py4JJavaError(
    299                     'An error occurred while calling {0}{1}{2}.\n'.
--> 300                     format(target_id, '.', name), value)
    301             else:
    302                 raise Py4JError(

Py4JJavaError: An error occurred while calling o35.parquet.
: org.apache.spark.sql.AnalysisException: Reference 'age' is ambiguous, could be: age#0L, age#3L.;
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:279)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:116)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
  at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:350)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:341)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.immutable.List.foreach(List.scala:318)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
  at scala.collection.AbstractTraversable.map(Traversable.scala:105)
{code}
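The check proposed in this ticket amounts to validating the column names up front, before the write is handed to the analyzer, and naming the offending columns explicitly. A minimal sketch of such a pre-write check in plain Python (a hypothetical helper, not Spark's actual code path):

```python
from collections import Counter

def check_duplicate_columns(columns):
    """Raise a clear error naming the duplicated columns, instead of
    letting the analyzer fail later with an ambiguous-reference error.
    (Hypothetical helper sketching the proposed check.)
    """
    dupes = [name for name, count in Counter(columns).items() if count > 1]
    if dupes:
        raise ValueError(
            "Duplicate column(s) found when writing: " + ", ".join(dupes))
```

For the example above, the check would fire on `['age', 'name', 'age']` and mention `age` directly, which is far easier to act on than `Reference 'age' is ambiguous`.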
[jira] [Resolved] (SPARK-8138) Error message for discovered conflicting partition columns is not intuitive
[ https://issues.apache.org/jira/browse/SPARK-8138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-8138. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6610 [https://github.com/apache/spark/pull/6610] Error message for discovered conflicting partition columns is not intuitive --- Key: SPARK-8138 URL: https://issues.apache.org/jira/browse/SPARK-8138 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0, 1.3.1, 1.4.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Minor Fix For: 1.5.0 For data stored as a Hive-style partitioned table, data files should only live in leaf partition directories. For example, the following directory layout is illegal:
{noformat}
.
├── _SUCCESS
├── b=0
│   ├── c=0
│   │   └── part-r-4.gz.parquet
│   └── part-r-4.gz.parquet
└── b=1
    ├── c=1
    │   └── part-r-8.gz.parquet
    └── part-r-8.gz.parquet
{noformat}
For now, we give an unintuitive error message like this:
{noformat}
Conflicting partition column names detected:
  ArrayBuffer(b, c)
  ArrayBuffer(b)
{noformat}
This should be improved.
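The improvement asked for here is mostly about reporting: when partition discovery finds files whose paths imply different partition column sets, say which paths disagree, not just the conflicting column lists. A plain-Python sketch of such a check (hypothetical helper names; not Spark's partitioning code):

```python
def partition_columns(path):
    """Extract Hive-style partition column names from a relative file
    path, e.g. 'b=0/c=0/part-r-4.gz.parquet' -> ('b', 'c')."""
    return tuple(seg.split("=")[0] for seg in path.split("/")[:-1] if "=" in seg)

def check_consistent_partitions(paths):
    """All data files must agree on the same partition columns; if not,
    report an example path for each conflicting column set."""
    seen = {}
    for p in paths:
        seen.setdefault(partition_columns(p), []).append(p)
    if len(seen) > 1:
        detail = "\n".join(
            "  {}: e.g. {}".format(list(cols), ps[0]) for cols, ps in seen.items())
        raise ValueError("Conflicting partition column names detected:\n" + detail)
```

Applied to the illegal layout above, the error would point at both `b=0/c=0/part-r-4.gz.parquet` and `b=0/part-r-4.gz.parquet`, making the misplaced non-leaf file obvious.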
[jira] [Commented] (SPARK-8589) cleanup DateTimeUtils
[ https://issues.apache.org/jira/browse/SPARK-8589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599113#comment-14599113 ] Apache Spark commented on SPARK-8589: - User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/6980 cleanup DateTimeUtils - Key: SPARK-8589 URL: https://issues.apache.org/jira/browse/SPARK-8589 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan
[jira] [Assigned] (SPARK-8589) cleanup DateTimeUtils
[ https://issues.apache.org/jira/browse/SPARK-8589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8589: --- Assignee: Apache Spark cleanup DateTimeUtils - Key: SPARK-8589 URL: https://issues.apache.org/jira/browse/SPARK-8589 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan Assignee: Apache Spark
[jira] [Commented] (SPARK-8587) Return cost and cluster index KMeansModel.predict
[ https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599006#comment-14599006 ] Apache Spark commented on SPARK-8587: - User 'samos123' has created a pull request for this issue: https://github.com/apache/spark/pull/6979 Return cost and cluster index KMeansModel.predict - Key: SPARK-8587 URL: https://issues.apache.org/jira/browse/SPARK-8587 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Sam Stoelinga Priority: Minor Looking at the PySpark implementation of KMeansModel.predict (https://github.com/apache/spark/blob/master/python/pyspark/mllib/clustering.py#L102): currently it calculates the cost of the closest cluster but returns only the index. My expectation: an easy way for the same function, or a new one, to also return the cost along with the index.
[jira] [Commented] (SPARK-8587) Return cost and cluster index KMeansModel.predict
[ https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599007#comment-14599007 ] Sam Stoelinga commented on SPARK-8587: -- Implemented a code example for PySpark: https://github.com/apache/spark/pull/6979. Feel free to discard this pull request in favor of a proper implementation in Scala and Java as well. Return cost and cluster index KMeansModel.predict - Key: SPARK-8587 URL: https://issues.apache.org/jira/browse/SPARK-8587 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Sam Stoelinga Priority: Minor Looking at the PySpark implementation of KMeansModel.predict (https://github.com/apache/spark/blob/master/python/pyspark/mllib/clustering.py#L102): currently it calculates the cost of the closest cluster but returns only the index. My expectation: an easy way for the same function, or a new one, to also return the cost along with the index.
[jira] [Assigned] (SPARK-8587) Return cost and cluster index KMeansModel.predict
[ https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8587: --- Assignee: (was: Apache Spark) Return cost and cluster index KMeansModel.predict - Key: SPARK-8587 URL: https://issues.apache.org/jira/browse/SPARK-8587 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Sam Stoelinga Priority: Minor Looking at the PySpark implementation of KMeansModel.predict (https://github.com/apache/spark/blob/master/python/pyspark/mllib/clustering.py#L102): currently it calculates the cost of the closest cluster but returns only the index. My expectation: an easy way for the same function, or a new one, to also return the cost along with the index.
[jira] [Assigned] (SPARK-8587) Return cost and cluster index KMeansModel.predict
[ https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8587: --- Assignee: Apache Spark Return cost and cluster index KMeansModel.predict - Key: SPARK-8587 URL: https://issues.apache.org/jira/browse/SPARK-8587 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Sam Stoelinga Assignee: Apache Spark Priority: Minor Looking at the PySpark implementation of KMeansModel.predict (https://github.com/apache/spark/blob/master/python/pyspark/mllib/clustering.py#L102): currently it calculates the cost of the closest cluster but returns only the index. My expectation: an easy way for the same function, or a new one, to also return the cost along with the index.
[jira] [Updated] (SPARK-8431) Add in operator to DataFrame Column in SparkR
[ https://issues.apache.org/jira/browse/SPARK-8431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8431: - Assignee: Yu Ishikawa Add in operator to DataFrame Column in SparkR - Key: SPARK-8431 URL: https://issues.apache.org/jira/browse/SPARK-8431 Project: Spark Issue Type: New Feature Components: SparkR, SQL Reporter: Yu Ishikawa Assignee: Yu Ishikawa Fix For: 1.5.0 To filter values in a set, we should add the {{%in%}} operator to SparkR.
{noformat}
df$a %in% c(1, 2, 3)
{noformat}
[jira] [Updated] (SPARK-8359) Spark SQL Decimal type precision loss on multiplication
[ https://issues.apache.org/jira/browse/SPARK-8359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8359: - Assignee: Liang-Chi Hsieh Spark SQL Decimal type precision loss on multiplication --- Key: SPARK-8359 URL: https://issues.apache.org/jira/browse/SPARK-8359 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Rene Treffer Assignee: Liang-Chi Hsieh Fix For: 1.5.0 It looks like the precision of Decimal cannot be raised beyond ~2^112 without causing full value truncation. The following code computes the powers of two up to a specific point:
{code}
import org.apache.spark.sql.types.Decimal

val one = Decimal(1)
val two = Decimal(2)

def pow(n: Int): Decimal = if (n <= 0) {
  one
} else {
  val a = pow(n - 1)
  a.changePrecision(n, 0)
  two.changePrecision(n, 0)
  a * two
}

(109 to 120).foreach(n => println(pow(n).toJavaBigDecimal.unscaledValue.toString))
{code}
Output:
{code}
649037107316853453566312041152512
1298074214633706907132624082305024
2596148429267413814265248164610048
5192296858534827628530496329220096
1038459371706965525706099265844019
2076918743413931051412198531688038
4153837486827862102824397063376076
8307674973655724205648794126752152
1661534994731144841129758825350430
3323069989462289682259517650700860
6646139978924579364519035301401720
1329227995784915872903807060280344
{code}
Beyond ~2^112 the precision is truncated even though the precision was set to n and should thus handle 10^n without problems.
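The same class of failure can be demonstrated with Python's `decimal` module rather than Spark's Decimal (an analogy, not the Spark code path): when the context precision is smaller than the number of digits in the product, the low-order digits are silently rounded away, which is exactly the truncation visible in the output above.

```python
from decimal import Decimal, localcontext

# 2**120 has 37 decimal digits.
with localcontext() as ctx:
    ctx.prec = 50   # ample precision: the product stays exact
    exact = Decimal(2) ** 120

with localcontext() as ctx:
    ctx.prec = 33   # too little precision: low-order digits are lost
    truncated = Decimal(2) ** 120
```

With 33 significant digits the 37-digit value is rounded, so `truncated` no longer equals the true power of two, mirroring how the reported values go wrong past ~2^112.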
[jira] [Updated] (SPARK-7235) Refactor the GroupingSet implementation
[ https://issues.apache.org/jira/browse/SPARK-7235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7235: - Assignee: Cheng Hao Refactor the GroupingSet implementation --- Key: SPARK-7235 URL: https://issues.apache.org/jira/browse/SPARK-7235 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Fix For: 1.5.0 The logical plan `Expand` takes the `output` as a constructor argument, which breaks the reference chain for logical plan optimization. We need to refactor the code.
[jira] [Updated] (SPARK-8104) move the auto alias logic into Analyzer
[ https://issues.apache.org/jira/browse/SPARK-8104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8104: - Assignee: Wenchen Fan move the auto alias logic into Analyzer --- Key: SPARK-8104 URL: https://issues.apache.org/jira/browse/SPARK-8104 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan Fix For: 1.5.0 Currently we auto-alias expressions in the parser. However, during the parse phase we don't have enough information to pick the right alias. For example, a Generator that produces more than one kind of element needs a MultiAlias, and an ExtractValue doesn't need an Alias if it's in the middle of an ExtractValue chain.
[jira] [Created] (SPARK-8589) cleanup DateTimeUtils
Wenchen Fan created SPARK-8589: -- Summary: cleanup DateTimeUtils Key: SPARK-8589 URL: https://issues.apache.org/jira/browse/SPARK-8589 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan
[jira] [Assigned] (SPARK-8181) date/time function: hour
[ https://issues.apache.org/jira/browse/SPARK-8181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8181: --- Assignee: (was: Apache Spark) date/time function: hour Key: SPARK-8181 URL: https://issues.apache.org/jira/browse/SPARK-8181 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin hour(string|date|timestamp): int Returns the hour of the timestamp: hour('2009-07-30 12:58:59') = 12, hour('12:58:59') = 12.
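The semantics in the description (both full timestamps and time-only strings are accepted) can be sketched in plain Python with the standard `datetime` module. This is an illustration of the contract, not the Spark SQL implementation:

```python
from datetime import datetime

def hour(ts):
    """hour('2009-07-30 12:58:59') -> 12; hour('12:58:59') -> 12.
    Tries a full timestamp format first, then a time-only format.
    (Sketch of the described semantics, not Spark's code.)
    """
    for fmt in ("%Y-%m-%d %H:%M:%S", "%H:%M:%S"):
        try:
            return datetime.strptime(ts, fmt).hour
        except ValueError:
            pass
    raise ValueError("unrecognized timestamp: " + ts)
```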
[jira] [Assigned] (SPARK-8182) date/time function: minute
[ https://issues.apache.org/jira/browse/SPARK-8182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8182: --- Assignee: Apache Spark date/time function: minute -- Key: SPARK-8182 URL: https://issues.apache.org/jira/browse/SPARK-8182 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark minute(string|date|timestamp): int Returns the minute of the timestamp. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
[jira] [Commented] (SPARK-8199) date/time function: date_format
[ https://issues.apache.org/jira/browse/SPARK-8199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599123#comment-14599123 ] Apache Spark commented on SPARK-8199: - User 'tarekauel' has created a pull request for this issue: https://github.com/apache/spark/pull/6981 date/time function: date_format --- Key: SPARK-8199 URL: https://issues.apache.org/jira/browse/SPARK-8199 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin date_format(date/timestamp/string ts, string fmt): string Converts a date/timestamp/string to a value of string in the format specified by the date format fmt (as of Hive 1.2.0). Supported formats are Java SimpleDateFormat formats – https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html. The second argument fmt should be constant. Example: date_format('2015-04-08', 'y') = '2015'. date_format can be used to implement other UDFs, e.g.: dayname(date) is date_format(date, '') dayofyear(date) is date_format(date, 'D')
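The idea of date_format, and of deriving functions like dayname and dayofyear from it, can be sketched in plain Python. Note that Hive's date_format uses Java SimpleDateFormat pattern letters ('y', 'D', etc.), whereas this sketch uses Python strftime codes ('%Y', '%j', '%A'); the helper names are hypothetical:

```python
from datetime import datetime

def date_format(ts, fmt):
    """Rough analog of Hive's date_format for date strings, using
    Python strftime codes instead of SimpleDateFormat patterns."""
    return datetime.strptime(ts, "%Y-%m-%d").strftime(fmt)

# Derived functions, mirroring the examples in the description:
def dayname(ts):
    return date_format(ts, "%A")    # '%A' plays the role of the day-name pattern

def dayofyear(ts):
    return date_format(ts, "%j")    # '%j' plays the role of 'D'
```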
[jira] [Assigned] (SPARK-8199) date/time function: date_format
[ https://issues.apache.org/jira/browse/SPARK-8199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8199: --- Assignee: (was: Apache Spark) date/time function: date_format --- Key: SPARK-8199 URL: https://issues.apache.org/jira/browse/SPARK-8199 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin date_format(date/timestamp/string ts, string fmt): string Converts a date/timestamp/string to a value of string in the format specified by the date format fmt (as of Hive 1.2.0). Supported formats are Java SimpleDateFormat formats – https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html. The second argument fmt should be constant. Example: date_format('2015-04-08', 'y') = '2015'. date_format can be used to implement other UDFs, e.g.: dayname(date) is date_format(date, '') dayofyear(date) is date_format(date, 'D')
[jira] [Assigned] (SPARK-8179) date/time function: month
[ https://issues.apache.org/jira/browse/SPARK-8179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8179: --- Assignee: (was: Apache Spark) date/time function: month - Key: SPARK-8179 URL: https://issues.apache.org/jira/browse/SPARK-8179 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin month(string|date|timestamp): int Returns the month part of a date or a timestamp string: month('1970-11-01 00:00:00') = 11, month('1970-11-01') = 11. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
[jira] [Commented] (SPARK-8180) date/time function: day / dayofmonth
[ https://issues.apache.org/jira/browse/SPARK-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599128#comment-14599128 ] Apache Spark commented on SPARK-8180: - User 'tarekauel' has created a pull request for this issue: https://github.com/apache/spark/pull/6981 date/time function: day / dayofmonth - Key: SPARK-8180 URL: https://issues.apache.org/jira/browse/SPARK-8180 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin day(string|date|timestamp): int dayofmonth(string|date|timestamp): int Returns the day part of a date or a timestamp string: day('1970-11-01 00:00:00') = 1, day('1970-11-01') = 1. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
[jira] [Assigned] (SPARK-8182) date/time function: minute
[ https://issues.apache.org/jira/browse/SPARK-8182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8182: --- Assignee: (was: Apache Spark) date/time function: minute -- Key: SPARK-8182 URL: https://issues.apache.org/jira/browse/SPARK-8182 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin minute(string|date|timestamp): int Returns the minute of the timestamp. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
[jira] [Commented] (SPARK-8184) date/time function: weekofyear
[ https://issues.apache.org/jira/browse/SPARK-8184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599124#comment-14599124 ] Apache Spark commented on SPARK-8184: - User 'tarekauel' has created a pull request for this issue: https://github.com/apache/spark/pull/6981 date/time function: weekofyear -- Key: SPARK-8184 URL: https://issues.apache.org/jira/browse/SPARK-8184 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin weekofyear(string|date|timestamp): int Returns the week number of a timestamp string: weekofyear('1970-11-01 00:00:00') = 44, weekofyear('1970-11-01') = 44.
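The week numbers in the description match the ISO-8601 week numbering, which Python's `datetime.isocalendar()` exposes directly. A plain-Python sketch of the described function (hypothetical helper, not the Spark SQL implementation):

```python
from datetime import datetime

def weekofyear(ts):
    """ISO week number of a date or timestamp string, matching the
    description's weekofyear('1970-11-01') = 44. Only the date part
    of a timestamp string matters for the week number."""
    date_part = ts.split(" ")[0]
    return datetime.strptime(date_part, "%Y-%m-%d").isocalendar()[1]
```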
[jira] [Commented] (SPARK-8177) date/time function: year
[ https://issues.apache.org/jira/browse/SPARK-8177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599130#comment-14599130 ] Apache Spark commented on SPARK-8177: - User 'tarekauel' has created a pull request for this issue: https://github.com/apache/spark/pull/6981 date/time function: year Key: SPARK-8177 URL: https://issues.apache.org/jira/browse/SPARK-8177 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin year(string|date|timestamp): int Returns the year part of a date or a timestamp string: year(1970-01-01 00:00:00) = 1970, year(1970-01-01) = 1970. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8180) date/time function: day / dayofmonth
[ https://issues.apache.org/jira/browse/SPARK-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8180: --- Assignee: (was: Apache Spark) date/time function: day / dayofmonth - Key: SPARK-8180 URL: https://issues.apache.org/jira/browse/SPARK-8180 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin day(string|date|timestamp): int dayofmonth(string|date|timestamp): int Returns the day part of a date or a timestamp string: day(1970-11-01 00:00:00) = 1, day(1970-11-01) = 1. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8199) date/time function: date_format
[ https://issues.apache.org/jira/browse/SPARK-8199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8199: --- Assignee: Apache Spark date/time function: date_format --- Key: SPARK-8199 URL: https://issues.apache.org/jira/browse/SPARK-8199 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark date_format(date/timestamp/string ts, string fmt): string Converts a date/timestamp/string to a value of string in the format specified by the date format fmt (as of Hive 1.2.0). Supported formats are Java SimpleDateFormat formats – https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html. The second argument fmt should be constant. Example: date_format('2015-04-08', 'y') = '2015'. date_format can be used to implement other UDFs, e.g.: dayname(date) is date_format(date, '') dayofyear(date) is date_format(date, 'D') -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
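The `date_format` behavior above can be approximated with `strftime`; note that Java `SimpleDateFormat` pattern letters differ from `strftime` directives, so the small mapping table below is an assumption covering only the two patterns quoted in the ticket ('y' for year, 'D' for day-of-year):

```python
from datetime import datetime

# Minimal SimpleDateFormat -> strftime mapping; only the patterns used
# in the ticket's examples are covered.
_PATTERNS = {"y": "%Y", "D": "%j"}

def date_format(ds: str, fmt: str) -> str:
    """Sketch of date_format() for date-string input."""
    return datetime.strptime(ds, "%Y-%m-%d").strftime(_PATTERNS[fmt])

print(date_format("2015-04-08", "y"))  # '2015'
print(date_format("2015-04-08", "D"))  # '098' (strftime %j zero-pads)
```

One visible difference: `%j` zero-pads the day-of-year, whereas `SimpleDateFormat`'s 'D' does not, so a faithful port would strip leading zeros.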
[jira] [Assigned] (SPARK-8183) date/time function: second
[ https://issues.apache.org/jira/browse/SPARK-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8183: --- Assignee: Apache Spark date/time function: second -- Key: SPARK-8183 URL: https://issues.apache.org/jira/browse/SPARK-8183 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark second(string|date|timestamp): int Returns the second of the timestamp. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8177) date/time function: year
[ https://issues.apache.org/jira/browse/SPARK-8177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8177: --- Assignee: (was: Apache Spark) date/time function: year Key: SPARK-8177 URL: https://issues.apache.org/jira/browse/SPARK-8177 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin year(string|date|timestamp): int Returns the year part of a date or a timestamp string: year(1970-01-01 00:00:00) = 1970, year(1970-01-01) = 1970. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8179) date/time function: month
[ https://issues.apache.org/jira/browse/SPARK-8179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8179: --- Assignee: Apache Spark date/time function: month - Key: SPARK-8179 URL: https://issues.apache.org/jira/browse/SPARK-8179 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark month(string|date|timestamp): int Returns the month part of a date or a timestamp string: month(1970-11-01 00:00:00) = 11, month(1970-11-01) = 11. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8183) date/time function: second
[ https://issues.apache.org/jira/browse/SPARK-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8183: --- Assignee: (was: Apache Spark) date/time function: second -- Key: SPARK-8183 URL: https://issues.apache.org/jira/browse/SPARK-8183 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin second(string|date|timestamp): int Returns the second of the timestamp. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8183) date/time function: second
[ https://issues.apache.org/jira/browse/SPARK-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599125#comment-14599125 ] Apache Spark commented on SPARK-8183: - User 'tarekauel' has created a pull request for this issue: https://github.com/apache/spark/pull/6981 date/time function: second -- Key: SPARK-8183 URL: https://issues.apache.org/jira/browse/SPARK-8183 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin second(string|date|timestamp): int Returns the second of the timestamp. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8181) date/time function: hour
[ https://issues.apache.org/jira/browse/SPARK-8181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599127#comment-14599127 ] Apache Spark commented on SPARK-8181: - User 'tarekauel' has created a pull request for this issue: https://github.com/apache/spark/pull/6981 date/time function: hour Key: SPARK-8181 URL: https://issues.apache.org/jira/browse/SPARK-8181 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin hour(string|date|timestamp): int Returns the hour of the timestamp: hour('2009-07-30 12:58:59') = 12, hour('12:58:59') = 12. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
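The hour() examples above include both a full timestamp and a bare time string; that dual acceptance can be sketched as follows (an illustrative model of the described semantics, not Spark's implementation):

```python
from datetime import datetime

def hour(value: str) -> int:
    """Sketch of hour() per the SPARK-8181 examples: accepts a full
    timestamp string or a bare HH:MM:SS time string."""
    for fmt in ("%Y-%m-%d %H:%M:%S", "%H:%M:%S"):
        try:
            return datetime.strptime(value, fmt).hour
        except ValueError:
            continue
    raise ValueError(f"unrecognized timestamp string: {value!r}")

print(hour("2009-07-30 12:58:59"))  # 12
print(hour("12:58:59"))             # 12
```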
[jira] [Assigned] (SPARK-8180) date/time function: day / dayofmonth
[ https://issues.apache.org/jira/browse/SPARK-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8180: --- Assignee: Apache Spark date/time function: day / dayofmonth - Key: SPARK-8180 URL: https://issues.apache.org/jira/browse/SPARK-8180 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark day(string|date|timestamp): int dayofmonth(string|date|timestamp): int Returns the day part of a date or a timestamp string: day(1970-11-01 00:00:00) = 1, day(1970-11-01) = 1. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8177) date/time function: year
[ https://issues.apache.org/jira/browse/SPARK-8177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8177: --- Assignee: Apache Spark date/time function: year Key: SPARK-8177 URL: https://issues.apache.org/jira/browse/SPARK-8177 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark year(string|date|timestamp): int Returns the year part of a date or a timestamp string: year(1970-01-01 00:00:00) = 1970, year(1970-01-01) = 1970. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8184) date/time function: weekofyear
[ https://issues.apache.org/jira/browse/SPARK-8184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8184: --- Assignee: Apache Spark date/time function: weekofyear -- Key: SPARK-8184 URL: https://issues.apache.org/jira/browse/SPARK-8184 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark weekofyear(string|date|timestamp): int Returns the week number of a timestamp string: weekofyear(1970-11-01 00:00:00) = 44, weekofyear(1970-11-01) = 44. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8179) date/time function: month
[ https://issues.apache.org/jira/browse/SPARK-8179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599129#comment-14599129 ] Apache Spark commented on SPARK-8179: - User 'tarekauel' has created a pull request for this issue: https://github.com/apache/spark/pull/6981 date/time function: month - Key: SPARK-8179 URL: https://issues.apache.org/jira/browse/SPARK-8179 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin month(string|date|timestamp): int Returns the month part of a date or a timestamp string: month(1970-11-01 00:00:00) = 11, month(1970-11-01) = 11. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8371) improve unit test for MaxOf and MinOf
[ https://issues.apache.org/jira/browse/SPARK-8371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8371. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6825 [https://github.com/apache/spark/pull/6825] improve unit test for MaxOf and MinOf - Key: SPARK-8371 URL: https://issues.apache.org/jira/browse/SPARK-8371 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8533) Bump Flume version to 1.6.0
[ https://issues.apache.org/jira/browse/SPARK-8533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8533: - Component/s: Streaming Priority: Minor (was: Major) Issue Type: Task (was: Bug) (Let's set component / type / priority) Bump Flume version to 1.6.0 --- Key: SPARK-8533 URL: https://issues.apache.org/jira/browse/SPARK-8533 Project: Spark Issue Type: Task Components: Streaming Reporter: Hari Shreedharan Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8567) Flaky test: o.a.s.sql.hive.HiveSparkSubmitSuite --jars
[ https://issues.apache.org/jira/browse/SPARK-8567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14598990#comment-14598990 ] Apache Spark commented on SPARK-8567: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/6978 Flaky test: o.a.s.sql.hive.HiveSparkSubmitSuite --jars -- Key: SPARK-8567 URL: https://issues.apache.org/jira/browse/SPARK-8567 Project: Spark Issue Type: Improvement Components: SQL, Tests Affects Versions: 1.4.1 Reporter: Yin Huai Assignee: Yin Huai Priority: Critical Labels: flaky-test Seems tests in HiveSparkSubmitSuite fail with timeout pretty frequently. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8588) Could not use concat with UDF in where clause
StanZhai created SPARK-8588: --- Summary: Could not use concat with UDF in where clause Key: SPARK-8588 URL: https://issues.apache.org/jira/browse/SPARK-8588 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Environment: Centos 7, java 1.7.0_67, scala 2.10.5, run in a spark standalone cluster(or local). Reporter: StanZhai Priority: Blocker After upgraded the cluster from spark 1.3.1 to 1.4.0(rc4), I encountered the following exception when use concat with UDF in where clause: {code} org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'concat(HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(date#1776),年) at org.apache.spark.sql.catalyst.analysis.UnresolvedFunction.dataType(unresolved.scala:82) at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5$$anonfun$applyOrElse$15.apply(HiveTypeCoercion.scala:299) at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5$$anonfun$applyOrElse$15.apply(HiveTypeCoercion.scala:299) at scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:80) at scala.collection.immutable.List.exists(List.scala:84) at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5.applyOrElse(HiveTypeCoercion.scala:299) at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5.applyOrElse(HiveTypeCoercion.scala:298) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionDown$1(QueryPlan.scala:75) at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:85) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:94) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:64) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformAllExpressions$1.applyOrElse(QueryPlan.scala:136) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformAllExpressions$1.applyOrElse(QueryPlan.scala:135) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:242) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at
[jira] [Updated] (SPARK-8092) OneVsRest doesn't allow flexibility in label/ feature column renaming
[ https://issues.apache.org/jira/browse/SPARK-8092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8092: - Fix Version/s: (was: 1.4.1) [~rams] let's not set fix version until it's resolved https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark OneVsRest doesn't allow flexibility in label/ feature column renaming - Key: SPARK-8092 URL: https://issues.apache.org/jira/browse/SPARK-8092 Project: Spark Issue Type: Bug Components: ML Reporter: Ram Sriharsha Assignee: Ram Sriharsha -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8184) date/time function: weekofyear
[ https://issues.apache.org/jira/browse/SPARK-8184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8184: --- Assignee: (was: Apache Spark) date/time function: weekofyear -- Key: SPARK-8184 URL: https://issues.apache.org/jira/browse/SPARK-8184 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin weekofyear(string|date|timestamp): int Returns the week number of a timestamp string: weekofyear(1970-11-01 00:00:00) = 44, weekofyear(1970-11-01) = 44. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8182) date/time function: minute
[ https://issues.apache.org/jira/browse/SPARK-8182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599126#comment-14599126 ] Apache Spark commented on SPARK-8182: - User 'tarekauel' has created a pull request for this issue: https://github.com/apache/spark/pull/6981 date/time function: minute -- Key: SPARK-8182 URL: https://issues.apache.org/jira/browse/SPARK-8182 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin minute(string|date|timestamp): int Returns the minute of the timestamp. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8181) date/time function: hour
[ https://issues.apache.org/jira/browse/SPARK-8181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8181: --- Assignee: Apache Spark date/time function: hour Key: SPARK-8181 URL: https://issues.apache.org/jira/browse/SPARK-8181 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark hour(string|date|timestamp): int Returns the hour of the timestamp: hour('2009-07-30 12:58:59') = 12, hour('12:58:59') = 12. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8565) TF-IDF drops records
[ https://issues.apache.org/jira/browse/SPARK-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599035#comment-14599035 ] PJ Van Aeken commented on SPARK-8565: - Ok, caching the source RDD works. But wouldn't the tf.cache() as described in the documentation of TF-IDF already materialize the ES source? TF-IDF drops records Key: SPARK-8565 URL: https://issues.apache.org/jira/browse/SPARK-8565 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.1 Reporter: PJ Van Aeken When applying TFIDF on an RDD[Seq[String]] with 1213 records, I get an RDD[Vector] back with only 1204 records. This prevents me from zipping it with the original so I can reattach the document ids. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8049) OneVsRest's output includes a temp column
[ https://issues.apache.org/jira/browse/SPARK-8049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8049. -- Resolution: Fixed OneVsRest's output includes a temp column - Key: SPARK-8049 URL: https://issues.apache.org/jira/browse/SPARK-8049 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.4.1, 1.5.0 The temp accumulator column mbc$acc is included in the output which should be removed with withoutColumn. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8214) math function: hex
[ https://issues.apache.org/jira/browse/SPARK-8214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8214: --- Assignee: Apache Spark (was: zhichao-li) math function: hex -- Key: SPARK-8214 URL: https://issues.apache.org/jira/browse/SPARK-8214 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark hex(BIGINT a): string hex(STRING a): string hex(BINARY a): string If the argument is an INT or binary, hex returns the number as a STRING in hexadecimal format. Otherwise if the number is a STRING, it converts each character into its hexadecimal representation and returns the resulting STRING. (See http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_hex, BINARY version as of Hive 0.12.0.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8214) math function: hex
[ https://issues.apache.org/jira/browse/SPARK-8214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14598961#comment-14598961 ] Apache Spark commented on SPARK-8214: - User 'zhichao-li' has created a pull request for this issue: https://github.com/apache/spark/pull/6976 math function: hex -- Key: SPARK-8214 URL: https://issues.apache.org/jira/browse/SPARK-8214 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: zhichao-li hex(BIGINT a): string hex(STRING a): string hex(BINARY a): string If the argument is an INT or binary, hex returns the number as a STRING in hexadecimal format. Otherwise if the number is a STRING, it converts each character into its hexadecimal representation and returns the resulting STRING. (See http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_hex, BINARY version as of Hive 0.12.0.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8214) math function: hex
[ https://issues.apache.org/jira/browse/SPARK-8214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8214: --- Assignee: zhichao-li (was: Apache Spark) math function: hex -- Key: SPARK-8214 URL: https://issues.apache.org/jira/browse/SPARK-8214 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: zhichao-li hex(BIGINT a): string hex(STRING a): string hex(BINARY a): string If the argument is an INT or binary, hex returns the number as a STRING in hexadecimal format. Otherwise if the number is a STRING, it converts each character into its hexadecimal representation and returns the resulting STRING. (See http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_hex, BINARY version as of Hive 0.12.0.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
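The two-sided behavior of hex() described in SPARK-8214 (numbers become hexadecimal, strings are encoded character by character) can be sketched in a few lines; `hive_hex` is a hypothetical name for this illustration:

```python
def hive_hex(value) -> str:
    """Sketch of hex() per SPARK-8214: integers are rendered in
    hexadecimal; strings and binary are hex-encoded byte by byte."""
    if isinstance(value, int):
        return format(value, "X")
    if isinstance(value, str):
        value = value.encode("utf-8")
    return value.hex().upper()

print(hive_hex(255))   # 'FF'
print(hive_hex("AB"))  # '4142' ('A' is 0x41, 'B' is 0x42)
```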
[jira] [Updated] (SPARK-8111) SparkR shell should display Spark logo and version banner on startup
[ https://issues.apache.org/jira/browse/SPARK-8111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8111: - Assignee: Alok Singh SparkR shell should display Spark logo and version banner on startup Key: SPARK-8111 URL: https://issues.apache.org/jira/browse/SPARK-8111 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 1.4.0 Reporter: Matei Zaharia Assignee: Alok Singh Priority: Trivial Labels: Starter -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8111) SparkR shell should display Spark logo and version banner on startup
[ https://issues.apache.org/jira/browse/SPARK-8111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14598937#comment-14598937 ] Sean Owen commented on SPARK-8111: -- [~shivaram] done though I also just made you a JIRA admin, so that you can add Contributors at https://issues.apache.org/jira/plugins/servlet/project-config/SPARK/roles (Just be aware you can now edit lots of things in JIRA so careful what you click!) SparkR shell should display Spark logo and version banner on startup Key: SPARK-8111 URL: https://issues.apache.org/jira/browse/SPARK-8111 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 1.4.0 Reporter: Matei Zaharia Assignee: Alok Singh Priority: Trivial Labels: Starter -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8393) JavaStreamingContext#awaitTermination() throws non-declared InterruptedException
[ https://issues.apache.org/jira/browse/SPARK-8393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599185#comment-14599185 ] Jaromir Vanek edited comment on SPARK-8393 at 6/24/15 9:48 AM: --- I think the suggested workaround is fine for the current 1.x version of Spark. So updating the documentation would be proper solution to prevent other developers from unexpected problems. But in the next major version of Spark it should be fixed properly and {{awaitTerminatio}} method should be declared to throw {{InterruptedException}}. was (Author: vanekjar): I think the suggested workaround is fine for the current 1.x version of Spark. So updating the documentation would be proper solution to prevent other developers from unexpected problems. But in the next major version of Spark it should be fixed properly and `awaitTermination` method should be declared to throw `InterruptedException`. JavaStreamingContext#awaitTermination() throws non-declared InterruptedException Key: SPARK-8393 URL: https://issues.apache.org/jira/browse/SPARK-8393 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.1 Reporter: Jaromir Vanek Priority: Trivial Call to {{JavaStreamingContext#awaitTermination()}} can throw {{InterruptedException}} which cannot be caught easily in Java because it's not declared in {{@throws(classOf[InterruptedException])}} annotation. This {{InterruptedException}} comes originally from {{ContextWaiter}} where Java {{ReentrantLock}} is used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8393) JavaStreamingContext#awaitTermination() throws non-declared InterruptedException
[ https://issues.apache.org/jira/browse/SPARK-8393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599185#comment-14599185 ] Jaromir Vanek edited comment on SPARK-8393 at 6/24/15 9:49 AM: --- I think the suggested workaround is fine for the current 1.x version of Spark. So updating the documentation would be proper solution to prevent other developers from unexpected problems. But in the next major version of Spark it should be fixed properly and {{awaitTermination}} method should be declared to throw {{InterruptedException}}. was (Author: vanekjar): I think the suggested workaround is fine for the current 1.x version of Spark. So updating the documentation would be proper solution to prevent other developers from unexpected problems. But in the next major version of Spark it should be fixed properly and {{awaitTerminatio}} method should be declared to throw {{InterruptedException}}. JavaStreamingContext#awaitTermination() throws non-declared InterruptedException Key: SPARK-8393 URL: https://issues.apache.org/jira/browse/SPARK-8393 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.1 Reporter: Jaromir Vanek Priority: Trivial Call to {{JavaStreamingContext#awaitTermination()}} can throw {{InterruptedException}} which cannot be caught easily in Java because it's not declared in {{@throws(classOf[InterruptedException])}} annotation. This {{InterruptedException}} comes originally from {{ContextWaiter}} where Java {{ReentrantLock}} is used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6912) Support Map&lt;K,V&gt; as a return type in Hive UDF
[ https://issues.apache.org/jira/browse/SPARK-6912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-6912: Affects Version/s: 1.4.0 Support Map&lt;K,V&gt; as a return type in Hive UDF - Key: SPARK-6912 URL: https://issues.apache.org/jira/browse/SPARK-6912 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Takeshi Yamamuro The current implementation can't handle Map&lt;K,V&gt; as a return type in Hive UDF. We assume a UDF below; public class UDFToIntIntMap extends UDF { public Map&lt;Integer, Integer&gt; evaluate(Object o); } Hive supports this type, and see a link below for details; https://github.com/kyluka/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFBridge.java#L163 https://github.com/l1x/apache-hive/blob/master/hive-0.8.1/src/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorFactory.java#L118 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6785) DateUtils can not handle date before 1970/01/01 correctly
[ https://issues.apache.org/jira/browse/SPARK-6785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599278#comment-14599278 ] Apache Spark commented on SPARK-6785: - User 'ckadner' has created a pull request for this issue: https://github.com/apache/spark/pull/6983 DateUtils can not handle date before 1970/01/01 correctly - Key: SPARK-6785 URL: https://issues.apache.org/jira/browse/SPARK-6785 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu Assignee: Christian Kadner {code} scala> val d = new Date(100) d: java.sql.Date = 1969-12-31 scala> DateUtils.toJavaDate(DateUtils.fromJavaDate(d)) res1: java.sql.Date = 1970-01-01 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8175) date/time function: from_unixtime
[ https://issues.apache.org/jira/browse/SPARK-8175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8175: --- Assignee: Apache Spark date/time function: from_unixtime - Key: SPARK-8175 URL: https://issues.apache.org/jira/browse/SPARK-8175 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark from_unixtime(bigint unixtime[, string format]): string Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the format of 1970-01-01 00:00:00. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8174) date/time function: unix_timestamp
[ https://issues.apache.org/jira/browse/SPARK-8174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8174: --- Assignee: (was: Apache Spark) date/time function: unix_timestamp -- Key: SPARK-8174 URL: https://issues.apache.org/jira/browse/SPARK-8174 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin 3 variants: {code} unix_timestamp(): long Gets current Unix timestamp in seconds. unix_timestamp(string|date): long Converts a time string in the format yyyy-MM-dd HH:mm:ss to a Unix timestamp (in seconds), using the default timezone and the default locale; returns 0 on failure: unix_timestamp('2009-03-20 11:30:01') = 1237573801 unix_timestamp(string date, string pattern): long Converts a time string with the given pattern (see [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) to a Unix timestamp (in seconds); returns 0 on failure: unix_timestamp('2009-03-20', 'yyyy-MM-dd') = 1237532400. {code} See: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
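The three unix_timestamp variants above are essentially pattern-based epoch conversions. A minimal sketch of the two string-parsing variants using plain SimpleDateFormat (the helper name and the explicit America/Los_Angeles zone are assumptions; the ticket's example values match a US/Pacific default zone):

```java
import java.text.SimpleDateFormat;
import java.util.TimeZone;

public class UnixTimestampSketch {
    // Hypothetical helper mirroring unix_timestamp(string date, string pattern):
    // parse the string with the given pattern in the given zone, return epoch seconds.
    static long unixTimestamp(String s, String pattern, TimeZone tz) throws Exception {
        SimpleDateFormat fmt = new SimpleDateFormat(pattern);
        fmt.setTimeZone(tz);
        return fmt.parse(s).getTime() / 1000L;
    }

    public static void main(String[] args) throws Exception {
        TimeZone pacific = TimeZone.getTimeZone("America/Los_Angeles");
        // Matches the ticket's example: unix_timestamp('2009-03-20 11:30:01') = 1237573801
        System.out.println(unixTimestamp("2009-03-20 11:30:01", "yyyy-MM-dd HH:mm:ss", pacific));
        // Date-only pattern: midnight local time on 2009-03-20 = 1237532400
        System.out.println(unixTimestamp("2009-03-20", "yyyy-MM-dd", pacific));
    }
}
```

Note that the one-argument variant is timezone-dependent: the same string yields a different epoch under a different default zone, which is why this sketch pins the zone explicitly.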
[jira] [Assigned] (SPARK-8188) date/time function: from_utc_timestamp
[ https://issues.apache.org/jira/browse/SPARK-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8188: --- Assignee: (was: Apache Spark) date/time function: from_utc_timestamp -- Key: SPARK-8188 URL: https://issues.apache.org/jira/browse/SPARK-8188 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin from_utc_timestamp(timestamp, string timezone): timestamp Assumes given timestamp is UTC and converts to given timezone (as of Hive 0.8.0). For example, from_utc_timestamp('1970-01-01 08:00:00','PST') returns 1970-01-01 00:00:00. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8191) date/time function: to_utc_timestamp
[ https://issues.apache.org/jira/browse/SPARK-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8191: --- Assignee: Apache Spark date/time function: to_utc_timestamp Key: SPARK-8191 URL: https://issues.apache.org/jira/browse/SPARK-8191 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark to_utc_timestamp(timestamp, string timezone): timestamp Assumes given timestamp is in given timezone and converts to UTC (as of Hive 0.8.0). For example, to_utc_timestamp('1970-01-01 00:00:00','PST') returns 1970-01-01 08:00:00. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8188) date/time function: from_utc_timestamp
[ https://issues.apache.org/jira/browse/SPARK-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599315#comment-14599315 ] Apache Spark commented on SPARK-8188: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/6984 date/time function: from_utc_timestamp -- Key: SPARK-8188 URL: https://issues.apache.org/jira/browse/SPARK-8188 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin from_utc_timestamp(timestamp, string timezone): timestamp Assumes given timestamp is UTC and converts to given timezone (as of Hive 0.8.0). For example, from_utc_timestamp('1970-01-01 08:00:00','PST') returns 1970-01-01 00:00:00. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8175) date/time function: from_unixtime
[ https://issues.apache.org/jira/browse/SPARK-8175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8175: --- Assignee: (was: Apache Spark) date/time function: from_unixtime - Key: SPARK-8175 URL: https://issues.apache.org/jira/browse/SPARK-8175 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin from_unixtime(bigint unixtime[, string format]): string Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the format of 1970-01-01 00:00:00. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8175) date/time function: from_unixtime
[ https://issues.apache.org/jira/browse/SPARK-8175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599314#comment-14599314 ] Apache Spark commented on SPARK-8175: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/6984 date/time function: from_unixtime - Key: SPARK-8175 URL: https://issues.apache.org/jira/browse/SPARK-8175 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin from_unixtime(bigint unixtime[, string format]): string Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the format of 1970-01-01 00:00:00. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8174) date/time function: unix_timestamp
[ https://issues.apache.org/jira/browse/SPARK-8174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8174: --- Assignee: Apache Spark date/time function: unix_timestamp -- Key: SPARK-8174 URL: https://issues.apache.org/jira/browse/SPARK-8174 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark 3 variants: {code} unix_timestamp(): long Gets current Unix timestamp in seconds. unix_timestamp(string|date): long Converts a time string in the format yyyy-MM-dd HH:mm:ss to a Unix timestamp (in seconds), using the default timezone and the default locale; returns 0 on failure: unix_timestamp('2009-03-20 11:30:01') = 1237573801 unix_timestamp(string date, string pattern): long Converts a time string with the given pattern (see [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) to a Unix timestamp (in seconds); returns 0 on failure: unix_timestamp('2009-03-20', 'yyyy-MM-dd') = 1237532400. {code} See: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8191) date/time function: to_utc_timestamp
[ https://issues.apache.org/jira/browse/SPARK-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599316#comment-14599316 ] Apache Spark commented on SPARK-8191: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/6984 date/time function: to_utc_timestamp Key: SPARK-8191 URL: https://issues.apache.org/jira/browse/SPARK-8191 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin to_utc_timestamp(timestamp, string timezone): timestamp Assumes given timestamp is in given timezone and converts to UTC (as of Hive 0.8.0). For example, to_utc_timestamp('1970-01-01 00:00:00','PST') returns 1970-01-01 08:00:00. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8188) date/time function: from_utc_timestamp
[ https://issues.apache.org/jira/browse/SPARK-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8188: --- Assignee: Apache Spark date/time function: from_utc_timestamp -- Key: SPARK-8188 URL: https://issues.apache.org/jira/browse/SPARK-8188 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark from_utc_timestamp(timestamp, string timezone): timestamp Assumes given timestamp is UTC and converts to given timezone (as of Hive 0.8.0). For example, from_utc_timestamp('1970-01-01 08:00:00','PST') returns 1970-01-01 00:00:00. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8191) date/time function: to_utc_timestamp
[ https://issues.apache.org/jira/browse/SPARK-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8191: --- Assignee: (was: Apache Spark) date/time function: to_utc_timestamp Key: SPARK-8191 URL: https://issues.apache.org/jira/browse/SPARK-8191 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin to_utc_timestamp(timestamp, string timezone): timestamp Assumes given timestamp is in given timezone and converts to UTC (as of Hive 0.8.0). For example, to_utc_timestamp('1970-01-01 00:00:00','PST') returns 1970-01-01 08:00:00. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
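Both from_utc_timestamp and to_utc_timestamp described above are wall-clock reinterpretations: the timestamp is read as an instant in one zone and rendered in another. A rough sketch of these semantics with SimpleDateFormat (the `shift` helper is an assumption, not Spark or Hive API):

```java
import java.text.SimpleDateFormat;
import java.util.TimeZone;

public class UtcTimestampSketch {
    // Hypothetical helper: interpret ts as wall-clock time in fromZone,
    // then format the same instant as wall-clock time in toZone.
    static String shift(String ts, String fromZone, String toZone) throws Exception {
        SimpleDateFormat in = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        in.setTimeZone(TimeZone.getTimeZone(fromZone));
        SimpleDateFormat out = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        out.setTimeZone(TimeZone.getTimeZone(toZone));
        return out.format(in.parse(ts));
    }

    public static void main(String[] args) throws Exception {
        // to_utc_timestamp('1970-01-01 00:00:00','PST') -> 1970-01-01 08:00:00
        System.out.println(shift("1970-01-01 00:00:00", "PST", "UTC"));
        // from_utc_timestamp('1970-01-01 08:00:00','PST') -> 1970-01-01 00:00:00
        System.out.println(shift("1970-01-01 08:00:00", "UTC", "PST"));
    }
}
```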
[jira] [Created] (SPARK-8590) add code gen for ExtractValue
Wenchen Fan created SPARK-8590: -- Summary: add code gen for ExtractValue Key: SPARK-8590 URL: https://issues.apache.org/jira/browse/SPARK-8590 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8590) add code gen for ExtractValue
[ https://issues.apache.org/jira/browse/SPARK-8590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599183#comment-14599183 ] Apache Spark commented on SPARK-8590: - User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/6982 add code gen for ExtractValue - Key: SPARK-8590 URL: https://issues.apache.org/jira/browse/SPARK-8590 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8393) JavaStreamingContext#awaitTermination() throws non-declared InterruptedException
[ https://issues.apache.org/jira/browse/SPARK-8393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599185#comment-14599185 ] Jaromir Vanek commented on SPARK-8393: -- I think the suggested workaround is fine for the current 1.x version of Spark, so updating the documentation would be a proper solution to prevent other developers from running into unexpected problems. But in the next major version of Spark it should be fixed properly, and the {{awaitTermination}} method should be declared to throw {{InterruptedException}}. JavaStreamingContext#awaitTermination() throws non-declared InterruptedException Key: SPARK-8393 URL: https://issues.apache.org/jira/browse/SPARK-8393 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.1 Reporter: Jaromir Vanek Priority: Trivial Call to {{JavaStreamingContext#awaitTermination()}} can throw {{InterruptedException}}, which cannot be caught easily in Java because it is not declared in a {{@throws(classOf[InterruptedException])}} annotation. This {{InterruptedException}} originates in {{ContextWaiter}}, where a Java {{ReentrantLock}} is used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8590) add code gen for ExtractValue
[ https://issues.apache.org/jira/browse/SPARK-8590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8590: --- Assignee: (was: Apache Spark) add code gen for ExtractValue - Key: SPARK-8590 URL: https://issues.apache.org/jira/browse/SPARK-8590 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8590) add code gen for ExtractValue
[ https://issues.apache.org/jira/browse/SPARK-8590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8590: --- Assignee: Apache Spark add code gen for ExtractValue - Key: SPARK-8590 URL: https://issues.apache.org/jira/browse/SPARK-8590 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8195) date/time function: last_day
[ https://issues.apache.org/jira/browse/SPARK-8195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8195: --- Assignee: Apache Spark date/time function: last_day Key: SPARK-8195 URL: https://issues.apache.org/jira/browse/SPARK-8195 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark last_day(string date): string last_day(date date): date Returns the last day of the month which the date belongs to (as of Hive 1.1.0). date is a string in the format 'yyyy-MM-dd HH:mm:ss' or 'yyyy-MM-dd'. The time part of date is ignored. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8195) date/time function: last_day
[ https://issues.apache.org/jira/browse/SPARK-8195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599338#comment-14599338 ] Apache Spark commented on SPARK-8195: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/6986 date/time function: last_day Key: SPARK-8195 URL: https://issues.apache.org/jira/browse/SPARK-8195 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin last_day(string date): string last_day(date date): date Returns the last day of the month which the date belongs to (as of Hive 1.1.0). date is a string in the format 'yyyy-MM-dd HH:mm:ss' or 'yyyy-MM-dd'. The time part of date is ignored. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8195) date/time function: last_day
[ https://issues.apache.org/jira/browse/SPARK-8195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8195: --- Assignee: (was: Apache Spark) date/time function: last_day Key: SPARK-8195 URL: https://issues.apache.org/jira/browse/SPARK-8195 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin last_day(string date): string last_day(date date): date Returns the last day of the month which the date belongs to (as of Hive 1.1.0). date is a string in the format 'yyyy-MM-dd HH:mm:ss' or 'yyyy-MM-dd'. The time part of date is ignored. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
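The last_day semantics described above map directly onto java.time. A small sketch (class name assumed; this is not the ticket's implementation):

```java
import java.time.LocalDate;
import java.time.temporal.TemporalAdjusters;

public class LastDaySketch {
    public static void main(String[] args) {
        // last_day for a non-leap February; the ticket's function also accepts
        // a 'yyyy-MM-dd HH:mm:ss' string and ignores the time part.
        LocalDate d = LocalDate.parse("2015-02-10");
        System.out.println(d.with(TemporalAdjusters.lastDayOfMonth())); // 2015-02-28
    }
}
```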
[jira] [Assigned] (SPARK-8196) date/time function: next_day
[ https://issues.apache.org/jira/browse/SPARK-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8196: --- Assignee: (was: Apache Spark) date/time function: next_day - Key: SPARK-8196 URL: https://issues.apache.org/jira/browse/SPARK-8196 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin next_day(string start_date, string day_of_week): string next_day(date start_date, string day_of_week): string Returns the first date which is later than start_date and named as day_of_week (as of Hive 1.2.0). start_date is a string/date/timestamp. day_of_week is 2 letters, 3 letters or full name of the day of the week (e.g. Mo, tue, FRIDAY). The time part of start_date is ignored. Example: next_day('2015-01-14', 'TU') = 2015-01-20. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8196) date/time function: next_day
[ https://issues.apache.org/jira/browse/SPARK-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8196: --- Assignee: Apache Spark date/time function: next_day - Key: SPARK-8196 URL: https://issues.apache.org/jira/browse/SPARK-8196 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark next_day(string start_date, string day_of_week): string next_day(date start_date, string day_of_week): string Returns the first date which is later than start_date and named as day_of_week (as of Hive 1.2.0). start_date is a string/date/timestamp. day_of_week is 2 letters, 3 letters or full name of the day of the week (e.g. Mo, tue, FRIDAY). The time part of start_date is ignored. Example: next_day('2015-01-14', 'TU') = 2015-01-20. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8196) date/time function: next_day
[ https://issues.apache.org/jira/browse/SPARK-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599339#comment-14599339 ] Apache Spark commented on SPARK-8196: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/6986 date/time function: next_day - Key: SPARK-8196 URL: https://issues.apache.org/jira/browse/SPARK-8196 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin next_day(string start_date, string day_of_week): string next_day(date start_date, string day_of_week): string Returns the first date which is later than start_date and named as day_of_week (as of Hive 1.2.0). start_date is a string/date/timestamp. day_of_week is 2 letters, 3 letters or full name of the day of the week (e.g. Mo, tue, FRIDAY). The time part of start_date is ignored. Example: next_day('2015-01-14', 'TU') = 2015-01-20. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
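The next_day behavior above ("first date later than start_date" with the named weekday) also has a direct java.time analogue via a strictly-after adjuster. A minimal sketch (class name assumed; weekday-abbreviation parsing such as 'TU' is omitted):

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.temporal.TemporalAdjusters;

public class NextDaySketch {
    public static void main(String[] args) {
        // next_day('2015-01-14', 'TU') = 2015-01-20: the first Tuesday
        // strictly after the given date (2015-01-14 is a Wednesday).
        LocalDate start = LocalDate.parse("2015-01-14");
        System.out.println(start.with(TemporalAdjusters.next(DayOfWeek.TUESDAY)));
    }
}
```

TemporalAdjusters.next is strictly-after, matching the "later than start_date" wording: if start is itself a Tuesday, the result is one week later.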
[jira] [Commented] (SPARK-8540) KMeans-based outlier detection
[ https://issues.apache.org/jira/browse/SPARK-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599188#comment-14599188 ] Gurjot Singh commented on SPARK-8540: - Can you please elaborate on what (b) does? Will it simply return the specified number of outliers/data points that are farthest from their cluster mean, even if they are not outliers in statistical terms? KMeans-based outlier detection -- Key: SPARK-8540 URL: https://issues.apache.org/jira/browse/SPARK-8540 Project: Spark Issue Type: Sub-task Components: ML Reporter: Joseph K. Bradley Original Estimate: 336h Remaining Estimate: 336h Proposal for K-Means-based outlier detection: * Cluster data using K-Means * Provide prediction/filtering functionality which returns outliers/anomalies ** This can take some threshold parameter which specifies either (a) how far off a point needs to be to be considered an outlier or (b) how many outliers should be returned. Note this will require a bit of API design, which should probably be posted and discussed on this JIRA before implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6747) Support List as a return type in Hive UDF
[ https://issues.apache.org/jira/browse/SPARK-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-6747: Affects Version/s: 1.4.0 Support List as a return type in Hive UDF --- Key: SPARK-6747 URL: https://issues.apache.org/jira/browse/SPARK-6747 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Takeshi Yamamuro Labels: 1.5.0 The current implementation can't handle List as a return type in Hive UDF. We assume a UDF like the one below: {code} public class UDFToListString extends UDF { public List<String> evaluate(Object o) { return Arrays.asList("xxx", "yyy", "zzz"); } } {code} A scala.MatchError is thrown as follows when the UDF is used: scala.MatchError: interface java.util.List (of class java.lang.Class) at org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174) at org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76) at org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106) at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106) at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131) at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95) at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:94) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) at scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278) ... To fix this problem, we need to add an entry for List in HiveInspectors#javaClassToDataType. However, this has one difficulty because of type erasure in the JVM. 
We assume that the lines below are appended in HiveInspectors#javaClassToDataType: {code} // list type case c: Class[_] if c == classOf[java.util.List[java.lang.Object]] => val tpe = c.getGenericInterfaces()(0).asInstanceOf[ParameterizedType] println(tpe.getActualTypeArguments()(0).toString()) // prints 'E' {code} This logic fails to catch the component type of the List. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6747) Support List as a return type in Hive UDF
[ https://issues.apache.org/jira/browse/SPARK-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-6747: Labels: 1.5.0 (was: ) Support List as a return type in Hive UDF --- Key: SPARK-6747 URL: https://issues.apache.org/jira/browse/SPARK-6747 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Takeshi Yamamuro Labels: 1.5.0 The current implementation can't handle List as a return type in Hive UDF. We assume a UDF like the one below: {code} public class UDFToListString extends UDF { public List<String> evaluate(Object o) { return Arrays.asList("xxx", "yyy", "zzz"); } } {code} A scala.MatchError is thrown as follows when the UDF is used: scala.MatchError: interface java.util.List (of class java.lang.Class) at org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174) at org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76) at org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106) at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106) at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131) at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95) at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:94) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) at scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278) ... To fix this problem, we need to add an entry for List in HiveInspectors#javaClassToDataType. However, this has one difficulty because of type erasure in the JVM. 
We assume that the lines below are appended in HiveInspectors#javaClassToDataType: {code} // list type case c: Class[_] if c == classOf[java.util.List[java.lang.Object]] => val tpe = c.getGenericInterfaces()(0).asInstanceOf[ParameterizedType] println(tpe.getActualTypeArguments()(0).toString()) // prints 'E' {code} This logic fails to catch the component type of the List. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
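The erasure problem described above can be seen with plain reflection: the runtime Class of a returned List has lost its type argument (only the type variable 'E' remains), while the declared return type of the UDF's evaluate method still carries it. A minimal, hypothetical demo (not the ticket's fix, just an illustration of where the type argument survives):

```java
import java.lang.reflect.Method;
import java.lang.reflect.ParameterizedType;
import java.util.Arrays;
import java.util.List;

public class ErasureDemo {
    public List<String> evaluate(Object o) { return Arrays.asList("xxx"); }

    public static void main(String[] args) throws Exception {
        // The runtime Class of the returned value has only the erased type
        // variable left, so it cannot tell us the element type.
        List<String> v = new ErasureDemo().evaluate(null);
        System.out.println(v.getClass().getTypeParameters()[0]); // the variable name, e.g. E

        // But the declared return type of the method still carries List<String>.
        Method m = ErasureDemo.class.getMethod("evaluate", Object.class);
        ParameterizedType t = (ParameterizedType) m.getGenericReturnType();
        System.out.println(t.getActualTypeArguments()[0]); // class java.lang.String
    }
}
```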
[jira] [Commented] (SPARK-8192) date/time function: current_date
[ https://issues.apache.org/jira/browse/SPARK-8192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599318#comment-14599318 ] Apache Spark commented on SPARK-8192: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/6985 date/time function: current_date Key: SPARK-8192 URL: https://issues.apache.org/jira/browse/SPARK-8192 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin current_date(): date Returns the current date at the start of query evaluation (as of Hive 1.2.0). All calls of current_date within the same query return the same value. We should just replace this with a date literal in the optimizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8193) date/time function: current_timestamp
[ https://issues.apache.org/jira/browse/SPARK-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8193: --- Assignee: Apache Spark date/time function: current_timestamp - Key: SPARK-8193 URL: https://issues.apache.org/jira/browse/SPARK-8193 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark current_timestamp(): timestamp Returns the current timestamp at the start of query evaluation (as of Hive 1.2.0). All calls of current_timestamp within the same query return the same value. We should just replace this with a timestamp literal in the optimizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8193) date/time function: current_timestamp
[ https://issues.apache.org/jira/browse/SPARK-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599319#comment-14599319 ] Apache Spark commented on SPARK-8193: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/6985 date/time function: current_timestamp - Key: SPARK-8193 URL: https://issues.apache.org/jira/browse/SPARK-8193 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin current_timestamp(): timestamp Returns the current timestamp at the start of query evaluation (as of Hive 1.2.0). All calls of current_timestamp within the same query return the same value. We should just replace this with a timestamp literal in the optimizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8192) date/time function: current_date
[ https://issues.apache.org/jira/browse/SPARK-8192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8192: --- Assignee: (was: Apache Spark) date/time function: current_date Key: SPARK-8192 URL: https://issues.apache.org/jira/browse/SPARK-8192 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin current_date(): date Returns the current date at the start of query evaluation (as of Hive 1.2.0). All calls of current_date within the same query return the same value. We should just replace this with a date literal in the optimizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8192) date/time function: current_date
[ https://issues.apache.org/jira/browse/SPARK-8192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8192: --- Assignee: Apache Spark date/time function: current_date Key: SPARK-8192 URL: https://issues.apache.org/jira/browse/SPARK-8192 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark current_date(): date Returns the current date at the start of query evaluation (as of Hive 1.2.0). All calls of current_date within the same query return the same value. We should just replace this with a date literal in the optimizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8193) date/time function: current_timestamp
[ https://issues.apache.org/jira/browse/SPARK-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8193: --- Assignee: (was: Apache Spark) date/time function: current_timestamp - Key: SPARK-8193 URL: https://issues.apache.org/jira/browse/SPARK-8193 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin current_timestamp(): timestamp Returns the current timestamp at the start of query evaluation (as of Hive 1.2.0). All calls of current_timestamp within the same query return the same value. We should just replace this with a timestamp literal in the optimizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
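[Editor's note] The SPARK-8192/8193 descriptions propose evaluating current_date/current_timestamp once and substituting a literal during optimization, so every call in a query sees the same value. Spark's actual optimizer rule lives in Catalyst (Scala); the following is only an illustrative Python sketch with a toy expression tree and hypothetical class names, not Spark code:

```python
import time

# Hypothetical stand-ins for Catalyst expression nodes.
class CurrentTimestamp:
    """A call that must be pinned to a single value per query."""

class Literal:
    def __init__(self, value):
        self.value = value

class Plan:
    def __init__(self, expressions):
        self.expressions = expressions

def replace_current_timestamp(plan):
    """Evaluate the timestamp once, at the start of optimization, and
    substitute the same Literal everywhere the call appears."""
    now = Literal(time.time())
    return Plan([now if isinstance(e, CurrentTimestamp) else e
                 for e in plan.expressions])

plan = Plan([CurrentTimestamp(), Literal(1), CurrentTimestamp()])
optimized = replace_current_timestamp(plan)
# Both occurrences now share the identical literal value.
assert optimized.expressions[0].value == optimized.expressions[2].value
```

The key property is that the substitution happens once per query plan, so repeated calls cannot observe different clock readings mid-query.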
[jira] [Assigned] (SPARK-8484) Add TrainValidationSplit to ml.tuning
[ https://issues.apache.org/jira/browse/SPARK-8484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8484: --- Assignee: Martin Zapletal (was: Apache Spark) Add TrainValidationSplit to ml.tuning - Key: SPARK-8484 URL: https://issues.apache.org/jira/browse/SPARK-8484 Project: Spark Issue Type: New Feature Components: ML Reporter: Xiangrui Meng Assignee: Martin Zapletal Add TrainValidationSplit for hyper-parameter tuning. It randomly splits the input dataset into training and validation sets and uses the evaluation metric on the validation set to select the best model. It should be similar to CrossValidator, but simpler and less expensive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8484) Add TrainValidationSplit to ml.tuning
[ https://issues.apache.org/jira/browse/SPARK-8484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600133#comment-14600133 ] Apache Spark commented on SPARK-8484: - User 'zapletal-martin' has created a pull request for this issue: https://github.com/apache/spark/pull/6996 Add TrainValidationSplit to ml.tuning - Key: SPARK-8484 URL: https://issues.apache.org/jira/browse/SPARK-8484 Project: Spark Issue Type: New Feature Components: ML Reporter: Xiangrui Meng Assignee: Martin Zapletal Add TrainValidationSplit for hyper-parameter tuning. It randomly splits the input dataset into training and validation sets and uses the evaluation metric on the validation set to select the best model. It should be similar to CrossValidator, but simpler and less expensive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
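[Editor's note] The TrainValidationSplit proposal amounts to: make one random train/validation split, fit each hyper-parameter candidate on the training portion, and keep the candidate that scores best on the held-out portion (cheaper than CrossValidator's k folds). A minimal Python sketch of that selection loop, using a toy model rather than ml.tuning's actual Estimator/Evaluator interfaces:

```python
import random

def train_validation_split(data, labels, ratio=0.75, seed=42):
    """Randomly split (x, y) pairs into training and validation sets."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    cut = int(len(idx) * ratio)
    train = [(data[i], labels[i]) for i in idx[:cut]]
    valid = [(data[i], labels[i]) for i in idx[cut:]]
    return train, valid

def select_best(candidates, train, valid, fit, score):
    """Fit each candidate once on the training split; keep the model with
    the best validation score. One split, unlike cross-validation."""
    best = None
    for params in candidates:
        model = fit(train, params)
        s = score(model, valid)
        if best is None or s > best[0]:
            best = (s, params, model)
    return best

# Toy model: predict y = w * x, "tuning" the weight w directly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # exactly y = 2x
train, valid = train_validation_split(xs, ys)
fit = lambda data, w: w  # fitting is a no-op for this toy
score = lambda w, data: -sum((w * x - y) ** 2 for x, y in data)
best_score, best_w, _ = select_best([0.5, 1.0, 2.0, 3.0], train, valid, fit, score)
assert best_w == 2.0  # the candidate matching the data wins
```

The design trade-off the JIRA mentions falls out directly: each candidate is trained once instead of k times, at the cost of a noisier estimate of validation performance.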
[jira] [Updated] (SPARK-8575) Deprecate callUDF in favor of udf
[ https://issues.apache.org/jira/browse/SPARK-8575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-8575: Shepherd: Michael Armbrust Deprecate callUDF in favor of udf - Key: SPARK-8575 URL: https://issues.apache.org/jira/browse/SPARK-8575 Project: Spark Issue Type: Improvement Components: SQL Reporter: Benjamin Fradet Assignee: Benjamin Fradet Priority: Minor Fix For: 1.5.0 Follow-up of [SPARK-8356|https://issues.apache.org/jira/browse/SPARK-8356] to use {{udf}} in favor of {{callUDF}} wherever possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8484) Add TrainValidationSplit to ml.tuning
[ https://issues.apache.org/jira/browse/SPARK-8484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8484: --- Assignee: Apache Spark (was: Martin Zapletal) Add TrainValidationSplit to ml.tuning - Key: SPARK-8484 URL: https://issues.apache.org/jira/browse/SPARK-8484 Project: Spark Issue Type: New Feature Components: ML Reporter: Xiangrui Meng Assignee: Apache Spark Add TrainValidationSplit for hyper-parameter tuning. It randomly splits the input dataset into training and validation sets and uses the evaluation metric on the validation set to select the best model. It should be similar to CrossValidator, but simpler and less expensive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8575) Deprecate callUDF in favor of udf
[ https://issues.apache.org/jira/browse/SPARK-8575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-8575: Assignee: Benjamin Fradet Deprecate callUDF in favor of udf - Key: SPARK-8575 URL: https://issues.apache.org/jira/browse/SPARK-8575 Project: Spark Issue Type: Improvement Components: SQL Reporter: Benjamin Fradet Assignee: Benjamin Fradet Priority: Minor Fix For: 1.5.0 Follow-up of [SPARK-8356|https://issues.apache.org/jira/browse/SPARK-8356] to use {{udf}} in favor of {{callUDF}} wherever possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5133) Feature Importance for Decision Tree (Ensembles)
[ https://issues.apache.org/jira/browse/SPARK-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5133: - Target Version/s: 1.5.0 Remaining Estimate: 168h Original Estimate: 168h Feature Importance for Decision Tree (Ensembles) Key: SPARK-5133 URL: https://issues.apache.org/jira/browse/SPARK-5133 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Peter Prettenhofer Original Estimate: 168h Remaining Estimate: 168h Add feature importance to decision tree model and tree ensemble models. If people are interested in this feature I could implement it given a mentor (API decisions, etc). Please find a description of the feature below: Decision trees intrinsically perform feature selection by selecting appropriate split points. This information can be used to assess the relative importance of a feature. Relative feature importance gives valuable insight into a decision tree or tree ensemble and can even be used for feature selection. More information on feature importance (via decrease in impurity) can be found in ESLII (10.13.1) or here [1]. R's randomForest package uses a different technique for assessing variable importance that is based on permutation tests. All necessary information to create relative importance scores should be available in the tree representation (class Node; split, impurity gain, (weighted) nr of samples?). [1] http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8445) MLlib 1.5 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-8445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-8445: - Description: We expect to see many MLlib contributors for the 1.5 release. To scale out the development, we created this master list for MLlib features we plan to have in Spark 1.5. Please view this list as a wish list rather than a concrete plan, because we don't have an accurate estimate of available resources. Due to limited review bandwidth, features appearing on this list will get higher priority during code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated. h1. Instructions h2. For contributors: * Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important. * If you are a first-time Spark contributor, please always start with a [starter task|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20labels%20%3D%20starter%20AND%20%22Target%20Version%2Fs%22%20%3D%201.5.0] rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delays in code review. * Never work silently. Let everyone know on the corresponding JIRA page when you start working on some features. This is to avoid duplicate work. For small features, you don't need to wait to get the JIRA assigned. * For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there is no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors. * Do not claim multiple (3) JIRAs at the same time. Try to finish them one after another. 
* Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps improve others' code as well as yours. h2. For committers: * Try to break down big features into small and specific JIRA tasks and link them properly. * Add starter label to starter tasks. * Put a rough estimate for medium/big features and track the progress. * After merging a PR, create and link JIRAs for Python, example code, and documentation if necessary. h1. Roadmap (WIP) This is NOT [a complete list of MLlib JIRAs for 1.5|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20%22Target%20Version%2Fs%22%20%3D%201.5.0%20ORDER%20BY%20priority%20DESC]. We only include umbrella JIRAs and high-level tasks. h2. Algorithms and performance * LDA improvements (SPARK-5572) * Log-linear model for survival analysis (SPARK-8518) * Improve GLM's scalability on number of features (SPARK-8520) * Tree and ensembles: Move + cleanup code (SPARK-7131), provide class probabilities (SPARK-3727), feature importance (SPARK-5133) * Improve GMM scalability and stability (SPARK-7206) * Frequent itemsets improvements (SPARK-7211) * R-like stats for ML models (SPARK-7674) h2. Pipeline API * more feature transformers (SPARK-8521) * k-means (SPARK-7879) * naive Bayes (SPARK-8600) h2. Model persistence * more PMML export (SPARK-8545) * model save/load (SPARK-4587) * pipeline persistence (SPARK-6725) h2. Python API for ML * List of issues identified during Spark 1.4 QA: (SPARK-7536) h2. SparkR API for ML h2. Documentation * [Search for documentation improvements | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(Documentation)%20AND%20component%20in%20(ML%2C%20MLlib)]
[jira] [Commented] (SPARK-5133) Feature Importance for Decision Tree (Ensembles)
[ https://issues.apache.org/jira/browse/SPARK-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600140#comment-14600140 ] Joseph K. Bradley commented on SPARK-5133: -- It's high time we add this to MLlib, so I'm adding this to the 1.5 roadmap. [~peter.prettenhofer] If you are still interested in this, please feel free to take it. Or if others are interested, please comment on this JIRA. The initial API should be quite simple; I'm imagining a single method returning importance for each feature, modeled after what R or other libraries return. I think we should calculate importance based on the learned model. The permutation test would be nice in the future but would be much more expensive (shuffling data). Feature Importance for Decision Tree (Ensembles) Key: SPARK-5133 URL: https://issues.apache.org/jira/browse/SPARK-5133 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Peter Prettenhofer Add feature importance to decision tree model and tree ensemble models. If people are interested in this feature I could implement it given a mentor (API decisions, etc). Please find a description of the feature below: Decision trees intrinsically perform feature selection by selecting appropriate split points. This information can be used to assess the relative importance of a feature. Relative feature importance gives valuable insight into a decision tree or tree ensemble and can even be used for feature selection. More information on feature importance (via decrease in impurity) can be found in ESLII (10.13.1) or here [1]. R's randomForest package uses a different technique for assessing variable importance that is based on permutation tests. All necessary information to create relative importance scores should be available in the tree representation (class Node; split, impurity gain, (weighted) nr of samples?). 
[1] http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
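[Editor's note] The SPARK-5133 discussion says all the needed information is already in the tree representation: each split's feature, its impurity gain, and the (weighted) number of samples reaching the node. A minimal Python sketch of the impurity-decrease importance described in the issue (the `Node` class and field names here are hypothetical stand-ins, not MLlib's actual tree classes):

```python
class Node:
    """Minimal stand-in for a fitted tree node: internal nodes record the
    split feature, the impurity decrease of the split, and the (weighted)
    number of training samples that reached them."""
    def __init__(self, feature=None, gain=0.0, n_samples=0, left=None, right=None):
        self.feature, self.gain, self.n_samples = feature, gain, n_samples
        self.left, self.right = left, right

def feature_importances(root, n_features):
    """Sum sample-weighted impurity decreases per split feature, then
    normalize so importances sum to 1 (the scikit-learn convention)."""
    total = root.n_samples
    imp = [0.0] * n_features
    stack = [root]
    while stack:
        node = stack.pop()
        if node.feature is not None:  # internal (split) node
            imp[node.feature] += node.n_samples / total * node.gain
            stack.extend(n for n in (node.left, node.right) if n)
    s = sum(imp)
    return [v / s for v in imp] if s > 0 else imp

# Tiny fitted tree: feature 0 splits at the root, feature 1 deeper down.
tree = Node(feature=0, gain=0.5, n_samples=100,
            left=Node(feature=1, gain=0.2, n_samples=60),
            right=Node(n_samples=40))
imps = feature_importances(tree, 2)
assert abs(sum(imps) - 1.0) < 1e-9
assert imps[0] > imps[1]  # the root split contributes more
```

For an ensemble, the natural extension is to average per-tree importance vectors; the permutation-test approach from R's randomForest would instead require re-scoring shuffled data, which is why the comment calls it much more expensive.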
[jira] [Updated] (SPARK-7244) Find vertex sequences satisfying predicates
[ https://issues.apache.org/jira/browse/SPARK-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7244: - Description: It would be useful to be able to search graphs (efficiently) based on a sequence of predicates, and to return matching contiguous subsequences of vertices. The returned info should probably be an RDD. This could also be called motif-finding. (was: It would be useful to be able to search graphs (efficiently) based on a sequence of predicates, and to return matching contiguous subsequences of vertices. The returned info should probably be an RDD.) Find vertex sequences satisfying predicates --- Key: SPARK-7244 URL: https://issues.apache.org/jira/browse/SPARK-7244 Project: Spark Issue Type: New Feature Components: GraphX Reporter: Joseph K. Bradley Priority: Minor It would be useful to be able to search graphs (efficiently) based on a sequence of predicates, and to return matching contiguous subsequences of vertices. The returned info should probably be an RDD. This could also be called motif-finding. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
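[Editor's note] The motif-finding idea in SPARK-7244 is: given a sequence of k vertex predicates, return every contiguous path v1 -> ... -> vk whose i-th vertex satisfies the i-th predicate. A single-machine Python sketch over a plain adjacency list (function and variable names are illustrative; a GraphX version would distribute this, e.g. as iterative message passing over edge triplets, returning the matches as an RDD):

```python
def find_vertex_sequences(adj, attrs, predicates):
    """Return all contiguous vertex sequences matching the predicate list
    via depth-first search. adj: vertex -> neighbor list; attrs: vertex ->
    attribute checked by each predicate."""
    k = len(predicates)
    results = []

    def extend(path):
        i = len(path)
        if i == k:
            results.append(list(path))
            return
        # First step may start anywhere; later steps must follow an edge.
        candidates = adj.get(path[-1], []) if path else list(adj)
        for v in candidates:
            if predicates[i](attrs[v]):
                path.append(v)
                extend(path)
                path.pop()

    extend([])
    return results

# Toy graph: a -> b -> c and a -> c; find user -> admin -> user chains.
adj = {"a": ["b", "c"], "b": ["c"], "c": []}
attrs = {"a": "user", "b": "admin", "c": "user"}
seqs = find_vertex_sequences(adj, attrs,
                             [lambda r: r == "user",
                              lambda r: r == "admin",
                              lambda r: r == "user"])
assert seqs == [["a", "b", "c"]]
```

The DFS makes the "contiguous subsequence" requirement concrete: each predicate after the first is only tested against neighbors of the previous match, so every result is an actual path in the graph.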