[jira] [Updated] (SPARK-8561) Drop table can only drop the tables under database default
[ https://issues.apache.org/jira/browse/SPARK-8561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8561: - Component/s: SQL Drop table can only drop the tables under database default Key: SPARK-8561 URL: https://issues.apache.org/jira/browse/SPARK-8561 Project: Spark Issue Type: Bug Components: SQL Reporter: baishuo -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8588) Could not use concat with UDF in where clause
[ https://issues.apache.org/jira/browse/SPARK-8588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai updated SPARK-8588: Priority: Critical (was: Blocker) Could not use concat with UDF in where clause - Key: SPARK-8588 URL: https://issues.apache.org/jira/browse/SPARK-8588 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Environment: CentOS 7, Java 1.7.0_67, Scala 2.10.5, run in a Spark standalone cluster (or local). Reporter: StanZhai Priority: Critical After upgrading the cluster from Spark 1.3.1 to 1.4.0 (rc4), I encountered the following exception when using concat with a UDF in the where clause:
{code}
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'concat(HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(date#1776),年)
  at org.apache.spark.sql.catalyst.analysis.UnresolvedFunction.dataType(unresolved.scala:82)
  at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5$$anonfun$applyOrElse$15.apply(HiveTypeCoercion.scala:299)
  at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5$$anonfun$applyOrElse$15.apply(HiveTypeCoercion.scala:299)
  at scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:80)
  at scala.collection.immutable.List.exists(List.scala:84)
  at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5.applyOrElse(HiveTypeCoercion.scala:299)
  at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5.applyOrElse(HiveTypeCoercion.scala:298)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionDown$1(QueryPlan.scala:75)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:85)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
  at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
  at scala.collection.AbstractIterator.to(Iterator.scala:1157)
  at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
  at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
  at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
  at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:94)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:64)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformAllExpressions$1.applyOrElse(QueryPlan.scala:136)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformAllExpressions$1.applyOrElse(QueryPlan.scala:135)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:242)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
{code}
[jira] [Updated] (SPARK-8587) Return cost and cluster index KMeansModel.predict
[ https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8587: - Component/s: MLlib Return cost and cluster index KMeansModel.predict - Key: SPARK-8587 URL: https://issues.apache.org/jira/browse/SPARK-8587 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Sam Stoelinga Priority: Minor Looking at the PySpark implementation of KMeansModel.predict (https://github.com/apache/spark/blob/master/python/pyspark/mllib/clustering.py#L102): currently it calculates the cost of the closest cluster but returns only the index. My expectation: an easy way for the same function, or a new one, to also return the cost along with the index.
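The behavior requested above can be sketched in plain Python. This is a hypothetical helper, not the actual MLlib API: given the cluster centers and a point, return both the index of the closest center and its cost (the squared Euclidean distance, which is what k-means minimizes).

```python
def predict_with_cost(centers, point):
    """Return (index, cost) of the closest cluster center.

    `centers` is a list of centers, each a list of floats; `point` is a
    list of floats. The cost is the squared Euclidean distance.
    (Hypothetical helper illustrating the requested behavior.)
    """
    best_index, best_cost = -1, float("inf")
    for i, center in enumerate(centers):
        cost = sum((c - p) ** 2 for c, p in zip(center, point))
        if cost < best_cost:
            best_index, best_cost = i, cost
    return best_index, best_cost
```

The point is that both values are computed in the same pass anyway, so returning them together costs nothing extra.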
[jira] [Updated] (SPARK-8551) Python example code for elastic net
[ https://issues.apache.org/jira/browse/SPARK-8551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8551: - Component/s: PySpark Priority: Minor (was: Major) Python example code for elastic net --- Key: SPARK-8551 URL: https://issues.apache.org/jira/browse/SPARK-8551 Project: Spark Issue Type: New Feature Components: PySpark Reporter: Shuo Xiang Priority: Minor
[jira] [Updated] (SPARK-8585) Support LATERAL VIEW in Spark SQL parser
[ https://issues.apache.org/jira/browse/SPARK-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8585: - Component/s: SQL Priority: Minor (was: Major) (Components et al please: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark) Support LATERAL VIEW in Spark SQL parser Key: SPARK-8585 URL: https://issues.apache.org/jira/browse/SPARK-8585 Project: Spark Issue Type: Improvement Components: SQL Reporter: Konstantin Shaposhnikov Priority: Minor It would be good to support the LATERAL VIEW SQL syntax without the need to create a HiveContext. Docs: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView
[jira] [Resolved] (SPARK-8565) TF-IDF drops records
[ https://issues.apache.org/jira/browse/SPARK-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8565. -- Resolution: Not A Problem I can't find that bit of the docs, but I assume it refers to something done by the TF-IDF process. If you count the source (or some other transformation of the source) and then later apply TF-IDF, even if that caches something, it's already caching a different view. TF-IDF drops records Key: SPARK-8565 URL: https://issues.apache.org/jira/browse/SPARK-8565 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.1 Reporter: PJ Van Aeken When applying TF-IDF to an RDD[Seq[String]] with 1213 records, I get an RDD[Vector] back with only 1204 records. This prevents me from zipping it with the original so I can reattach the document ids.
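The reporter's expectation is that TF-IDF emits exactly one vector per input document, so the output can be zipped back with document ids. That invariant can be illustrated with a minimal plain-Python TF-IDF sketch (not Spark's implementation; the helper name is made up):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Minimal TF-IDF sketch: one output dict per input document,
    so len(output) == len(docs) and zipping with ids is safe.

    `docs` is a list of token lists. IDF here is log(N / df), with no
    smoothing (a simplification of what real implementations do).
    """
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    out = []
    for doc in docs:
        tf = Counter(doc)               # term frequency in this doc
        out.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return out
```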
[jira] [Assigned] (SPARK-8589) cleanup DateTimeUtils
[ https://issues.apache.org/jira/browse/SPARK-8589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8589: --- Assignee: (was: Apache Spark) cleanup DateTimeUtils - Key: SPARK-8589 URL: https://issues.apache.org/jira/browse/SPARK-8589 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan
[jira] [Comment Edited] (SPARK-8072) Better AnalysisException for writing DataFrame with identically named columns
[ https://issues.apache.org/jira/browse/SPARK-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14598890#comment-14598890 ] Animesh Baranawal edited comment on SPARK-8072 at 6/24/15 6:54 AM: --- [~rxin] If we want the rule to apply only on some save/output action, wouldn't it be much better to check the rule before calling the write function instead of adding the rule in CheckAnalysis.scala was (Author: animeshbaranawal): [~rxin] If we want the rule to apply only on some save/output action, wouldn't it be much more intuitive to check the rule before calling the write function instead of adding the rule in CheckAnalysis.scala Better AnalysisException for writing DataFrame with identically named columns - Key: SPARK-8072 URL: https://issues.apache.org/jira/browse/SPARK-8072 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Blocker We should check if there are duplicate columns and, if yes, throw an explicit error message saying there are duplicate columns. See the current error message below.
{code}
In [3]: df.withColumn('age', df.age)
Out[3]: DataFrame[age: bigint, name: string, age: bigint]

In [4]: df.withColumn('age', df.age).write.parquet('test-parquet.out')
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-4-eecb85256898> in <module>()
----> 1 df.withColumn('age', df.age).write.parquet('test-parquet.out')

/scratch/rxin/spark/python/pyspark/sql/readwriter.py in parquet(self, path, mode)
    350         df.write.parquet(os.path.join(tempfile.mkdtemp(), 'data'))
    351
--> 352         self._jwrite.mode(mode).parquet(path)
    353
    354     @since(1.4)

/Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/java_gateway.pyc in __call__(self, *args)
    535         answer = self.gateway_client.send_command(command)
    536         return_value = get_return_value(answer, self.gateway_client,
--> 537                 self.target_id, self.name)
    538
    539         for temp_arg in temp_args:

/Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
    298                 raise Py4JJavaError(
    299                     'An error occurred while calling {0}{1}{2}.\n'.
--> 300                     format(target_id, '.', name), value)
    301             else:
    302                 raise Py4JError(

Py4JJavaError: An error occurred while calling o35.parquet.
: org.apache.spark.sql.AnalysisException: Reference 'age' is ambiguous, could be: age#0L, age#3L.;
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:279)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:116)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
  at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:350)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:341)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.immutable.List.foreach(List.scala:318)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
  at scala.collection.AbstractTraversable.map(Traversable.scala:105)
{code}
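The check proposed in this ticket amounts to validating the column names up front, before the write is handed to the analyzer, and naming the offending columns explicitly. A minimal sketch of such a pre-write check in plain Python (a hypothetical helper, not Spark's actual code path):

```python
from collections import Counter

def check_duplicate_columns(columns):
    """Raise a clear error naming the duplicated columns, instead of
    letting the analyzer fail later with an ambiguous-reference error.
    (Hypothetical helper sketching the proposed check.)
    """
    dupes = [name for name, count in Counter(columns).items() if count > 1]
    if dupes:
        raise ValueError(
            "Duplicate column(s) found when writing: " + ", ".join(dupes))
```

For the example above, the check would fire on `['age', 'name', 'age']` and mention `age` directly, which is far easier to act on than `Reference 'age' is ambiguous`.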
[jira] [Resolved] (SPARK-8138) Error message for discovered conflicting partition columns is not intuitive
[ https://issues.apache.org/jira/browse/SPARK-8138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-8138. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6610 [https://github.com/apache/spark/pull/6610] Error message for discovered conflicting partition columns is not intuitive --- Key: SPARK-8138 URL: https://issues.apache.org/jira/browse/SPARK-8138 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0, 1.3.1, 1.4.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Minor Fix For: 1.5.0 For data stored as a Hive-style partitioned table, data files should only live in leaf partition directories. For example, the following directory layout is illegal:
{noformat}
.
├── _SUCCESS
├── b=0
│   ├── c=0
│   │   └── part-r-4.gz.parquet
│   └── part-r-4.gz.parquet
└── b=1
    ├── c=1
    │   └── part-r-8.gz.parquet
    └── part-r-8.gz.parquet
{noformat}
For now, we give an unintuitive error message like this:
{noformat}
Conflicting partition column names detected:
  ArrayBuffer(b, c)
  ArrayBuffer(b)
{noformat}
This should be improved.
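The improvement asked for here is mostly about reporting: when partition discovery finds files whose paths imply different partition column sets, say which paths disagree, not just the conflicting column lists. A plain-Python sketch of such a check (hypothetical helper names; not Spark's partitioning code):

```python
def partition_columns(path):
    """Extract Hive-style partition column names from a relative file
    path, e.g. 'b=0/c=0/part-r-4.gz.parquet' -> ('b', 'c')."""
    return tuple(seg.split("=")[0] for seg in path.split("/")[:-1] if "=" in seg)

def check_consistent_partitions(paths):
    """All data files must agree on the same partition columns; if not,
    report an example path for each conflicting column set."""
    seen = {}
    for p in paths:
        seen.setdefault(partition_columns(p), []).append(p)
    if len(seen) > 1:
        detail = "\n".join(
            "  {}: e.g. {}".format(list(cols), ps[0]) for cols, ps in seen.items())
        raise ValueError("Conflicting partition column names detected:\n" + detail)
```

Applied to the illegal layout above, the error would point at both `b=0/c=0/part-r-4.gz.parquet` and `b=0/part-r-4.gz.parquet`, making the misplaced non-leaf file obvious.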
[jira] [Commented] (SPARK-8589) cleanup DateTimeUtils
[ https://issues.apache.org/jira/browse/SPARK-8589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599113#comment-14599113 ] Apache Spark commented on SPARK-8589: - User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/6980 cleanup DateTimeUtils - Key: SPARK-8589 URL: https://issues.apache.org/jira/browse/SPARK-8589 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan
[jira] [Assigned] (SPARK-8589) cleanup DateTimeUtils
[ https://issues.apache.org/jira/browse/SPARK-8589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8589: --- Assignee: Apache Spark cleanup DateTimeUtils - Key: SPARK-8589 URL: https://issues.apache.org/jira/browse/SPARK-8589 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan Assignee: Apache Spark
[jira] [Commented] (SPARK-8587) Return cost and cluster index KMeansModel.predict
[ https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599006#comment-14599006 ] Apache Spark commented on SPARK-8587: - User 'samos123' has created a pull request for this issue: https://github.com/apache/spark/pull/6979 Return cost and cluster index KMeansModel.predict - Key: SPARK-8587 URL: https://issues.apache.org/jira/browse/SPARK-8587 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Sam Stoelinga Priority: Minor Looking at the PySpark implementation of KMeansModel.predict (https://github.com/apache/spark/blob/master/python/pyspark/mllib/clustering.py#L102): currently it calculates the cost of the closest cluster but returns only the index. My expectation: an easy way for the same function, or a new one, to also return the cost along with the index.
[jira] [Commented] (SPARK-8587) Return cost and cluster index KMeansModel.predict
[ https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599007#comment-14599007 ] Sam Stoelinga commented on SPARK-8587: -- Implemented a code example for PySpark: https://github.com/apache/spark/pull/6979. Feel free to discard this pull request in favor of a proper implementation in Scala and Java as well. Return cost and cluster index KMeansModel.predict - Key: SPARK-8587 URL: https://issues.apache.org/jira/browse/SPARK-8587 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Sam Stoelinga Priority: Minor Looking at the PySpark implementation of KMeansModel.predict (https://github.com/apache/spark/blob/master/python/pyspark/mllib/clustering.py#L102): currently it calculates the cost of the closest cluster but returns only the index. My expectation: an easy way for the same function, or a new one, to also return the cost along with the index.
[jira] [Assigned] (SPARK-8587) Return cost and cluster index KMeansModel.predict
[ https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8587: --- Assignee: (was: Apache Spark) Return cost and cluster index KMeansModel.predict - Key: SPARK-8587 URL: https://issues.apache.org/jira/browse/SPARK-8587 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Sam Stoelinga Priority: Minor Looking at the PySpark implementation of KMeansModel.predict (https://github.com/apache/spark/blob/master/python/pyspark/mllib/clustering.py#L102): currently it calculates the cost of the closest cluster but returns only the index. My expectation: an easy way for the same function, or a new one, to also return the cost along with the index.
[jira] [Assigned] (SPARK-8587) Return cost and cluster index KMeansModel.predict
[ https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8587: --- Assignee: Apache Spark Return cost and cluster index KMeansModel.predict - Key: SPARK-8587 URL: https://issues.apache.org/jira/browse/SPARK-8587 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Sam Stoelinga Assignee: Apache Spark Priority: Minor Looking at the PySpark implementation of KMeansModel.predict (https://github.com/apache/spark/blob/master/python/pyspark/mllib/clustering.py#L102): currently it calculates the cost of the closest cluster but returns only the index. My expectation: an easy way for the same function, or a new one, to also return the cost along with the index.
[jira] [Updated] (SPARK-8431) Add in operator to DataFrame Column in SparkR
[ https://issues.apache.org/jira/browse/SPARK-8431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8431: - Assignee: Yu Ishikawa Add in operator to DataFrame Column in SparkR - Key: SPARK-8431 URL: https://issues.apache.org/jira/browse/SPARK-8431 Project: Spark Issue Type: New Feature Components: SparkR, SQL Reporter: Yu Ishikawa Assignee: Yu Ishikawa Fix For: 1.5.0 To filter values in a set, we should add the {{%in%}} operator to SparkR.
{noformat}
df$a %in% c(1, 2, 3)
{noformat}
[jira] [Updated] (SPARK-8359) Spark SQL Decimal type precision loss on multiplication
[ https://issues.apache.org/jira/browse/SPARK-8359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8359: - Assignee: Liang-Chi Hsieh Spark SQL Decimal type precision loss on multiplication --- Key: SPARK-8359 URL: https://issues.apache.org/jira/browse/SPARK-8359 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Rene Treffer Assignee: Liang-Chi Hsieh Fix For: 1.5.0 It looks like the precision of Decimal cannot be raised beyond ~2^112 without causing full value truncation. The following code computes the powers of two up to a specific point:
{code}
import org.apache.spark.sql.types.Decimal

val one = Decimal(1)
val two = Decimal(2)

def pow(n: Int): Decimal = if (n <= 0) {
  one
} else {
  val a = pow(n - 1)
  a.changePrecision(n, 0)
  two.changePrecision(n, 0)
  a * two
}

(109 to 120).foreach(n => println(pow(n).toJavaBigDecimal.unscaledValue.toString))
{code}
Output:
{code}
649037107316853453566312041152512
1298074214633706907132624082305024
2596148429267413814265248164610048
5192296858534827628530496329220096
1038459371706965525706099265844019
2076918743413931051412198531688038
4153837486827862102824397063376076
8307674973655724205648794126752152
1661534994731144841129758825350430
3323069989462289682259517650700860
6646139978924579364519035301401720
1329227995784915872903807060280344
{code}
Beyond ~2^112 the precision is truncated even though the precision was set to n and should thus handle 10^n without problems.
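The same class of failure can be demonstrated with Python's `decimal` module rather than Spark's Decimal (an analogy, not the Spark code path): when the context precision is smaller than the number of digits in the product, the low-order digits are silently rounded away, which is exactly the truncation visible in the output above.

```python
from decimal import Decimal, localcontext

# 2**120 has 37 decimal digits.
with localcontext() as ctx:
    ctx.prec = 50   # ample precision: the product stays exact
    exact = Decimal(2) ** 120

with localcontext() as ctx:
    ctx.prec = 33   # too little precision: low-order digits are lost
    truncated = Decimal(2) ** 120
```

With 33 significant digits the 37-digit value is rounded, so `truncated` no longer equals the true power of two, mirroring how the reported values go wrong past ~2^112.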
[jira] [Updated] (SPARK-7235) Refactor the GroupingSet implementation
[ https://issues.apache.org/jira/browse/SPARK-7235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7235: - Assignee: Cheng Hao Refactor the GroupingSet implementation --- Key: SPARK-7235 URL: https://issues.apache.org/jira/browse/SPARK-7235 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Fix For: 1.5.0 The logical plan `Expand` takes the `output` as a constructor argument, which breaks the reference chain for logical plan optimization. We need to refactor the code.
[jira] [Updated] (SPARK-8104) move the auto alias logic into Analyzer
[ https://issues.apache.org/jira/browse/SPARK-8104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8104: - Assignee: Wenchen Fan move the auto alias logic into Analyzer --- Key: SPARK-8104 URL: https://issues.apache.org/jira/browse/SPARK-8104 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan Fix For: 1.5.0 Currently we auto-alias expressions in the parser. However, during the parse phase we don't have enough information to pick the right alias. For example, a Generator that produces more than one kind of element needs a MultiAlias, and an ExtractValue doesn't need an Alias if it's in the middle of an ExtractValue chain.
[jira] [Created] (SPARK-8589) cleanup DateTimeUtils
Wenchen Fan created SPARK-8589: -- Summary: cleanup DateTimeUtils Key: SPARK-8589 URL: https://issues.apache.org/jira/browse/SPARK-8589 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan
[jira] [Assigned] (SPARK-8181) date/time function: hour
[ https://issues.apache.org/jira/browse/SPARK-8181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8181: --- Assignee: (was: Apache Spark) date/time function: hour Key: SPARK-8181 URL: https://issues.apache.org/jira/browse/SPARK-8181 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin hour(string|date|timestamp): int Returns the hour of the timestamp: hour('2009-07-30 12:58:59') = 12, hour('12:58:59') = 12.
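The semantics in the description (both full timestamps and time-only strings are accepted) can be sketched in plain Python with the standard `datetime` module. This is an illustration of the contract, not the Spark SQL implementation:

```python
from datetime import datetime

def hour(ts):
    """hour('2009-07-30 12:58:59') -> 12; hour('12:58:59') -> 12.
    Tries a full timestamp format first, then a time-only format.
    (Sketch of the described semantics, not Spark's code.)
    """
    for fmt in ("%Y-%m-%d %H:%M:%S", "%H:%M:%S"):
        try:
            return datetime.strptime(ts, fmt).hour
        except ValueError:
            pass
    raise ValueError("unrecognized timestamp: " + ts)
```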
[jira] [Assigned] (SPARK-8182) date/time function: minute
[ https://issues.apache.org/jira/browse/SPARK-8182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8182: --- Assignee: Apache Spark date/time function: minute -- Key: SPARK-8182 URL: https://issues.apache.org/jira/browse/SPARK-8182 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark minute(string|date|timestamp): int Returns the minute of the timestamp. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
[jira] [Commented] (SPARK-8199) date/time function: date_format
[ https://issues.apache.org/jira/browse/SPARK-8199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599123#comment-14599123 ] Apache Spark commented on SPARK-8199: - User 'tarekauel' has created a pull request for this issue: https://github.com/apache/spark/pull/6981 date/time function: date_format --- Key: SPARK-8199 URL: https://issues.apache.org/jira/browse/SPARK-8199 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin date_format(date/timestamp/string ts, string fmt): string Converts a date/timestamp/string to a value of string in the format specified by the date format fmt (as of Hive 1.2.0). Supported formats are Java SimpleDateFormat formats – https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html. The second argument fmt should be constant. Example: date_format('2015-04-08', 'y') = '2015'. date_format can be used to implement other UDFs, e.g.: dayname(date) is date_format(date, '') dayofyear(date) is date_format(date, 'D')
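The idea of date_format, and of deriving functions like dayname and dayofyear from it, can be sketched in plain Python. Note that Hive's date_format uses Java SimpleDateFormat pattern letters ('y', 'D', etc.), whereas this sketch uses Python strftime codes ('%Y', '%j', '%A'); the helper names are hypothetical:

```python
from datetime import datetime

def date_format(ts, fmt):
    """Rough analog of Hive's date_format for date strings, using
    Python strftime codes instead of SimpleDateFormat patterns."""
    return datetime.strptime(ts, "%Y-%m-%d").strftime(fmt)

# Derived functions, mirroring the examples in the description:
def dayname(ts):
    return date_format(ts, "%A")    # '%A' plays the role of the day-name pattern

def dayofyear(ts):
    return date_format(ts, "%j")    # '%j' plays the role of 'D'
```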
[jira] [Assigned] (SPARK-8199) date/time function: date_format
[ https://issues.apache.org/jira/browse/SPARK-8199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8199: --- Assignee: (was: Apache Spark) date/time function: date_format --- Key: SPARK-8199 URL: https://issues.apache.org/jira/browse/SPARK-8199 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin date_format(date/timestamp/string ts, string fmt): string Converts a date/timestamp/string to a value of string in the format specified by the date format fmt (as of Hive 1.2.0). Supported formats are Java SimpleDateFormat formats – https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html. The second argument fmt should be constant. Example: date_format('2015-04-08', 'y') = '2015'. date_format can be used to implement other UDFs, e.g.: dayname(date) is date_format(date, '') dayofyear(date) is date_format(date, 'D')
[jira] [Assigned] (SPARK-8179) date/time function: month
[ https://issues.apache.org/jira/browse/SPARK-8179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8179: --- Assignee: (was: Apache Spark) date/time function: month - Key: SPARK-8179 URL: https://issues.apache.org/jira/browse/SPARK-8179 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin month(string|date|timestamp): int Returns the month part of a date or a timestamp string: month('1970-11-01 00:00:00') = 11, month('1970-11-01') = 11. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
[jira] [Commented] (SPARK-8180) date/time function: day / dayofmonth
[ https://issues.apache.org/jira/browse/SPARK-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599128#comment-14599128 ] Apache Spark commented on SPARK-8180: - User 'tarekauel' has created a pull request for this issue: https://github.com/apache/spark/pull/6981 date/time function: day / dayofmonth - Key: SPARK-8180 URL: https://issues.apache.org/jira/browse/SPARK-8180 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin day(string|date|timestamp): int dayofmonth(string|date|timestamp): int Returns the day part of a date or a timestamp string: day('1970-11-01 00:00:00') = 1, day('1970-11-01') = 1. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
[jira] [Assigned] (SPARK-8182) date/time function: minute
[ https://issues.apache.org/jira/browse/SPARK-8182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8182: --- Assignee: (was: Apache Spark) date/time function: minute -- Key: SPARK-8182 URL: https://issues.apache.org/jira/browse/SPARK-8182 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin minute(string|date|timestamp): int Returns the minute of the timestamp. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
[jira] [Commented] (SPARK-8184) date/time function: weekofyear
[ https://issues.apache.org/jira/browse/SPARK-8184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599124#comment-14599124 ] Apache Spark commented on SPARK-8184: - User 'tarekauel' has created a pull request for this issue: https://github.com/apache/spark/pull/6981 date/time function: weekofyear -- Key: SPARK-8184 URL: https://issues.apache.org/jira/browse/SPARK-8184 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin weekofyear(string|date|timestamp): int Returns the week number of a timestamp string: weekofyear('1970-11-01 00:00:00') = 44, weekofyear('1970-11-01') = 44.
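The week numbers in the description match the ISO-8601 week numbering, which Python's `datetime.isocalendar()` exposes directly. A plain-Python sketch of the described function (hypothetical helper, not the Spark SQL implementation):

```python
from datetime import datetime

def weekofyear(ts):
    """ISO week number of a date or timestamp string, matching the
    description's weekofyear('1970-11-01') = 44. Only the date part
    of a timestamp string matters for the week number."""
    date_part = ts.split(" ")[0]
    return datetime.strptime(date_part, "%Y-%m-%d").isocalendar()[1]
```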
[jira] [Commented] (SPARK-8177) date/time function: year
[ https://issues.apache.org/jira/browse/SPARK-8177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599130#comment-14599130 ] Apache Spark commented on SPARK-8177: - User 'tarekauel' has created a pull request for this issue: https://github.com/apache/spark/pull/6981 date/time function: year Key: SPARK-8177 URL: https://issues.apache.org/jira/browse/SPARK-8177 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin year(string|date|timestamp): int Returns the year part of a date or a timestamp string: year(1970-01-01 00:00:00) = 1970, year(1970-01-01) = 1970. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8180) date/time function: day / dayofmonth
[ https://issues.apache.org/jira/browse/SPARK-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8180: --- Assignee: (was: Apache Spark) date/time function: day / dayofmonth - Key: SPARK-8180 URL: https://issues.apache.org/jira/browse/SPARK-8180 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin day(string|date|timestamp): int dayofmonth(string|date|timestamp): int Returns the day part of a date or a timestamp string: day(1970-11-01 00:00:00) = 1, day(1970-11-01) = 1. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8199) date/time function: date_format
[ https://issues.apache.org/jira/browse/SPARK-8199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8199: --- Assignee: Apache Spark date/time function: date_format --- Key: SPARK-8199 URL: https://issues.apache.org/jira/browse/SPARK-8199 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark date_format(date/timestamp/string ts, string fmt): string Converts a date/timestamp/string to a value of string in the format specified by the date format fmt (as of Hive 1.2.0). Supported formats are Java SimpleDateFormat formats – https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html. The second argument fmt should be constant. Example: date_format('2015-04-08', 'y') = '2015'. date_format can be used to implement other UDFs, e.g.: dayname(date) is date_format(date, '') dayofyear(date) is date_format(date, 'D') -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
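The `date_format` behavior above can be approximated with `strftime`; note that Java `SimpleDateFormat` pattern letters differ from `strftime` directives, so the small mapping table below is an assumption covering only the two patterns quoted in the ticket ('y' for year, 'D' for day-of-year):

```python
from datetime import datetime

# Minimal SimpleDateFormat -> strftime mapping; only the patterns used
# in the ticket's examples are covered.
_PATTERNS = {"y": "%Y", "D": "%j"}

def date_format(ds: str, fmt: str) -> str:
    """Sketch of date_format() for date-string input."""
    return datetime.strptime(ds, "%Y-%m-%d").strftime(_PATTERNS[fmt])

print(date_format("2015-04-08", "y"))  # '2015'
print(date_format("2015-04-08", "D"))  # '098' (strftime %j zero-pads)
```

One visible difference: `%j` zero-pads the day-of-year, whereas `SimpleDateFormat`'s 'D' does not, so a faithful port would strip leading zeros.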
[jira] [Assigned] (SPARK-8183) date/time function: second
[ https://issues.apache.org/jira/browse/SPARK-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8183: --- Assignee: Apache Spark date/time function: second -- Key: SPARK-8183 URL: https://issues.apache.org/jira/browse/SPARK-8183 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark second(string|date|timestamp): int Returns the second of the timestamp. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8177) date/time function: year
[ https://issues.apache.org/jira/browse/SPARK-8177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8177: --- Assignee: (was: Apache Spark) date/time function: year Key: SPARK-8177 URL: https://issues.apache.org/jira/browse/SPARK-8177 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin year(string|date|timestamp): int Returns the year part of a date or a timestamp string: year(1970-01-01 00:00:00) = 1970, year(1970-01-01) = 1970. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8179) date/time function: month
[ https://issues.apache.org/jira/browse/SPARK-8179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8179: --- Assignee: Apache Spark date/time function: month - Key: SPARK-8179 URL: https://issues.apache.org/jira/browse/SPARK-8179 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark month(string|date|timestamp): int Returns the month part of a date or a timestamp string: month(1970-11-01 00:00:00) = 11, month(1970-11-01) = 11. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8183) date/time function: second
[ https://issues.apache.org/jira/browse/SPARK-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8183: --- Assignee: (was: Apache Spark) date/time function: second -- Key: SPARK-8183 URL: https://issues.apache.org/jira/browse/SPARK-8183 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin second(string|date|timestamp): int Returns the second of the timestamp. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8183) date/time function: second
[ https://issues.apache.org/jira/browse/SPARK-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599125#comment-14599125 ] Apache Spark commented on SPARK-8183: - User 'tarekauel' has created a pull request for this issue: https://github.com/apache/spark/pull/6981 date/time function: second -- Key: SPARK-8183 URL: https://issues.apache.org/jira/browse/SPARK-8183 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin second(string|date|timestamp): int Returns the second of the timestamp. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8181) date/time function: hour
[ https://issues.apache.org/jira/browse/SPARK-8181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599127#comment-14599127 ] Apache Spark commented on SPARK-8181: - User 'tarekauel' has created a pull request for this issue: https://github.com/apache/spark/pull/6981 date/time function: hour Key: SPARK-8181 URL: https://issues.apache.org/jira/browse/SPARK-8181 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin hour(string|date|timestamp): int Returns the hour of the timestamp: hour('2009-07-30 12:58:59') = 12, hour('12:58:59') = 12. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
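The hour() examples above include both a full timestamp and a bare time string; that dual acceptance can be sketched as follows (an illustrative model of the described semantics, not Spark's implementation):

```python
from datetime import datetime

def hour(value: str) -> int:
    """Sketch of hour() per the SPARK-8181 examples: accepts a full
    timestamp string or a bare HH:MM:SS time string."""
    for fmt in ("%Y-%m-%d %H:%M:%S", "%H:%M:%S"):
        try:
            return datetime.strptime(value, fmt).hour
        except ValueError:
            continue
    raise ValueError(f"unrecognized timestamp string: {value!r}")

print(hour("2009-07-30 12:58:59"))  # 12
print(hour("12:58:59"))             # 12
```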
[jira] [Assigned] (SPARK-8180) date/time function: day / dayofmonth
[ https://issues.apache.org/jira/browse/SPARK-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8180: --- Assignee: Apache Spark date/time function: day / dayofmonth - Key: SPARK-8180 URL: https://issues.apache.org/jira/browse/SPARK-8180 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark day(string|date|timestamp): int dayofmonth(string|date|timestamp): int Returns the day part of a date or a timestamp string: day(1970-11-01 00:00:00) = 1, day(1970-11-01) = 1. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8177) date/time function: year
[ https://issues.apache.org/jira/browse/SPARK-8177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8177: --- Assignee: Apache Spark date/time function: year Key: SPARK-8177 URL: https://issues.apache.org/jira/browse/SPARK-8177 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark year(string|date|timestamp): int Returns the year part of a date or a timestamp string: year(1970-01-01 00:00:00) = 1970, year(1970-01-01) = 1970. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8184) date/time function: weekofyear
[ https://issues.apache.org/jira/browse/SPARK-8184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8184: --- Assignee: Apache Spark date/time function: weekofyear -- Key: SPARK-8184 URL: https://issues.apache.org/jira/browse/SPARK-8184 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark weekofyear(string|date|timestamp): int Returns the week number of a timestamp string: weekofyear(1970-11-01 00:00:00) = 44, weekofyear(1970-11-01) = 44. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8179) date/time function: month
[ https://issues.apache.org/jira/browse/SPARK-8179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599129#comment-14599129 ] Apache Spark commented on SPARK-8179: - User 'tarekauel' has created a pull request for this issue: https://github.com/apache/spark/pull/6981 date/time function: month - Key: SPARK-8179 URL: https://issues.apache.org/jira/browse/SPARK-8179 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin month(string|date|timestamp): int Returns the month part of a date or a timestamp string: month(1970-11-01 00:00:00) = 11, month(1970-11-01) = 11. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8371) improve unit test for MaxOf and MinOf
[ https://issues.apache.org/jira/browse/SPARK-8371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8371. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6825 [https://github.com/apache/spark/pull/6825] improve unit test for MaxOf and MinOf - Key: SPARK-8371 URL: https://issues.apache.org/jira/browse/SPARK-8371 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8533) Bump Flume version to 1.6.0
[ https://issues.apache.org/jira/browse/SPARK-8533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8533: - Component/s: Streaming Priority: Minor (was: Major) Issue Type: Task (was: Bug) (Let's set component / type / priority) Bump Flume version to 1.6.0 --- Key: SPARK-8533 URL: https://issues.apache.org/jira/browse/SPARK-8533 Project: Spark Issue Type: Task Components: Streaming Reporter: Hari Shreedharan Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8567) Flaky test: o.a.s.sql.hive.HiveSparkSubmitSuite --jars
[ https://issues.apache.org/jira/browse/SPARK-8567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14598990#comment-14598990 ] Apache Spark commented on SPARK-8567: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/6978 Flaky test: o.a.s.sql.hive.HiveSparkSubmitSuite --jars -- Key: SPARK-8567 URL: https://issues.apache.org/jira/browse/SPARK-8567 Project: Spark Issue Type: Improvement Components: SQL, Tests Affects Versions: 1.4.1 Reporter: Yin Huai Assignee: Yin Huai Priority: Critical Labels: flaky-test Seems tests in HiveSparkSubmitSuite fail with timeout pretty frequently. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8588) Could not use concat with UDF in where clause
StanZhai created SPARK-8588: --- Summary: Could not use concat with UDF in where clause Key: SPARK-8588 URL: https://issues.apache.org/jira/browse/SPARK-8588 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Environment: Centos 7, java 1.7.0_67, scala 2.10.5, run in a spark standalone cluster(or local). Reporter: StanZhai Priority: Blocker After upgraded the cluster from spark 1.3.1 to 1.4.0(rc4), I encountered the following exception when use concat with UDF in where clause: {code} org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'concat(HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(date#1776),年) at org.apache.spark.sql.catalyst.analysis.UnresolvedFunction.dataType(unresolved.scala:82) at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5$$anonfun$applyOrElse$15.apply(HiveTypeCoercion.scala:299) at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5$$anonfun$applyOrElse$15.apply(HiveTypeCoercion.scala:299) at scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:80) at scala.collection.immutable.List.exists(List.scala:84) at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5.applyOrElse(HiveTypeCoercion.scala:299) at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5.applyOrElse(HiveTypeCoercion.scala:298) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionDown$1(QueryPlan.scala:75) at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:85) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:94) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:64) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformAllExpressions$1.applyOrElse(QueryPlan.scala:136) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformAllExpressions$1.applyOrElse(QueryPlan.scala:135) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:242) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at
[jira] [Updated] (SPARK-8092) OneVsRest doesn't allow flexibility in label/ feature column renaming
[ https://issues.apache.org/jira/browse/SPARK-8092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8092: - Fix Version/s: (was: 1.4.1) [~rams] let's not set fix version until it's resolved https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark OneVsRest doesn't allow flexibility in label/ feature column renaming - Key: SPARK-8092 URL: https://issues.apache.org/jira/browse/SPARK-8092 Project: Spark Issue Type: Bug Components: ML Reporter: Ram Sriharsha Assignee: Ram Sriharsha -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8184) date/time function: weekofyear
[ https://issues.apache.org/jira/browse/SPARK-8184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8184: --- Assignee: (was: Apache Spark) date/time function: weekofyear -- Key: SPARK-8184 URL: https://issues.apache.org/jira/browse/SPARK-8184 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin weekofyear(string|date|timestamp): int Returns the week number of a timestamp string: weekofyear(1970-11-01 00:00:00) = 44, weekofyear(1970-11-01) = 44. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8182) date/time function: minute
[ https://issues.apache.org/jira/browse/SPARK-8182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599126#comment-14599126 ] Apache Spark commented on SPARK-8182: - User 'tarekauel' has created a pull request for this issue: https://github.com/apache/spark/pull/6981 date/time function: minute -- Key: SPARK-8182 URL: https://issues.apache.org/jira/browse/SPARK-8182 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin minute(string|date|timestamp): int Returns the minute of the timestamp. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8181) date/time function: hour
[ https://issues.apache.org/jira/browse/SPARK-8181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8181: --- Assignee: Apache Spark date/time function: hour Key: SPARK-8181 URL: https://issues.apache.org/jira/browse/SPARK-8181 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark hour(string|date|timestamp): int Returns the hour of the timestamp: hour('2009-07-30 12:58:59') = 12, hour('12:58:59') = 12. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8565) TF-IDF drops records
[ https://issues.apache.org/jira/browse/SPARK-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599035#comment-14599035 ] PJ Van Aeken commented on SPARK-8565: - Ok, caching the source RDD works. But wouldn't the tf.cache() as described in the documentation of TF-IDF already materialize the ES source? TF-IDF drops records Key: SPARK-8565 URL: https://issues.apache.org/jira/browse/SPARK-8565 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.1 Reporter: PJ Van Aeken When applying TFIDF on an RDD[Seq[String]] with 1213 records, I get an RDD[Vector] back with only 1204 records. This prevents me from zipping it with the original so I can reattach the document ids. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8049) OneVsRest's output includes a temp column
[ https://issues.apache.org/jira/browse/SPARK-8049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8049. -- Resolution: Fixed OneVsRest's output includes a temp column - Key: SPARK-8049 URL: https://issues.apache.org/jira/browse/SPARK-8049 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.4.1, 1.5.0 The temp accumulator column mbc$acc is included in the output which should be removed with withoutColumn. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8214) math function: hex
[ https://issues.apache.org/jira/browse/SPARK-8214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8214: --- Assignee: Apache Spark (was: zhichao-li) math function: hex -- Key: SPARK-8214 URL: https://issues.apache.org/jira/browse/SPARK-8214 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark hex(BIGINT a): string hex(STRING a): string hex(BINARY a): string If the argument is an INT or binary, hex returns the number as a STRING in hexadecimal format. Otherwise if the number is a STRING, it converts each character into its hexadecimal representation and returns the resulting STRING. (See http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_hex, BINARY version as of Hive 0.12.0.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8214) math function: hex
[ https://issues.apache.org/jira/browse/SPARK-8214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14598961#comment-14598961 ] Apache Spark commented on SPARK-8214: - User 'zhichao-li' has created a pull request for this issue: https://github.com/apache/spark/pull/6976 math function: hex -- Key: SPARK-8214 URL: https://issues.apache.org/jira/browse/SPARK-8214 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: zhichao-li hex(BIGINT a): string hex(STRING a): string hex(BINARY a): string If the argument is an INT or binary, hex returns the number as a STRING in hexadecimal format. Otherwise if the number is a STRING, it converts each character into its hexadecimal representation and returns the resulting STRING. (See http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_hex, BINARY version as of Hive 0.12.0.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8214) math function: hex
[ https://issues.apache.org/jira/browse/SPARK-8214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8214: --- Assignee: zhichao-li (was: Apache Spark) math function: hex -- Key: SPARK-8214 URL: https://issues.apache.org/jira/browse/SPARK-8214 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: zhichao-li hex(BIGINT a): string hex(STRING a): string hex(BINARY a): string If the argument is an INT or binary, hex returns the number as a STRING in hexadecimal format. Otherwise if the number is a STRING, it converts each character into its hexadecimal representation and returns the resulting STRING. (See http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_hex, BINARY version as of Hive 0.12.0.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
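The two-sided behavior of hex() described in SPARK-8214 (numbers become hexadecimal, strings are encoded character by character) can be sketched in a few lines; `hive_hex` is a hypothetical name for this illustration:

```python
def hive_hex(value) -> str:
    """Sketch of hex() per SPARK-8214: integers are rendered in
    hexadecimal; strings and binary are hex-encoded byte by byte."""
    if isinstance(value, int):
        return format(value, "X")
    if isinstance(value, str):
        value = value.encode("utf-8")
    return value.hex().upper()

print(hive_hex(255))   # 'FF'
print(hive_hex("AB"))  # '4142' ('A' is 0x41, 'B' is 0x42)
```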
[jira] [Updated] (SPARK-8111) SparkR shell should display Spark logo and version banner on startup
[ https://issues.apache.org/jira/browse/SPARK-8111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8111: - Assignee: Alok Singh SparkR shell should display Spark logo and version banner on startup Key: SPARK-8111 URL: https://issues.apache.org/jira/browse/SPARK-8111 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 1.4.0 Reporter: Matei Zaharia Assignee: Alok Singh Priority: Trivial Labels: Starter -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8111) SparkR shell should display Spark logo and version banner on startup
[ https://issues.apache.org/jira/browse/SPARK-8111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14598937#comment-14598937 ] Sean Owen commented on SPARK-8111: -- [~shivaram] done though I also just made you a JIRA admin, so that you can add Contributors at https://issues.apache.org/jira/plugins/servlet/project-config/SPARK/roles (Just be aware you can now edit lots of things in JIRA so careful what you click!) SparkR shell should display Spark logo and version banner on startup Key: SPARK-8111 URL: https://issues.apache.org/jira/browse/SPARK-8111 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 1.4.0 Reporter: Matei Zaharia Assignee: Alok Singh Priority: Trivial Labels: Starter -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8393) JavaStreamingContext#awaitTermination() throws non-declared InterruptedException
[ https://issues.apache.org/jira/browse/SPARK-8393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599185#comment-14599185 ] Jaromir Vanek edited comment on SPARK-8393 at 6/24/15 9:48 AM: --- I think the suggested workaround is fine for the current 1.x version of Spark. So updating the documentation would be proper solution to prevent other developers from unexpected problems. But in the next major version of Spark it should be fixed properly and {{awaitTerminatio}} method should be declared to throw {{InterruptedException}}. was (Author: vanekjar): I think the suggested workaround is fine for the current 1.x version of Spark. So updating the documentation would be proper solution to prevent other developers from unexpected problems. But in the next major version of Spark it should be fixed properly and `awaitTermination` method should be declared to throw `InterruptedException`. JavaStreamingContext#awaitTermination() throws non-declared InterruptedException Key: SPARK-8393 URL: https://issues.apache.org/jira/browse/SPARK-8393 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.1 Reporter: Jaromir Vanek Priority: Trivial Call to {{JavaStreamingContext#awaitTermination()}} can throw {{InterruptedException}} which cannot be caught easily in Java because it's not declared in {{@throws(classOf[InterruptedException])}} annotation. This {{InterruptedException}} comes originally from {{ContextWaiter}} where Java {{ReentrantLock}} is used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8393) JavaStreamingContext#awaitTermination() throws non-declared InterruptedException
[ https://issues.apache.org/jira/browse/SPARK-8393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599185#comment-14599185 ] Jaromir Vanek edited comment on SPARK-8393 at 6/24/15 9:49 AM: --- I think the suggested workaround is fine for the current 1.x version of Spark. So updating the documentation would be proper solution to prevent other developers from unexpected problems. But in the next major version of Spark it should be fixed properly and {{awaitTermination}} method should be declared to throw {{InterruptedException}}. was (Author: vanekjar): I think the suggested workaround is fine for the current 1.x version of Spark. So updating the documentation would be proper solution to prevent other developers from unexpected problems. But in the next major version of Spark it should be fixed properly and {{awaitTerminatio}} method should be declared to throw {{InterruptedException}}. JavaStreamingContext#awaitTermination() throws non-declared InterruptedException Key: SPARK-8393 URL: https://issues.apache.org/jira/browse/SPARK-8393 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.1 Reporter: Jaromir Vanek Priority: Trivial Call to {{JavaStreamingContext#awaitTermination()}} can throw {{InterruptedException}} which cannot be caught easily in Java because it's not declared in {{@throws(classOf[InterruptedException])}} annotation. This {{InterruptedException}} comes originally from {{ContextWaiter}} where Java {{ReentrantLock}} is used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6912) Support Map&lt;K,V&gt; as a return type in Hive UDF
[ https://issues.apache.org/jira/browse/SPARK-6912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-6912: Affects Version/s: 1.4.0 Support Map&lt;K,V&gt; as a return type in Hive UDF - Key: SPARK-6912 URL: https://issues.apache.org/jira/browse/SPARK-6912 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Takeshi Yamamuro The current implementation can't handle Map&lt;K,V&gt; as a return type in Hive UDF. We assume a UDF below; public class UDFToIntIntMap extends UDF { public Map&lt;Integer, Integer&gt; evaluate(Object o); } Hive supports this type, and see a link below for details; https://github.com/kyluka/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFBridge.java#L163 https://github.com/l1x/apache-hive/blob/master/hive-0.8.1/src/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorFactory.java#L118 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6785) DateUtils can not handle date before 1970/01/01 correctly
[ https://issues.apache.org/jira/browse/SPARK-6785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599278#comment-14599278 ] Apache Spark commented on SPARK-6785: - User 'ckadner' has created a pull request for this issue: https://github.com/apache/spark/pull/6983 DateUtils can not handle date before 1970/01/01 correctly - Key: SPARK-6785 URL: https://issues.apache.org/jira/browse/SPARK-6785 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu Assignee: Christian Kadner {code} scala> val d = new Date(100) d: java.sql.Date = 1969-12-31 scala> DateUtils.toJavaDate(DateUtils.fromJavaDate(d)) res1: java.sql.Date = 1970-01-01 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8175) date/time function: from_unixtime
[ https://issues.apache.org/jira/browse/SPARK-8175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8175: --- Assignee: Apache Spark date/time function: from_unixtime - Key: SPARK-8175 URL: https://issues.apache.org/jira/browse/SPARK-8175 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark from_unixtime(bigint unixtime[, string format]): string Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the format of 1970-01-01 00:00:00. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8174) date/time function: unix_timestamp
[ https://issues.apache.org/jira/browse/SPARK-8174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8174: --- Assignee: (was: Apache Spark) date/time function: unix_timestamp -- Key: SPARK-8174 URL: https://issues.apache.org/jira/browse/SPARK-8174 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin 3 variants: {code} unix_timestamp(): long Gets current Unix timestamp in seconds. unix_timestamp(string|date): long Converts a time string in the format yyyy-MM-dd HH:mm:ss to a Unix timestamp (in seconds), using the default timezone and the default locale; returns 0 on failure: unix_timestamp('2009-03-20 11:30:01') = 1237573801 unix_timestamp(string date, string pattern): long Converts a time string with the given pattern (see [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) to a Unix timestamp (in seconds); returns 0 on failure: unix_timestamp('2009-03-20', 'yyyy-MM-dd') = 1237532400. {code} See: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
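The three unix_timestamp variants above are essentially pattern-based epoch conversions. A minimal sketch of the two string-parsing variants using plain SimpleDateFormat (the helper name and the explicit America/Los_Angeles zone are assumptions; the ticket's example values match a US/Pacific default zone):

```java
import java.text.SimpleDateFormat;
import java.util.TimeZone;

public class UnixTimestampSketch {
    // Hypothetical helper mirroring unix_timestamp(string date, string pattern):
    // parse the string with the given pattern in the given zone, return epoch seconds.
    static long unixTimestamp(String s, String pattern, TimeZone tz) throws Exception {
        SimpleDateFormat fmt = new SimpleDateFormat(pattern);
        fmt.setTimeZone(tz);
        return fmt.parse(s).getTime() / 1000L;
    }

    public static void main(String[] args) throws Exception {
        TimeZone pacific = TimeZone.getTimeZone("America/Los_Angeles");
        // Matches the ticket's example: unix_timestamp('2009-03-20 11:30:01') = 1237573801
        System.out.println(unixTimestamp("2009-03-20 11:30:01", "yyyy-MM-dd HH:mm:ss", pacific));
        // Date-only pattern: midnight local time on 2009-03-20 = 1237532400
        System.out.println(unixTimestamp("2009-03-20", "yyyy-MM-dd", pacific));
    }
}
```

Note that the one-argument variant is timezone-dependent: the same string yields a different epoch under a different default zone, which is why this sketch pins the zone explicitly.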
[jira] [Assigned] (SPARK-8188) date/time function: from_utc_timestamp
[ https://issues.apache.org/jira/browse/SPARK-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8188: --- Assignee: (was: Apache Spark) date/time function: from_utc_timestamp -- Key: SPARK-8188 URL: https://issues.apache.org/jira/browse/SPARK-8188 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin from_utc_timestamp(timestamp, string timezone): timestamp Assumes given timestamp is UTC and converts to given timezone (as of Hive 0.8.0). For example, from_utc_timestamp('1970-01-01 08:00:00','PST') returns 1970-01-01 00:00:00. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8191) date/time function: to_utc_timestamp
[ https://issues.apache.org/jira/browse/SPARK-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8191: --- Assignee: Apache Spark date/time function: to_utc_timestamp Key: SPARK-8191 URL: https://issues.apache.org/jira/browse/SPARK-8191 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark to_utc_timestamp(timestamp, string timezone): timestamp Assumes given timestamp is in given timezone and converts to UTC (as of Hive 0.8.0). For example, to_utc_timestamp('1970-01-01 00:00:00','PST') returns 1970-01-01 08:00:00. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8188) date/time function: from_utc_timestamp
[ https://issues.apache.org/jira/browse/SPARK-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599315#comment-14599315 ] Apache Spark commented on SPARK-8188: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/6984 date/time function: from_utc_timestamp -- Key: SPARK-8188 URL: https://issues.apache.org/jira/browse/SPARK-8188 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin from_utc_timestamp(timestamp, string timezone): timestamp Assumes given timestamp is UTC and converts to given timezone (as of Hive 0.8.0). For example, from_utc_timestamp('1970-01-01 08:00:00','PST') returns 1970-01-01 00:00:00. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8175) date/time function: from_unixtime
[ https://issues.apache.org/jira/browse/SPARK-8175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8175: --- Assignee: (was: Apache Spark) date/time function: from_unixtime - Key: SPARK-8175 URL: https://issues.apache.org/jira/browse/SPARK-8175 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin from_unixtime(bigint unixtime[, string format]): string Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the format of 1970-01-01 00:00:00. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8175) date/time function: from_unixtime
[ https://issues.apache.org/jira/browse/SPARK-8175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599314#comment-14599314 ] Apache Spark commented on SPARK-8175: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/6984 date/time function: from_unixtime - Key: SPARK-8175 URL: https://issues.apache.org/jira/browse/SPARK-8175 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin from_unixtime(bigint unixtime[, string format]): string Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the format of 1970-01-01 00:00:00. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8174) date/time function: unix_timestamp
[ https://issues.apache.org/jira/browse/SPARK-8174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8174: --- Assignee: Apache Spark date/time function: unix_timestamp -- Key: SPARK-8174 URL: https://issues.apache.org/jira/browse/SPARK-8174 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark 3 variants: {code} unix_timestamp(): long Gets current Unix timestamp in seconds. unix_timestamp(string|date): long Converts a time string in the format yyyy-MM-dd HH:mm:ss to a Unix timestamp (in seconds), using the default timezone and the default locale; returns 0 on failure: unix_timestamp('2009-03-20 11:30:01') = 1237573801 unix_timestamp(string date, string pattern): long Converts a time string with the given pattern (see [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) to a Unix timestamp (in seconds); returns 0 on failure: unix_timestamp('2009-03-20', 'yyyy-MM-dd') = 1237532400. {code} See: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8191) date/time function: to_utc_timestamp
[ https://issues.apache.org/jira/browse/SPARK-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599316#comment-14599316 ] Apache Spark commented on SPARK-8191: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/6984 date/time function: to_utc_timestamp Key: SPARK-8191 URL: https://issues.apache.org/jira/browse/SPARK-8191 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin to_utc_timestamp(timestamp, string timezone): timestamp Assumes given timestamp is in given timezone and converts to UTC (as of Hive 0.8.0). For example, to_utc_timestamp('1970-01-01 00:00:00','PST') returns 1970-01-01 08:00:00. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8188) date/time function: from_utc_timestamp
[ https://issues.apache.org/jira/browse/SPARK-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8188: --- Assignee: Apache Spark date/time function: from_utc_timestamp -- Key: SPARK-8188 URL: https://issues.apache.org/jira/browse/SPARK-8188 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark from_utc_timestamp(timestamp, string timezone): timestamp Assumes given timestamp is UTC and converts to given timezone (as of Hive 0.8.0). For example, from_utc_timestamp('1970-01-01 08:00:00','PST') returns 1970-01-01 00:00:00. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8191) date/time function: to_utc_timestamp
[ https://issues.apache.org/jira/browse/SPARK-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8191: --- Assignee: (was: Apache Spark) date/time function: to_utc_timestamp Key: SPARK-8191 URL: https://issues.apache.org/jira/browse/SPARK-8191 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin to_utc_timestamp(timestamp, string timezone): timestamp Assumes given timestamp is in given timezone and converts to UTC (as of Hive 0.8.0). For example, to_utc_timestamp('1970-01-01 00:00:00','PST') returns 1970-01-01 08:00:00. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
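Both from_utc_timestamp and to_utc_timestamp described above are wall-clock reinterpretations: the timestamp is read as an instant in one zone and rendered in another. A rough sketch of these semantics with SimpleDateFormat (the `shift` helper is an assumption, not Spark or Hive API):

```java
import java.text.SimpleDateFormat;
import java.util.TimeZone;

public class UtcTimestampSketch {
    // Hypothetical helper: interpret ts as wall-clock time in fromZone,
    // then format the same instant as wall-clock time in toZone.
    static String shift(String ts, String fromZone, String toZone) throws Exception {
        SimpleDateFormat in = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        in.setTimeZone(TimeZone.getTimeZone(fromZone));
        SimpleDateFormat out = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        out.setTimeZone(TimeZone.getTimeZone(toZone));
        return out.format(in.parse(ts));
    }

    public static void main(String[] args) throws Exception {
        // to_utc_timestamp('1970-01-01 00:00:00','PST') -> 1970-01-01 08:00:00
        System.out.println(shift("1970-01-01 00:00:00", "PST", "UTC"));
        // from_utc_timestamp('1970-01-01 08:00:00','PST') -> 1970-01-01 00:00:00
        System.out.println(shift("1970-01-01 08:00:00", "UTC", "PST"));
    }
}
```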
[jira] [Created] (SPARK-8590) add code gen for ExtractValue
Wenchen Fan created SPARK-8590: -- Summary: add code gen for ExtractValue Key: SPARK-8590 URL: https://issues.apache.org/jira/browse/SPARK-8590 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8590) add code gen for ExtractValue
[ https://issues.apache.org/jira/browse/SPARK-8590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599183#comment-14599183 ] Apache Spark commented on SPARK-8590: - User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/6982 add code gen for ExtractValue - Key: SPARK-8590 URL: https://issues.apache.org/jira/browse/SPARK-8590 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8393) JavaStreamingContext#awaitTermination() throws non-declared InterruptedException
[ https://issues.apache.org/jira/browse/SPARK-8393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599185#comment-14599185 ] Jaromir Vanek commented on SPARK-8393: -- I think the suggested workaround is fine for the current 1.x version of Spark, so updating the documentation would be a proper solution to prevent other developers from running into unexpected problems. But in the next major version of Spark it should be fixed properly, and the {{awaitTermination}} method should be declared to throw {{InterruptedException}}. JavaStreamingContext#awaitTermination() throws non-declared InterruptedException Key: SPARK-8393 URL: https://issues.apache.org/jira/browse/SPARK-8393 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.1 Reporter: Jaromir Vanek Priority: Trivial Call to {{JavaStreamingContext#awaitTermination()}} can throw {{InterruptedException}}, which cannot be caught easily in Java because it is not declared in a {{@throws(classOf[InterruptedException])}} annotation. This {{InterruptedException}} originates in {{ContextWaiter}}, where a Java {{ReentrantLock}} is used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8590) add code gen for ExtractValue
[ https://issues.apache.org/jira/browse/SPARK-8590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8590: --- Assignee: (was: Apache Spark) add code gen for ExtractValue - Key: SPARK-8590 URL: https://issues.apache.org/jira/browse/SPARK-8590 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8590) add code gen for ExtractValue
[ https://issues.apache.org/jira/browse/SPARK-8590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8590: --- Assignee: Apache Spark add code gen for ExtractValue - Key: SPARK-8590 URL: https://issues.apache.org/jira/browse/SPARK-8590 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8195) date/time function: last_day
[ https://issues.apache.org/jira/browse/SPARK-8195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8195: --- Assignee: Apache Spark date/time function: last_day Key: SPARK-8195 URL: https://issues.apache.org/jira/browse/SPARK-8195 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark last_day(string date): string last_day(date date): date Returns the last day of the month which the date belongs to (as of Hive 1.1.0). date is a string in the format 'yyyy-MM-dd HH:mm:ss' or 'yyyy-MM-dd'. The time part of date is ignored. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8195) date/time function: last_day
[ https://issues.apache.org/jira/browse/SPARK-8195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599338#comment-14599338 ] Apache Spark commented on SPARK-8195: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/6986 date/time function: last_day Key: SPARK-8195 URL: https://issues.apache.org/jira/browse/SPARK-8195 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin last_day(string date): string last_day(date date): date Returns the last day of the month which the date belongs to (as of Hive 1.1.0). date is a string in the format 'yyyy-MM-dd HH:mm:ss' or 'yyyy-MM-dd'. The time part of date is ignored. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8195) date/time function: last_day
[ https://issues.apache.org/jira/browse/SPARK-8195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8195: --- Assignee: (was: Apache Spark) date/time function: last_day Key: SPARK-8195 URL: https://issues.apache.org/jira/browse/SPARK-8195 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin last_day(string date): string last_day(date date): date Returns the last day of the month which the date belongs to (as of Hive 1.1.0). date is a string in the format 'yyyy-MM-dd HH:mm:ss' or 'yyyy-MM-dd'. The time part of date is ignored. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
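The last_day semantics described above map directly onto java.time. A small sketch (class name assumed; this is not the ticket's implementation):

```java
import java.time.LocalDate;
import java.time.temporal.TemporalAdjusters;

public class LastDaySketch {
    public static void main(String[] args) {
        // last_day for a non-leap February; the ticket's function also accepts
        // a 'yyyy-MM-dd HH:mm:ss' string and ignores the time part.
        LocalDate d = LocalDate.parse("2015-02-10");
        System.out.println(d.with(TemporalAdjusters.lastDayOfMonth())); // 2015-02-28
    }
}
```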
[jira] [Assigned] (SPARK-8196) date/time function: next_day
[ https://issues.apache.org/jira/browse/SPARK-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8196: --- Assignee: (was: Apache Spark) date/time function: next_day - Key: SPARK-8196 URL: https://issues.apache.org/jira/browse/SPARK-8196 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin next_day(string start_date, string day_of_week): string next_day(date start_date, string day_of_week): string Returns the first date which is later than start_date and named as day_of_week (as of Hive 1.2.0). start_date is a string/date/timestamp. day_of_week is 2 letters, 3 letters or full name of the day of the week (e.g. Mo, tue, FRIDAY). The time part of start_date is ignored. Example: next_day('2015-01-14', 'TU') = 2015-01-20. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8196) date/time function: next_day
[ https://issues.apache.org/jira/browse/SPARK-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8196: --- Assignee: Apache Spark date/time function: next_day - Key: SPARK-8196 URL: https://issues.apache.org/jira/browse/SPARK-8196 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark next_day(string start_date, string day_of_week): string next_day(date start_date, string day_of_week): string Returns the first date which is later than start_date and named as day_of_week (as of Hive 1.2.0). start_date is a string/date/timestamp. day_of_week is 2 letters, 3 letters or full name of the day of the week (e.g. Mo, tue, FRIDAY). The time part of start_date is ignored. Example: next_day('2015-01-14', 'TU') = 2015-01-20. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8196) date/time function: next_day
[ https://issues.apache.org/jira/browse/SPARK-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599339#comment-14599339 ] Apache Spark commented on SPARK-8196: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/6986 date/time function: next_day - Key: SPARK-8196 URL: https://issues.apache.org/jira/browse/SPARK-8196 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin next_day(string start_date, string day_of_week): string next_day(date start_date, string day_of_week): string Returns the first date which is later than start_date and named as day_of_week (as of Hive 1.2.0). start_date is a string/date/timestamp. day_of_week is 2 letters, 3 letters or full name of the day of the week (e.g. Mo, tue, FRIDAY). The time part of start_date is ignored. Example: next_day('2015-01-14', 'TU') = 2015-01-20. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
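The next_day behavior above ("first date later than start_date" with the named weekday) also has a direct java.time analogue via a strictly-after adjuster. A minimal sketch (class name assumed; weekday-abbreviation parsing such as 'TU' is omitted):

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.temporal.TemporalAdjusters;

public class NextDaySketch {
    public static void main(String[] args) {
        // next_day('2015-01-14', 'TU') = 2015-01-20: the first Tuesday
        // strictly after the given date (2015-01-14 is a Wednesday).
        LocalDate start = LocalDate.parse("2015-01-14");
        System.out.println(start.with(TemporalAdjusters.next(DayOfWeek.TUESDAY)));
    }
}
```

TemporalAdjusters.next is strictly-after, matching the "later than start_date" wording: if start is itself a Tuesday, the result is one week later.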
[jira] [Commented] (SPARK-8540) KMeans-based outlier detection
[ https://issues.apache.org/jira/browse/SPARK-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599188#comment-14599188 ] Gurjot Singh commented on SPARK-8540: - Can you please elaborate on what (b) does? Will it simply return the specified number of outliers/data points that are farthest from their cluster mean, even if they are not outliers in statistical terms? KMeans-based outlier detection -- Key: SPARK-8540 URL: https://issues.apache.org/jira/browse/SPARK-8540 Project: Spark Issue Type: Sub-task Components: ML Reporter: Joseph K. Bradley Original Estimate: 336h Remaining Estimate: 336h Proposal for K-Means-based outlier detection: * Cluster data using K-Means * Provide prediction/filtering functionality which returns outliers/anomalies ** This can take some threshold parameter which specifies either (a) how far off a point needs to be to be considered an outlier or (b) how many outliers should be returned. Note this will require a bit of API design, which should probably be posted and discussed on this JIRA before implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6747) Support List as a return type in Hive UDF
[ https://issues.apache.org/jira/browse/SPARK-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-6747: Affects Version/s: 1.4.0 Support List as a return type in Hive UDF --- Key: SPARK-6747 URL: https://issues.apache.org/jira/browse/SPARK-6747 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Takeshi Yamamuro Labels: 1.5.0 The current implementation can't handle List as a return type in Hive UDF. We assume a UDF like the one below: {code} public class UDFToListString extends UDF { public List<String> evaluate(Object o) { return Arrays.asList("xxx", "yyy", "zzz"); } } {code} A scala.MatchError is thrown as follows when the UDF is used: scala.MatchError: interface java.util.List (of class java.lang.Class) at org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174) at org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76) at org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106) at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106) at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131) at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95) at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:94) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) at scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278) ... To fix this problem, we need to add an entry for List in HiveInspectors#javaClassToDataType. However, this has one difficulty because of type erasure in the JVM. 
We assume that the lines below are appended in HiveInspectors#javaClassToDataType: {code} // list type case c: Class[_] if c == classOf[java.util.List[java.lang.Object]] => val tpe = c.getGenericInterfaces()(0).asInstanceOf[ParameterizedType] println(tpe.getActualTypeArguments()(0).toString()) // prints 'E' {code} This logic fails to catch the component type of the List. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6747) Support List as a return type in Hive UDF
[ https://issues.apache.org/jira/browse/SPARK-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-6747: Labels: 1.5.0 (was: ) Support List as a return type in Hive UDF --- Key: SPARK-6747 URL: https://issues.apache.org/jira/browse/SPARK-6747 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Takeshi Yamamuro Labels: 1.5.0 The current implementation can't handle List as a return type in Hive UDF. We assume a UDF like the one below: {code} public class UDFToListString extends UDF { public List<String> evaluate(Object o) { return Arrays.asList("xxx", "yyy", "zzz"); } } {code} A scala.MatchError is thrown as follows when the UDF is used: scala.MatchError: interface java.util.List (of class java.lang.Class) at org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174) at org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76) at org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106) at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106) at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131) at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95) at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:94) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) at scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278) ... To fix this problem, we need to add an entry for List in HiveInspectors#javaClassToDataType. However, this has one difficulty because of type erasure in the JVM. 
We assume that the lines below are appended in HiveInspectors#javaClassToDataType: {code} // list type case c: Class[_] if c == classOf[java.util.List[java.lang.Object]] => val tpe = c.getGenericInterfaces()(0).asInstanceOf[ParameterizedType] println(tpe.getActualTypeArguments()(0).toString()) // prints 'E' {code} This logic fails to catch the component type of the List. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
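The erasure problem described above can be seen with plain reflection: the runtime Class of a returned List has lost its type argument (only the type variable 'E' remains), while the declared return type of the UDF's evaluate method still carries it. A minimal, hypothetical demo (not the ticket's fix, just an illustration of where the type argument survives):

```java
import java.lang.reflect.Method;
import java.lang.reflect.ParameterizedType;
import java.util.Arrays;
import java.util.List;

public class ErasureDemo {
    public List<String> evaluate(Object o) { return Arrays.asList("xxx"); }

    public static void main(String[] args) throws Exception {
        // The runtime Class of the returned value has only the erased type
        // variable left, so it cannot tell us the element type.
        List<String> v = new ErasureDemo().evaluate(null);
        System.out.println(v.getClass().getTypeParameters()[0]); // the variable name, e.g. E

        // But the declared return type of the method still carries List<String>.
        Method m = ErasureDemo.class.getMethod("evaluate", Object.class);
        ParameterizedType t = (ParameterizedType) m.getGenericReturnType();
        System.out.println(t.getActualTypeArguments()[0]); // class java.lang.String
    }
}
```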
[jira] [Commented] (SPARK-8192) date/time function: current_date
[ https://issues.apache.org/jira/browse/SPARK-8192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599318#comment-14599318 ] Apache Spark commented on SPARK-8192: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/6985 date/time function: current_date Key: SPARK-8192 URL: https://issues.apache.org/jira/browse/SPARK-8192 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin current_date(): date Returns the current date at the start of query evaluation (as of Hive 1.2.0). All calls of current_date within the same query return the same value. We should just replace this with a date literal in the optimizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8193) date/time function: current_timestamp
[ https://issues.apache.org/jira/browse/SPARK-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8193: --- Assignee: Apache Spark date/time function: current_timestamp - Key: SPARK-8193 URL: https://issues.apache.org/jira/browse/SPARK-8193 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark current_timestamp(): timestamp Returns the current timestamp at the start of query evaluation (as of Hive 1.2.0). All calls of current_timestamp within the same query return the same value. We should just replace this with a timestamp literal in the optimizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8193) date/time function: current_timestamp
[ https://issues.apache.org/jira/browse/SPARK-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599319#comment-14599319 ] Apache Spark commented on SPARK-8193: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/6985 date/time function: current_timestamp - Key: SPARK-8193 URL: https://issues.apache.org/jira/browse/SPARK-8193 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin current_timestamp(): timestamp Returns the current timestamp at the start of query evaluation (as of Hive 1.2.0). All calls of current_timestamp within the same query return the same value. We should just replace this with a timestamp literal in the optimizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8192) date/time function: current_date
[ https://issues.apache.org/jira/browse/SPARK-8192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8192: --- Assignee: (was: Apache Spark) date/time function: current_date Key: SPARK-8192 URL: https://issues.apache.org/jira/browse/SPARK-8192 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin current_date(): date Returns the current date at the start of query evaluation (as of Hive 1.2.0). All calls of current_date within the same query return the same value. We should just replace this with a date literal in the optimizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8192) date/time function: current_date
[ https://issues.apache.org/jira/browse/SPARK-8192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8192: --- Assignee: Apache Spark date/time function: current_date Key: SPARK-8192 URL: https://issues.apache.org/jira/browse/SPARK-8192 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark current_date(): date Returns the current date at the start of query evaluation (as of Hive 1.2.0). All calls of current_date within the same query return the same value. We should just replace this with a date literal in the optimizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8193) date/time function: current_timestamp
[ https://issues.apache.org/jira/browse/SPARK-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8193: --- Assignee: (was: Apache Spark) date/time function: current_timestamp - Key: SPARK-8193 URL: https://issues.apache.org/jira/browse/SPARK-8193 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin current_timestamp(): timestamp Returns the current timestamp at the start of query evaluation (as of Hive 1.2.0). All calls of current_timestamp within the same query return the same value. We should just replace this with a timestamp literal in the optimizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
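[Editor's note] The SPARK-8192/8193 descriptions propose evaluating current_date/current_timestamp once and substituting a literal during optimization, so every call in a query sees the same value. Spark's actual optimizer rule lives in Catalyst (Scala); the following is only an illustrative Python sketch with a toy expression tree and hypothetical class names, not Spark code:

```python
import time

# Hypothetical stand-ins for Catalyst expression nodes.
class CurrentTimestamp:
    """A call that must be pinned to a single value per query."""

class Literal:
    def __init__(self, value):
        self.value = value

class Plan:
    def __init__(self, expressions):
        self.expressions = expressions

def replace_current_timestamp(plan):
    """Evaluate the timestamp once, at the start of optimization, and
    substitute the same Literal everywhere the call appears."""
    now = Literal(time.time())
    return Plan([now if isinstance(e, CurrentTimestamp) else e
                 for e in plan.expressions])

plan = Plan([CurrentTimestamp(), Literal(1), CurrentTimestamp()])
optimized = replace_current_timestamp(plan)
# Both occurrences now share the identical literal value.
assert optimized.expressions[0].value == optimized.expressions[2].value
```

The key property is that the substitution happens once per query plan, so repeated calls cannot observe different clock readings mid-query.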
[jira] [Assigned] (SPARK-8484) Add TrainValidationSplit to ml.tuning
[ https://issues.apache.org/jira/browse/SPARK-8484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8484: --- Assignee: Martin Zapletal (was: Apache Spark) Add TrainValidationSplit to ml.tuning - Key: SPARK-8484 URL: https://issues.apache.org/jira/browse/SPARK-8484 Project: Spark Issue Type: New Feature Components: ML Reporter: Xiangrui Meng Assignee: Martin Zapletal Add TrainValidationSplit for hyper-parameter tuning. It randomly splits the input dataset into training and validation sets and uses the evaluation metric on the validation set to select the best model. It should be similar to CrossValidator, but simpler and less expensive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8484) Add TrainValidationSplit to ml.tuning
[ https://issues.apache.org/jira/browse/SPARK-8484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600133#comment-14600133 ] Apache Spark commented on SPARK-8484: - User 'zapletal-martin' has created a pull request for this issue: https://github.com/apache/spark/pull/6996 Add TrainValidationSplit to ml.tuning - Key: SPARK-8484 URL: https://issues.apache.org/jira/browse/SPARK-8484 Project: Spark Issue Type: New Feature Components: ML Reporter: Xiangrui Meng Assignee: Martin Zapletal Add TrainValidationSplit for hyper-parameter tuning. It randomly splits the input dataset into training and validation sets and uses the evaluation metric on the validation set to select the best model. It should be similar to CrossValidator, but simpler and less expensive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
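[Editor's note] The TrainValidationSplit proposal amounts to: make one random train/validation split, fit each hyper-parameter candidate on the training portion, and keep the candidate that scores best on the held-out portion (cheaper than CrossValidator's k folds). A minimal Python sketch of that selection loop, using a toy model rather than ml.tuning's actual Estimator/Evaluator interfaces:

```python
import random

def train_validation_split(data, labels, ratio=0.75, seed=42):
    """Randomly split (x, y) pairs into training and validation sets."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    cut = int(len(idx) * ratio)
    train = [(data[i], labels[i]) for i in idx[:cut]]
    valid = [(data[i], labels[i]) for i in idx[cut:]]
    return train, valid

def select_best(candidates, train, valid, fit, score):
    """Fit each candidate once on the training split; keep the model with
    the best validation score. One split, unlike cross-validation."""
    best = None
    for params in candidates:
        model = fit(train, params)
        s = score(model, valid)
        if best is None or s > best[0]:
            best = (s, params, model)
    return best

# Toy model: predict y = w * x, "tuning" the weight w directly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # exactly y = 2x
train, valid = train_validation_split(xs, ys)
fit = lambda data, w: w  # fitting is a no-op for this toy
score = lambda w, data: -sum((w * x - y) ** 2 for x, y in data)
best_score, best_w, _ = select_best([0.5, 1.0, 2.0, 3.0], train, valid, fit, score)
assert best_w == 2.0  # the candidate matching the data wins
```

The design trade-off the JIRA mentions falls out directly: each candidate is trained once instead of k times, at the cost of a noisier estimate of validation performance.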
[jira] [Updated] (SPARK-8575) Deprecate callUDF in favor of udf
[ https://issues.apache.org/jira/browse/SPARK-8575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-8575: Shepherd: Michael Armbrust Deprecate callUDF in favor of udf - Key: SPARK-8575 URL: https://issues.apache.org/jira/browse/SPARK-8575 Project: Spark Issue Type: Improvement Components: SQL Reporter: Benjamin Fradet Assignee: Benjamin Fradet Priority: Minor Fix For: 1.5.0 Follow-up of [SPARK-8356|https://issues.apache.org/jira/browse/SPARK-8356] to use {{udf}} in favor of {{callUDF}} wherever possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8484) Add TrainValidationSplit to ml.tuning
[ https://issues.apache.org/jira/browse/SPARK-8484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8484: --- Assignee: Apache Spark (was: Martin Zapletal) Add TrainValidationSplit to ml.tuning - Key: SPARK-8484 URL: https://issues.apache.org/jira/browse/SPARK-8484 Project: Spark Issue Type: New Feature Components: ML Reporter: Xiangrui Meng Assignee: Apache Spark Add TrainValidationSplit for hyper-parameter tuning. It randomly splits the input dataset into training and validation sets and uses the evaluation metric on the validation set to select the best model. It should be similar to CrossValidator, but simpler and less expensive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8575) Deprecate callUDF in favor of udf
[ https://issues.apache.org/jira/browse/SPARK-8575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-8575: Assignee: Benjamin Fradet Deprecate callUDF in favor of udf - Key: SPARK-8575 URL: https://issues.apache.org/jira/browse/SPARK-8575 Project: Spark Issue Type: Improvement Components: SQL Reporter: Benjamin Fradet Assignee: Benjamin Fradet Priority: Minor Fix For: 1.5.0 Follow-up of [SPARK-8356|https://issues.apache.org/jira/browse/SPARK-8356] to use {{udf}} in favor of {{callUDF}} wherever possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5133) Feature Importance for Decision Tree (Ensembles)
[ https://issues.apache.org/jira/browse/SPARK-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5133: - Target Version/s: 1.5.0 Remaining Estimate: 168h Original Estimate: 168h Feature Importance for Decision Tree (Ensembles) Key: SPARK-5133 URL: https://issues.apache.org/jira/browse/SPARK-5133 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Peter Prettenhofer Original Estimate: 168h Remaining Estimate: 168h Add feature importance to decision tree model and tree ensemble models. If people are interested in this feature I could implement it given a mentor (API decisions, etc). Please find a description of the feature below: Decision trees intrinsically perform feature selection by selecting appropriate split points. This information can be used to assess the relative importance of a feature. Relative feature importance gives valuable insight into a decision tree or tree ensemble and can even be used for feature selection. More information on feature importance (via decrease in impurity) can be found in ESLII (10.13.1) or here [1]. R's randomForest package uses a different technique for assessing variable importance that is based on permutation tests. All necessary information to create relative importance scores should be available in the tree representation (class Node; split, impurity gain, (weighted) nr of samples?). [1] http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8445) MLlib 1.5 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-8445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-8445: - Description: We expect to see many MLlib contributors for the 1.5 release. To scale out the development, we created this master list for MLlib features we plan to have in Spark 1.5. Please view this list as a wish list rather than a concrete plan, because we don't have an accurate estimate of available resources. Due to limited review bandwidth, features appearing on this list will get higher priority during code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated. h1. Instructions h2. For contributors: * Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important. * If you are a first-time Spark contributor, please always start with a [starter task|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20labels%20%3D%20starter%20AND%20%22Target%20Version%2Fs%22%20%3D%201.5.0] rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delays in code review. * Never work silently. Let everyone know on the corresponding JIRA page when you start working on some features. This is to avoid duplicate work. For small features, you don't need to wait to get the JIRA assigned. * For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there is no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors. * Do not claim multiple (3) JIRAs at the same time. Try to finish them one after another. 
* Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps improve others' code as well as yours. h2. For committers: * Try to break down big features into small and specific JIRA tasks and link them properly. * Add starter label to starter tasks. * Put a rough estimate for medium/big features and track the progress. * After merging a PR, create and link JIRAs for Python, example code, and documentation if necessary. h1. Roadmap (WIP) This is NOT [a complete list of MLlib JIRAs for 1.5|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20%22Target%20Version%2Fs%22%20%3D%201.5.0%20ORDER%20BY%20priority%20DESC]. We only include umbrella JIRAs and high-level tasks. h2. Algorithms and performance * LDA improvements (SPARK-5572) * Log-linear model for survival analysis (SPARK-8518) * Improve GLM's scalability on number of features (SPARK-8520) * Tree and ensembles: Move + cleanup code (SPARK-7131), provide class probabilities (SPARK-3727), feature importance (SPARK-5133) * Improve GMM scalability and stability (SPARK-7206) * Frequent itemsets improvements (SPARK-7211) * R-like stats for ML models (SPARK-7674) h2. Pipeline API * more feature transformers (SPARK-8521) * k-means (SPARK-7879) * naive Bayes (SPARK-8600) h2. Model persistence * more PMML export (SPARK-8545) * model save/load (SPARK-4587) * pipeline persistence (SPARK-6725) h2. Python API for ML * List of issues identified during Spark 1.4 QA: (SPARK-7536) h2. SparkR API for ML h2. Documentation * [Search for documentation improvements | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(Documentation)%20AND%20component%20in%20(ML%2C%20MLlib)]
[jira] [Commented] (SPARK-5133) Feature Importance for Decision Tree (Ensembles)
[ https://issues.apache.org/jira/browse/SPARK-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600140#comment-14600140 ] Joseph K. Bradley commented on SPARK-5133: -- It's high time we add this to MLlib, so I'm adding this to the 1.5 roadmap. [~peter.prettenhofer] If you are still interested in this, please feel free to take it. Or if others are interested, please comment on this JIRA. The initial API should be quite simple; I'm imagining a single method returning importance for each feature, modeled after what R or other libraries return. I think we should calculate importance based on the learned model. The permutation test would be nice in the future but would be much more expensive (shuffling data). Feature Importance for Decision Tree (Ensembles) Key: SPARK-5133 URL: https://issues.apache.org/jira/browse/SPARK-5133 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Peter Prettenhofer Add feature importance to decision tree model and tree ensemble models. If people are interested in this feature I could implement it given a mentor (API decisions, etc). Please find a description of the feature below: Decision trees intrinsically perform feature selection by selecting appropriate split points. This information can be used to assess the relative importance of a feature. Relative feature importance gives valuable insight into a decision tree or tree ensemble and can even be used for feature selection. More information on feature importance (via decrease in impurity) can be found in ESLII (10.13.1) or here [1]. R's randomForest package uses a different technique for assessing variable importance that is based on permutation tests. All necessary information to create relative importance scores should be available in the tree representation (class Node; split, impurity gain, (weighted) nr of samples?). 
[1] http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
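[Editor's note] The SPARK-5133 discussion says all the needed information is already in the tree representation: each split's feature, its impurity gain, and the (weighted) number of samples reaching the node. A minimal Python sketch of the impurity-decrease importance described in the issue (the `Node` class and field names here are hypothetical stand-ins, not MLlib's actual tree classes):

```python
class Node:
    """Minimal stand-in for a fitted tree node: internal nodes record the
    split feature, the impurity decrease of the split, and the (weighted)
    number of training samples that reached them."""
    def __init__(self, feature=None, gain=0.0, n_samples=0, left=None, right=None):
        self.feature, self.gain, self.n_samples = feature, gain, n_samples
        self.left, self.right = left, right

def feature_importances(root, n_features):
    """Sum sample-weighted impurity decreases per split feature, then
    normalize so importances sum to 1 (the scikit-learn convention)."""
    total = root.n_samples
    imp = [0.0] * n_features
    stack = [root]
    while stack:
        node = stack.pop()
        if node.feature is not None:  # internal (split) node
            imp[node.feature] += node.n_samples / total * node.gain
            stack.extend(n for n in (node.left, node.right) if n)
    s = sum(imp)
    return [v / s for v in imp] if s > 0 else imp

# Tiny fitted tree: feature 0 splits at the root, feature 1 deeper down.
tree = Node(feature=0, gain=0.5, n_samples=100,
            left=Node(feature=1, gain=0.2, n_samples=60),
            right=Node(n_samples=40))
imps = feature_importances(tree, 2)
assert abs(sum(imps) - 1.0) < 1e-9
assert imps[0] > imps[1]  # the root split contributes more
```

For an ensemble, the natural extension is to average per-tree importance vectors; the permutation-test approach from R's randomForest would instead require re-scoring shuffled data, which is why the comment calls it much more expensive.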
[jira] [Updated] (SPARK-7244) Find vertex sequences satisfying predicates
[ https://issues.apache.org/jira/browse/SPARK-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7244: - Description: It would be useful to be able to search graphs (efficiently) based on a sequence of predicates, and to return matching contiguous subsequences of vertices. The returned info should probably be an RDD. This could also be called motif-finding. (was: It would be useful to be able to search graphs (efficiently) based on a sequence of predicates, and to return matching contiguous subsequences of vertices. The returned info should probably be an RDD.) Find vertex sequences satisfying predicates --- Key: SPARK-7244 URL: https://issues.apache.org/jira/browse/SPARK-7244 Project: Spark Issue Type: New Feature Components: GraphX Reporter: Joseph K. Bradley Priority: Minor It would be useful to be able to search graphs (efficiently) based on a sequence of predicates, and to return matching contiguous subsequences of vertices. The returned info should probably be an RDD. This could also be called motif-finding. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
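[Editor's note] The motif-finding idea in SPARK-7244 is: given a sequence of k vertex predicates, return every contiguous path v1 -> ... -> vk whose i-th vertex satisfies the i-th predicate. A single-machine Python sketch over a plain adjacency list (function and variable names are illustrative; a GraphX version would distribute this, e.g. as iterative message passing over edge triplets, returning the matches as an RDD):

```python
def find_vertex_sequences(adj, attrs, predicates):
    """Return all contiguous vertex sequences matching the predicate list
    via depth-first search. adj: vertex -> neighbor list; attrs: vertex ->
    attribute checked by each predicate."""
    k = len(predicates)
    results = []

    def extend(path):
        i = len(path)
        if i == k:
            results.append(list(path))
            return
        # First step may start anywhere; later steps must follow an edge.
        candidates = adj.get(path[-1], []) if path else list(adj)
        for v in candidates:
            if predicates[i](attrs[v]):
                path.append(v)
                extend(path)
                path.pop()

    extend([])
    return results

# Toy graph: a -> b -> c and a -> c; find user -> admin -> user chains.
adj = {"a": ["b", "c"], "b": ["c"], "c": []}
attrs = {"a": "user", "b": "admin", "c": "user"}
seqs = find_vertex_sequences(adj, attrs,
                             [lambda r: r == "user",
                              lambda r: r == "admin",
                              lambda r: r == "user"])
assert seqs == [["a", "b", "c"]]
```

The DFS makes the "contiguous subsequence" requirement concrete: each predicate after the first is only tested against neighbors of the previous match, so every result is an actual path in the graph.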