[jira] [Commented] (SPARK-14171) UDAF aggregates argument object inspector not parsed correctly

2016-10-27 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15614192#comment-15614192
 ] 

Song Jun commented on SPARK-14171:
--

This issue does not reproduce on the current Spark master branch, because 
Spark now implements its own percentile_approx (ApproximatePercentile.scala) instead of 
using Hive's, so the constant type check exception no longer occurs.

But there is another problem with the following SQL:
select percentile_approx(key, 0.9), count(distinct key), sum(distinct key) 
from src limit 1

Details are in [SPARK-18137].
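
A minimal sketch (assuming a spark-shell built from the current master branch and the usual Hive {{src}} test table) of the query that hits the new problem:
{code}
// Hedged reproduction sketch; `src` is the standard Hive test table.
spark.sql(
  "SELECT percentile_approx(key, 0.9), count(DISTINCT key), sum(DISTINCT key) " +
  "FROM src LIMIT 1").show()
{code}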

> UDAF aggregates argument object inspector not parsed correctly
> --
>
> Key: SPARK-14171
> URL: https://issues.apache.org/jira/browse/SPARK-14171
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Jianfeng Hu
>Priority: Critical
>
> For example, when using percentile_approx and count distinct together, it 
> raises an error complaining the argument is not constant. We have a test case 
> to reproduce. Could you help look into a fix of this? This was working in 
> previous version (Spark 1.4 + Hive 0.13). Thanks!
> {code}--- 
> a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDFSuite.scala
> +++ 
> b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDFSuite.scala
> @@ -148,6 +148,9 @@ class HiveUDFSuite extends QueryTest with 
> TestHiveSingleton with SQLTestUtils {
>  checkAnswer(sql("SELECT percentile_approx(100.0, array(0.9, 0.9)) FROM 
> src LIMIT 1"),
>sql("SELECT array(100, 100) FROM src LIMIT 1").collect().toSeq)
> +
> +checkAnswer(sql("SELECT percentile_approx(key, 0.9), count(distinct 
> key) FROM src LIMIT 1"),
> +  sql("SELECT max(key), 1 FROM src LIMIT 1").collect().toSeq)
> }
>test("UDFIntegerToString") {
> {code}
> When running the test suite, we can see this error:
> {code}
> - Generic UDAF aggregates *** FAILED ***
>   org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
> tree: 
> hiveudaffunction(HiveFunctionWrapper(org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox,org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox@6e1dc6a7),key#51176,0.9,false,0,0)
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:357)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:238)
>   at 
> org.apache.spark.sql.catalyst.analysis.DistinctAggregationRewriter.org$apache$spark$sql$catalyst$analysis$DistinctAggregationRewriter$$patchAggregateFunctionChildren$1(DistinctAggregationRewriter.scala:148)
>   at 
> org.apache.spark.sql.catalyst.analysis.DistinctAggregationRewriter$$anonfun$15.apply(DistinctAggregationRewriter.scala:192)
>   at 
> org.apache.spark.sql.catalyst.analysis.DistinctAggregationRewriter$$anonfun$15.apply(DistinctAggregationRewriter.scala:190)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   ...
>   Cause: java.lang.reflect.InvocationTargetException:
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$12.apply(TreeNode.scala:368)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$12.apply(TreeNode.scala:367)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:365)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:357)
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48)
>   ...
>   Cause: org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException: The second 
> argument must be a constant, but double was passed instead.
>   at 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox.getEvaluator(GenericUDAFPercentileApprox.java:147)
>   at 
> org.apache.spark.sql.hive.HiveUDAFFunction.functionAndInspector$lzycompute(hiveUDFs.scala:598)
>   at 
> 

[jira] [Assigned] (SPARK-18137) RewriteDistinctAggregates UnresolvedException when a UDAF has a foldable TypeCheck

2016-10-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18137:


Assignee: (was: Apache Spark)

> RewriteDistinctAggregates UnresolvedException when a UDAF has a foldable 
> TypeCheck
> --
>
> Key: SPARK-18137
> URL: https://issues.apache.org/jira/browse/SPARK-18137
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Song Jun
>
> When running a SQL query with distinct (on the Spark GitHub master branch), it throws an 
> UnresolvedException.
> For example, run a test case on Spark (branch master) with the following SQL:
> {noformat}
> SELECT percentile_approx(key, 0.9), count(distinct key),sum(distinct key) 
> FROM src LIMIT 1
> {noformat}
> and it throws this exception:
> {noformat}
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> dataType on unresolved object, tree: 'percentile_approx(CAST(src.`key` AS 
> DOUBLE), CAST(0.9BD AS DOUBLE), 1)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:92)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$nullify(RewriteDistinctAggregates.scala:261)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$evalWithinGroup$1(RewriteDistinctAggregates.scala:136)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:187)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:180)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.rewrite(RewriteDistinctAggregates.scala:180)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:105)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:104)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
>   at 
> 

[jira] [Commented] (SPARK-18137) RewriteDistinctAggregates UnresolvedException when a UDAF has a foldable TypeCheck

2016-10-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15614176#comment-15614176
 ] 

Apache Spark commented on SPARK-18137:
--

User 'windpiger' has created a pull request for this issue:
https://github.com/apache/spark/pull/15668

> RewriteDistinctAggregates UnresolvedException when a UDAF has a foldable 
> TypeCheck
> --
>
> Key: SPARK-18137
> URL: https://issues.apache.org/jira/browse/SPARK-18137
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Song Jun
>
> When running a SQL query with distinct (on the Spark GitHub master branch), it throws an 
> UnresolvedException.
> For example, run a test case on Spark (branch master) with the following SQL:
> {noformat}
> SELECT percentile_approx(key, 0.9), count(distinct key),sum(distinct key) 
> FROM src LIMIT 1
> {noformat}
> and it throws this exception:
> {noformat}
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> dataType on unresolved object, tree: 'percentile_approx(CAST(src.`key` AS 
> DOUBLE), CAST(0.9BD AS DOUBLE), 1)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:92)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$nullify(RewriteDistinctAggregates.scala:261)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$evalWithinGroup$1(RewriteDistinctAggregates.scala:136)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:187)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:180)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.rewrite(RewriteDistinctAggregates.scala:180)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:105)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:104)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> 

[jira] [Assigned] (SPARK-18137) RewriteDistinctAggregates UnresolvedException when a UDAF has a foldable TypeCheck

2016-10-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18137:


Assignee: Apache Spark

> RewriteDistinctAggregates UnresolvedException when a UDAF has a foldable 
> TypeCheck
> --
>
> Key: SPARK-18137
> URL: https://issues.apache.org/jira/browse/SPARK-18137
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Song Jun
>Assignee: Apache Spark
>
> When running a SQL query with distinct (on the Spark GitHub master branch), it throws an 
> UnresolvedException.
> For example, run a test case on Spark (branch master) with the following SQL:
> {noformat}
> SELECT percentile_approx(key, 0.9), count(distinct key),sum(distinct key) 
> FROM src LIMIT 1
> {noformat}
> and it throws this exception:
> {noformat}
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> dataType on unresolved object, tree: 'percentile_approx(CAST(src.`key` AS 
> DOUBLE), CAST(0.9BD AS DOUBLE), 1)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:92)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$nullify(RewriteDistinctAggregates.scala:261)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$evalWithinGroup$1(RewriteDistinctAggregates.scala:136)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:187)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:180)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.rewrite(RewriteDistinctAggregates.scala:180)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:105)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:104)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
>   at 
> 

[jira] [Updated] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-27 Thread Don Drake (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Don Drake updated SPARK-16845:
--
Attachment: error.txt.zip

Does this generated code help in resolving this?

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
> Attachments: error.txt.zip
>
>
> I have a wide table (400 columns); when I try fitting the training data on all 
> columns, the fatal error below occurs.
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)






[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-27 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15614100#comment-15614100
 ] 

Don Drake commented on SPARK-16845:
---

I'm struggling to create a simple reproduction case. 

I'm curious, though: if I compile my jar using sbt against Spark 2.0.1 but use 
your compiled Spark 2.1.0-SNAPSHOT branch at run time (spark-submit), would you 
expect it to work? 

When I use your compiled Spark 2.1.0-SNAPSHOT branch and run a spark-shell, the 
test cases provided in this JIRA pass, but my code still fails.

Also, the compiler error says the generated method "grows beyond 64 KB", yet the 
output is over 400 KB of generated source code. I'll try to attach the exact 
error message and the generated Java code.
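
A hedged sketch (not a confirmed reproduction) of the kind of wide-schema sort that tends to push the generated SpecificOrdering code toward the 64 KB method limit; the column names and sizes below are arbitrary:
{code}
// Build a ~400-column DataFrame and sort on every column, so codegen has to
// emit comparison code for the whole row.
val base = spark.range(1000).toDF("c0")
val wide = (1 until 400).foldLeft(base)((df, i) => df.withColumn(s"c$i", df("c0") + i))
wide.sort(wide.columns.map(wide(_)): _*).count()
{code}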


> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
>
> I have a wide table (400 columns); when I try fitting the training data on all 
> columns, the fatal error below occurs.
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)






[jira] [Updated] (SPARK-18055) Dataset.flatMap can't work with types from customized jar

2016-10-27 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-18055:

Attachment: test-jar_2.11-1.0.jar

This jar was built from the following file, MyData.scala:
{code}
case class MyData(array: Seq[MyData2])
case class MyData2(i: Int)
{code}
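
A hedged sketch (the project layout and usage below are assumptions) of how the attached jar might be exercised from spark-shell to try to reproduce the report below:
{code}
// Start the shell with the attached jar on the classpath:
//   spark-shell --jars test-jar_2.11-1.0.jar
import spark.implicits._

val ds = Seq(MyData(Seq(MyData2(1), MyData2(2)))).toDS()
ds.rdd.flatMap(_.array).collect()   // works on the RDD
ds.flatMap(_.array).collect()       // reported to fail with ScalaReflectionException
{code}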

> Dataset.flatMap can't work with types from customized jar
> -
>
> Key: SPARK-18055
> URL: https://issues.apache.org/jira/browse/SPARK-18055
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Davies Liu
> Attachments: test-jar_2.11-1.0.jar
>
>
> Trying to apply flatMap() on a Dataset column which is of type 
> com.A.B.
> Here's a schema of a dataset:
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- outputs: array (nullable = true)
>  ||-- element: string
> {code}
> flatMap works on the RDD:
> {code}
>  ds.rdd.flatMap(_.outputs)
> {code}
> flatMap doesn't work on the Dataset and gives the following error:
> {code}
> ds.flatMap(_.outputs)
> {code}
> The exception:
> {code}
> scala.ScalaReflectionException: class com.A.B in JavaMirror … not found
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123)
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22)
> at 
> line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51)
> at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
> at 
> org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125)
> at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49)
> at 
> org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125)
> {code}
> I spoke to Michael Armbrust and he confirmed it as a Dataset bug.
> There is a workaround using explode():
> {code}
> ds.select(explode(col("outputs")))
> {code}






[jira] [Created] (SPARK-18149) build side decision based on cbo

2016-10-27 Thread Zhenhua Wang (JIRA)
Zhenhua Wang created SPARK-18149:


 Summary: build side decision based on cbo
 Key: SPARK-18149
 URL: https://issues.apache.org/jira/browse/SPARK-18149
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Zhenhua Wang


Decide the build side of HashJoin based on CBO (cost-based optimization).






[jira] [Updated] (SPARK-17079) broadcast decision based on cbo

2016-10-27 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-17079:
-
Description: We decide if broadcast join should be used based on the 
cardinality and size of join input side rather than the initial size of the 
join base relation.   (was: We decide if broadcast join should be used based on 
the cardinality and size of join input side rather than the initial size of the 
join base relation. We also decide build side of HashJoin based on cbo.)

> broadcast decision based on cbo
> ---
>
> Key: SPARK-17079
> URL: https://issues.apache.org/jira/browse/SPARK-17079
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
>
> We decide if broadcast join should be used based on the cardinality and size 
> of join input side rather than the initial size of the join base relation. 






[jira] [Assigned] (SPARK-18107) Insert overwrite statement runs much slower in spark-sql than it does in hive-client

2016-10-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18107:


Assignee: Apache Spark

> Insert overwrite statement runs much slower in spark-sql than it does in 
> hive-client
> 
>
> Key: SPARK-18107
> URL: https://issues.apache.org/jira/browse/SPARK-18107
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: spark 2.0.0
> hive 2.0.1
>Reporter: J.P Feng
>Assignee: Apache Spark
>
> I find that an insert overwrite statement running in spark-sql or spark-shell spends 
> much more time than it does in hive-client (I start it with 
> apache-hive-2.0.1-bin/bin/hive): Spark takes about ten minutes while 
> hive-client takes less than 20 seconds.
> These are the steps I took.
> The test SQL is:
> insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21')
> select distinct account_name,role_id,server,'1476979200' as recdate, 'mix' as 
> platform, 'mix' as pid, 'mix' as dev from tbllog_login  where pt='mix_en' and 
>  dt='2016-10-21' ;
> There are 257128 rows of data in tbllog_login with 
> partition (pt='mix_en', dt='2016-10-21').
> PS:
> I'm sure it must be the "insert overwrite" itself that costs a lot of time in Spark; maybe 
> when doing the overwrite it needs to spend a lot of time in I/O or in something 
> else.
> I also compared the execution time of the insert overwrite statement and the 
> insert into statement.
> 1. insert overwrite statement vs. insert into statement in Spark:
> the insert overwrite statement takes about 10 minutes,
> the insert into statement takes about 30 seconds.
> 2. insert into statement in Spark vs. insert into statement in hive-client:
> Spark takes about 30 seconds,
> hive-client takes about 20 seconds;
> the difference is small enough to ignore.
>  
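
A rough timing sketch (assuming a spark-shell session where the tables from the description above exist; the helper below is hypothetical) for comparing the two statements:
{code}
// Crude wall-clock timing of a SQL command from spark-shell.
def timed(label: String)(sqlText: String): Unit = {
  val start = System.nanoTime()
  spark.sql(sqlText).collect()   // force execution in case the command is lazy
  println(s"$label took ${(System.nanoTime() - start) / 1e9} seconds")
}

timed("insert overwrite") {
  """insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21')
    |select distinct account_name,role_id,server,'1476979200' as recdate,
    |'mix' as platform, 'mix' as pid, 'mix' as dev
    |from tbllog_login where pt='mix_en' and dt='2016-10-21'""".stripMargin
}
{code}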






[jira] [Commented] (SPARK-18107) Insert overwrite statement runs much slower in spark-sql than it does in hive-client

2016-10-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15614009#comment-15614009
 ] 

Apache Spark commented on SPARK-18107:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/15667

> Insert overwrite statement runs much slower in spark-sql than it does in 
> hive-client
> 
>
> Key: SPARK-18107
> URL: https://issues.apache.org/jira/browse/SPARK-18107
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: spark 2.0.0
> hive 2.0.1
>Reporter: J.P Feng
>
> I find that an insert overwrite statement running in spark-sql or spark-shell spends 
> much more time than it does in hive-client (I start it with 
> apache-hive-2.0.1-bin/bin/hive): Spark takes about ten minutes while 
> hive-client takes less than 20 seconds.
> These are the steps I took.
> The test SQL is:
> insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21')
> select distinct account_name,role_id,server,'1476979200' as recdate, 'mix' as 
> platform, 'mix' as pid, 'mix' as dev from tbllog_login  where pt='mix_en' and 
>  dt='2016-10-21' ;
> There are 257128 rows of data in tbllog_login with 
> partition (pt='mix_en', dt='2016-10-21').
> PS:
> I'm sure it must be the "insert overwrite" itself that costs a lot of time in Spark; maybe 
> when doing the overwrite it needs to spend a lot of time in I/O or in something 
> else.
> I also compared the execution time of the insert overwrite statement and the 
> insert into statement.
> 1. insert overwrite statement vs. insert into statement in Spark:
> the insert overwrite statement takes about 10 minutes,
> the insert into statement takes about 30 seconds.
> 2. insert into statement in Spark vs. insert into statement in hive-client:
> Spark takes about 30 seconds,
> hive-client takes about 20 seconds;
> the difference is small enough to ignore.
>  






[jira] [Assigned] (SPARK-18107) Insert overwrite statement runs much slower in spark-sql than it does in hive-client

2016-10-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18107:


Assignee: (was: Apache Spark)

> Insert overwrite statement runs much slower in spark-sql than it does in 
> hive-client
> 
>
> Key: SPARK-18107
> URL: https://issues.apache.org/jira/browse/SPARK-18107
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: spark 2.0.0
> hive 2.0.1
>Reporter: J.P Feng
>
> I find that an insert overwrite statement running in spark-sql or spark-shell spends 
> much more time than it does in hive-client (I start it with 
> apache-hive-2.0.1-bin/bin/hive): Spark takes about ten minutes while 
> hive-client takes less than 20 seconds.
> These are the steps I took.
> The test SQL is:
> insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21')
> select distinct account_name,role_id,server,'1476979200' as recdate, 'mix' as 
> platform, 'mix' as pid, 'mix' as dev from tbllog_login  where pt='mix_en' and 
>  dt='2016-10-21' ;
> There are 257128 rows of data in tbllog_login with 
> partition (pt='mix_en', dt='2016-10-21').
> PS:
> I'm sure it must be the "insert overwrite" itself that costs a lot of time in Spark; maybe 
> when doing the overwrite it needs to spend a lot of time in I/O or in something 
> else.
> I also compared the execution time of the insert overwrite statement and the 
> insert into statement.
> 1. insert overwrite statement vs. insert into statement in Spark:
> the insert overwrite statement takes about 10 minutes,
> the insert into statement takes about 30 seconds.
> 2. insert into statement in Spark vs. insert into statement in hive-client:
> Spark takes about 30 seconds,
> hive-client takes about 20 seconds;
> the difference is small enough to ignore.
>  






[jira] [Created] (SPARK-18148) Misleading Error Message for Aggregation Without Window/GroupBy

2016-10-27 Thread Pat McDonough (JIRA)
Pat McDonough created SPARK-18148:
-

 Summary: Misleading Error Message for Aggregation Without 
Window/GroupBy
 Key: SPARK-18148
 URL: https://issues.apache.org/jira/browse/SPARK-18148
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
 Environment: Databricks
Reporter: Pat McDonough


The following error message points to a random column I'm not actually using in 
my query, making it hard to diagnose.

{code}
org.apache.spark.sql.AnalysisException: expression '`randomColumn`' is neither 
present in the group by, nor is it an aggregate function. Add to group by or 
wrap in first() (or first_value) if you don't care which value you get.;
{code}

Note in the code below, I forgot to add {{.over(weeklyWindow)}} in the line for 
{{withColumn("user_count"...}}

{code}
spark.read.load("/some-data")
  .withColumn("date_dt", to_date($"date"))
  .withColumn("year", year($"date_dt"))
  .withColumn("week", weekofyear($"date_dt"))
  .withColumn("user_count", count($"userId"))
  .withColumn("daily_max_in_week", max($"user_count").over(weeklyWindow))
{code}
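
For reference, a hedged sketch of the intended query with the window applied to the count as well; the weeklyWindow definition below is an assumption, since it is not shown above:
{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, max, to_date, weekofyear, year}
import spark.implicits._

// Assumed window: one frame per (year, week).
val weeklyWindow = Window.partitionBy($"year", $"week")

spark.read.load("/some-data")
  .withColumn("date_dt", to_date($"date"))
  .withColumn("year", year($"date_dt"))
  .withColumn("week", weekofyear($"date_dt"))
  .withColumn("user_count", count($"userId").over(weeklyWindow))
  .withColumn("daily_max_in_week", max($"user_count").over(weeklyWindow))
{code}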

CC: [~marmbrus]






[jira] [Commented] (SPARK-18107) Insert overwrite statement runs much slower in spark-sql than it does in hive-client

2016-10-27 Thread J.P Feng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613944#comment-15613944
 ] 

J.P Feng commented on SPARK-18107:
--

OK, that sounds good, thanks! I will give it a try later.

> Insert overwrite statement runs much slower in spark-sql than it does in 
> hive-client
> 
>
> Key: SPARK-18107
> URL: https://issues.apache.org/jira/browse/SPARK-18107
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: spark 2.0.0
> hive 2.0.1
>Reporter: J.P Feng
>
> I find that an insert overwrite statement running in spark-sql or spark-shell spends 
> much more time than it does in hive-client (I start it with 
> apache-hive-2.0.1-bin/bin/hive): Spark takes about ten minutes while 
> hive-client takes less than 20 seconds.
> These are the steps I took.
> The test SQL is:
> insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21')
> select distinct account_name,role_id,server,'1476979200' as recdate, 'mix' as 
> platform, 'mix' as pid, 'mix' as dev from tbllog_login  where pt='mix_en' and 
>  dt='2016-10-21' ;
> There are 257128 rows of data in tbllog_login with 
> partition (pt='mix_en', dt='2016-10-21').
> PS:
> I'm sure it must be the "insert overwrite" itself that costs a lot of time in Spark; maybe 
> when doing the overwrite it needs to spend a lot of time in I/O or in something 
> else.
> I also compared the execution time of the insert overwrite statement and the 
> insert into statement.
> 1. insert overwrite statement vs. insert into statement in Spark:
> the insert overwrite statement takes about 10 minutes,
> the insert into statement takes about 30 seconds.
> 2. insert into statement in Spark vs. insert into statement in hive-client:
> Spark takes about 30 seconds,
> hive-client takes about 20 seconds;
> the difference is small enough to ignore.
>  






[jira] [Comment Edited] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.

2016-10-27 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613846#comment-15613846
 ] 

Lianhui Wang edited comment on SPARK-15616 at 10/28/16 12:54 AM:
-

Yes, I think it can. But right now the PR is based on branch 2.0, not on the master branch; 
I will rebase it onto the master branch later. I think you can try this PR to 
do the broadcast join. Please send me any feedback. Thanks.


was (Author: lianhuiwang):
Yes, I think it can. But right now the PR is based on branch 2.0, not on the master branch; 
I will rebase it onto the master branch later.

> Metastore relation should fallback to HDFS size of partitions that are 
> involved in Query if statistics are not available.
> -
>
> Key: SPARK-15616
> URL: https://issues.apache.org/jira/browse/SPARK-15616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> Currently, if some partitions of a partitioned table are used in a join 
> operation, we rely on the table size returned by the Metastore to decide whether 
> we can convert the operation to a broadcast join. 
> If a Filter can prune some partitions, Hive prunes those partitions before 
> deciding whether to use a broadcast join, based on the HDFS size of the 
> partitions that are involved in the query. Spark SQL needs the same behavior, 
> which can improve join performance for partitioned tables.
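
A hedged illustration (the table names are hypothetical) of the case this targets: the partitioned side is filtered down to a single partition, so once statistics reflect the pruned HDFS size rather than the whole table's metastore size, that side becomes a broadcast candidate.
{code}
// Inspect the chosen join strategy for a partition-pruned input.
spark.sql(
  """select l.*, r.v
    |from partitioned_logs l join other_big_table r on l.k = r.k
    |where l.dt = '2016-10-21'""".stripMargin).explain()
{code}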






[jira] [Commented] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.

2016-10-27 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613846#comment-15613846
 ] 

Lianhui Wang commented on SPARK-15616:
--

Yes, I think it can. But right now the PR is based on branch 2.0, not on the master branch; 
I will rebase it onto the master branch later.

> Metastore relation should fallback to HDFS size of partitions that are 
> involved in Query if statistics are not available.
> -
>
> Key: SPARK-15616
> URL: https://issues.apache.org/jira/browse/SPARK-15616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> Currently, if some partitions of a partitioned table are used in a join 
> operation, we rely on the table size returned by the Metastore to decide whether 
> we can convert the operation to a broadcast join. 
> If a Filter can prune some partitions, Hive prunes those partitions before 
> deciding whether to use a broadcast join, based on the HDFS size of the 
> partitions that are involved in the query. Spark SQL needs the same behavior, 
> which can improve join performance for partitioned tables.






[jira] [Commented] (SPARK-13331) AES support for over-the-wire encryption

2016-10-27 Thread Junjie Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613837#comment-15613837
 ] 

Junjie Chen commented on SPARK-13331:
-

Hi [~vanzin]

Do we need more review? 

> AES support for over-the-wire encryption
> 
>
> Key: SPARK-13331
> URL: https://issues.apache.org/jira/browse/SPARK-13331
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: Dong Chen
>Priority: Minor
>
> In network/common, SASL with DIGEST-MD5 authentication is used for 
> negotiating a secure communication channel. When the SASL operation mode is 
> "auth-conf", the data transferred on the network is encrypted. The DIGEST-MD5 
> mechanism supports the following encryption algorithms: 3DES, DES, and RC4. The 
> negotiation procedure will select one of them to encrypt / decrypt the data on 
> the channel.
> However, 3DES and RC4 are relatively slow. We could add code to the 
> negotiation to make it support AES for better security and performance.
> The proposed solution is:
> When "auth-conf" is enabled, at the end of the original negotiation, the 
> authentication succeeds and a secure channel is built. We could add one more 
> negotiation step: the client and server negotiate whether they both support AES. 
> If yes, the key and IV used by AES will be generated by the server and sent to 
> the client through the already secure channel. Then the encryption / decryption 
> handlers are switched to AES on both the client and server side. Subsequent data 
> transfers will use AES instead of the original encryption algorithm.
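
A hedged sketch (plain JCE, not Spark's actual implementation) of the key material and cipher setup the proposal describes, where the server would generate the AES key and IV and send them to the client over the already-authenticated channel:
{code}
import java.security.SecureRandom
import javax.crypto.{Cipher, KeyGenerator}
import javax.crypto.spec.{IvParameterSpec, SecretKeySpec}

// Server side: generate the shared key and IV to be sent over the secure channel.
val keyGen = KeyGenerator.getInstance("AES")
keyGen.init(128)
val key = keyGen.generateKey()
val iv = new Array[Byte](16)
new SecureRandom().nextBytes(iv)

// Both sides: switch the channel handlers to an AES stream cipher.
val cipher = Cipher.getInstance("AES/CTR/NoPadding")
cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key.getEncoded, "AES"), new IvParameterSpec(iv))
val encrypted = cipher.doFinal("payload".getBytes("UTF-8"))
{code}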






[jira] [Updated] (SPARK-18121) Unable to query global temp views when hive support is enabled.

2016-10-27 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-18121:

Assignee: Sunitha Kambhampati

>  Unable to query global temp views when hive support is enabled. 
> -
>
> Key: SPARK-18121
> URL: https://issues.apache.org/jira/browse/SPARK-18121
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Sunitha Kambhampati
>Assignee: Sunitha Kambhampati
> Fix For: 2.1.0
>
>
> Querying a global temp view throws a "Table or view not found" exception when 
> Hive support is enabled. 
> Test case to reproduce the problem (it needs to run with Hive support enabled): 
> {code}
>   test("query global temp view") {
> val df = Seq(1).toDF("i1")
> df.createGlobalTempView("tbl1")
> checkAnswer(spark.sql("select * from global_temp.tbl1"), Row(1))
> spark.sql("drop view global_temp.tbl1")
>   }
> {code}
> Cause:
> HiveSessionCatalog.lookupRelation does not check for the global temp views. 






[jira] [Resolved] (SPARK-18121) Unable to query global temp views when hive support is enabled.

2016-10-27 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-18121.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15649
[https://github.com/apache/spark/pull/15649]

>  Unable to query global temp views when hive support is enabled. 
> -
>
> Key: SPARK-18121
> URL: https://issues.apache.org/jira/browse/SPARK-18121
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Sunitha Kambhampati
> Fix For: 2.1.0
>
>
> Querying a global temp view throws a "Table or view not found" exception when 
> Hive support is enabled. 
> Test case to reproduce the problem (it needs to run with Hive support enabled): 
> {code}
>   test("query global temp view") {
> val df = Seq(1).toDF("i1")
> df.createGlobalTempView("tbl1")
> checkAnswer(spark.sql("select * from global_temp.tbl1"), Row(1))
> spark.sql("drop view global_temp.tbl1")
>   }
> {code}
> Cause:
> HiveSessionCatalog.lookupRelation does not check for the global temp views. 






[jira] [Commented] (SPARK-11421) Add the ability to add a jar to the current class loader

2016-10-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613802#comment-15613802
 ] 

Apache Spark commented on SPARK-11421:
--

User 'mariusvniekerk' has created a pull request for this issue:
https://github.com/apache/spark/pull/15666

> Add the ability to add a jar to the current class loader
> 
>
> Key: SPARK-11421
> URL: https://issues.apache.org/jira/browse/SPARK-11421
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: holdenk
>Priority: Minor
>
> addJar adds jars for future operations, but it could also add them to the current 
> class loader. This would be really useful, most likely in Python & R, where 
> some included Python code may wish to add some jars.
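
A hedged sketch of the gap described above (the jar path and class name are hypothetical): sc.addJar ships the jar for tasks scheduled later, but the driver's current class loader does not pick it up, so reflective access on the driver still fails today.
{code}
sc.addJar("/tmp/extra-udfs.jar")            // hypothetical jar path
// What this issue asks for is that the line below would then succeed on the driver;
// today it typically throws ClassNotFoundException:
// Class.forName("com.example.ExtraUdf")    // hypothetical class from the jar
{code}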






[jira] [Updated] (SPARK-17153) [Structured streams] readStream ignores partition columns

2016-10-27 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-17153:
-
Labels: release_notes releasenotes  (was: release_notes)

> [Structured streams] readStream ignores partition columns
> -
>
> Key: SPARK-17153
> URL: https://issues.apache.org/jira/browse/SPARK-17153
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Dmitri Carpov
>Assignee: Liang-Chi Hsieh
>  Labels: release_notes, releasenotes
> Fix For: 2.0.2, 2.1.0
>
>
> When parquet files are persisted using partitions, spark's `readStream` 
> returns data with all `null`s for the partitioned columns.
> For example:
> {noformat}
> case class A(id: Int, value: Int)
> val data = spark.createDataset(Seq(
>   A(1, 1), 
>   A(2, 2), 
>   A(2, 3))
> )
> val url = "/mnt/databricks/test"
> data.write.partitionBy("id").parquet(url)
> {noformat}
> when data is read as stream:
> {noformat}
> spark.readStream.schema(spark.read.load(url).schema).parquet(url)
> {noformat}
> it reads:
> {noformat}
> id, value
> null, 1
> null, 2
> null, 3
> {noformat}
> A possible reason is that `readStream` reads the parquet files directly, but when those 
> are stored, the columns they are partitioned by are excluded from the files 
> themselves. In the given example the parquet files contain only the `value` column, 
> since `id` is the partition column.






[jira] [Comment Edited] (SPARK-18125) Spark generated code causes CompileException when groupByKey, reduceGroups and map(_._2) are used

2016-10-27 Thread Ray Qiu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613682#comment-15613682
 ] 

Ray Qiu edited comment on SPARK-18125 at 10/27/16 11:54 PM:


Try this in spark-shell:

case class Route(src: String, dest: String, cost: Int)
case class GroupedRoutes(src: String, dest: String, routes: Seq[Route])

val ds = sc.parallelize(Array(
Route("a", "b", 1),
Route("a", "b", 2),
Route("a", "c", 2),
Route("a", "d", 10),
Route("b", "a", 1),
Route("b", "a", 5),
Route("b", "c", 6))
  ).toDF.as[Route]

val grped = ds.map(r => GroupedRoutes(r.src, r.dest, Seq(r)))
  .groupByKey(r => (r.src, r.dest))
  .reduceGroups { (g1: GroupedRoutes, g2:  GroupedRoutes) =>
GroupedRoutes(g1.src, g1.dest, g1.routes ++ g2.routes)
  }.map(_._2)

Same thing works fine in 2.0.0

On Thu, Oct 27, 2016 at 4:06 PM, Herman van Hovell (JIRA) 




-- 
Regards,
Ray



was (Author: rayqiu):
Try this in spark-shell:

case class Route(src: String, dest: String, cost: Int)
case class GroupedRoutes(src: String, dest: String, routes: Seq[Route])

import spark.implicits._

val ds = sc.parallelize(Array(
Route("a", "b", 1),
Route("a", "b", 2),
Route("a", "c", 2),
Route("a", "d", 10),
Route("b", "a", 1),
Route("b", "a", 5),
Route("b", "c", 6))
  ).toDF.as[Route]

val grped = ds.map(r => GroupedRoutes(r.src, r.dest, Seq(r)))
  .groupByKey(r => (r.src, r.dest))
  .reduceGroups { (g1: GroupedRoutes, g2:  GroupedRoutes) =>
GroupedRoutes(g1.src, g1.dest, g1.routes ++ g2.routes)
  }.map(_._2)

Same thing works fine in 2.0.0

On Thu, Oct 27, 2016 at 4:06 PM, Herman van Hovell (JIRA) 




-- 
Regards,
Ray


> Spark generated code causes CompileException when groupByKey, reduceGroups 
> and map(_._2) are used
> -
>
> Key: SPARK-18125
> URL: https://issues.apache.org/jira/browse/SPARK-18125
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
>Reporter: Ray Qiu
>Priority: Critical
>
> Code logic looks like this:
> {noformat}
> .groupByKey
> .reduceGroups
> .map(_._2)
> {noformat}
> Works fine with 2.0.0.
> 2.0.1 error Message: 
> {noformat}
> Caused by: java.util.concurrent.ExecutionException: java.lang.Exception: 
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 206, Column 123: Unknown variable or type "value4"
> /* 001 */ public java.lang.Object generate(Object[] references) {
> /* 002 */   return new SpecificMutableProjection(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificMutableProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */   private MutableRow mutableRow;
> /* 009 */   private Object[] values;
> /* 010 */   private java.lang.String errMsg;
> /* 011 */   private java.lang.String errMsg1;
> /* 012 */   private boolean MapObjects_loopIsNull1;
> /* 013 */   private io.mistnet.analytics.lib.ConnLog MapObjects_loopValue0;
> /* 014 */   private java.lang.String errMsg2;
> /* 015 */   private Object[] values1;
> /* 016 */   private boolean MapObjects_loopIsNull3;
> /* 017 */   private java.lang.String MapObjects_loopValue2;
> /* 018 */   private boolean isNull_0;
> /* 019 */   private boolean value_0;
> /* 020 */   private boolean isNull_1;
> /* 021 */   private InternalRow value_1;
> /* 022 */
> /* 023 */   private void apply_4(InternalRow i) {
> /* 024 */
> /* 025 */ boolean isNull52 = MapObjects_loopIsNull1;
> /* 026 */ final double value52 = isNull52 ? -1.0 : 
> MapObjects_loopValue0.ts();
> /* 027 */ if (isNull52) {
> /* 028 */   values1[8] = null;
> /* 029 */ } else {
> /* 030 */   values1[8] = value52;
> /* 031 */ }
> /* 032 */ boolean isNull54 = MapObjects_loopIsNull1;
> /* 033 */ final java.lang.String value54 = isNull54 ? null : 
> (java.lang.String) MapObjects_loopValue0.uid();
> /* 034 */ isNull54 = value54 == null;
> /* 035 */ boolean isNull53 = isNull54;
> /* 036 */ final UTF8String value53 = isNull53 ? null : 
> org.apache.spark.unsafe.types.UTF8String.fromString(value54);
> /* 037 */ isNull53 = value53 == null;
> /* 038 */ if (isNull53) {
> /* 039 */   values1[9] = null;
> /* 040 */ } else {
> /* 041 */   values1[9] = value53;
> /* 042 */ }
> /* 043 */ boolean isNull56 = MapObjects_loopIsNull1;
> /* 044 */ final java.lang.String value56 = isNull56 ? null : 
> (java.lang.String) MapObjects_loopValue0.src();
> /* 045 */ isNull56 = value56 == null;
> /* 046 */ boolean isNull55 = isNull56;
> /* 047 */ final UTF8String value55 = isNull55 ? 

[jira] [Commented] (SPARK-17153) [Structured streams] readStream ignores partition columns

2016-10-27 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613705#comment-15613705
 ] 

Yin Huai commented on SPARK-17153:
--

This change needs a release note, because {{spark.readStream.json('/a.file')}} 
(creating a stream on a single file) will no longer work 
(https://github.com/apache/spark/pull/14803/files#diff-e82a44dc550d2a0a92e44d1ec2ecabccR52).
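
A hedged illustration of that point (the paths are hypothetical): after this change, readStream should be pointed at a directory rather than at a single file.
{code}
// Batch read once to obtain the schema, then stream from the directory.
val schema = spark.read.json("/data/events").schema
val stream = spark.readStream.schema(schema).json("/data/events")        // directory path: supported
// spark.readStream.schema(schema).json("/data/events/part-0000.json")   // single file: no longer works
{code}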

> [Structured streams] readStream ignores partition columns
> -
>
> Key: SPARK-17153
> URL: https://issues.apache.org/jira/browse/SPARK-17153
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Dmitri Carpov
>Assignee: Liang-Chi Hsieh
>  Labels: release_notes
> Fix For: 2.0.2, 2.1.0
>
>
> When parquet files are persisted using partitions, spark's `readStream` 
> returns data with all `null`s for the partitioned columns.
> For example:
> {noformat}
> case class A(id: Int, value: Int)
> val data = spark.createDataset(Seq(
>   A(1, 1), 
>   A(2, 2), 
>   A(2, 3))
> )
> val url = "/mnt/databricks/test"
> data.write.partitionBy("id").parquet(url)
> {noformat}
> when data is read as stream:
> {noformat}
> spark.readStream.schema(spark.read.load(url).schema).parquet(url)
> {noformat}
> it reads:
> {noformat}
> id, value
> null, 1
> null, 2
> null, 3
> {noformat}
> A possible reason is that `readStream` reads the parquet files directly, but when those 
> are stored, the columns they are partitioned by are excluded from the files 
> themselves. In the given example the parquet files contain only the `value` column, 
> since `id` is the partition column.






[jira] [Issue Comment Deleted] (SPARK-18125) Spark generated code causes CompileException when groupByKey, reduceGroups and map(_._2) are used

2016-10-27 Thread Ray Qiu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Qiu updated SPARK-18125:

Comment: was deleted

(was: Same thing works fine in 2.0.0





-- 
Regards,
Ray
)

> Spark generated code causes CompileException when groupByKey, reduceGroups 
> and map(_._2) are used
> -
>
> Key: SPARK-18125
> URL: https://issues.apache.org/jira/browse/SPARK-18125
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
>Reporter: Ray Qiu
>Priority: Critical
>
> Code logic looks like this:
> {noformat}
> .groupByKey
> .reduceGroups
> .map(_._2)
> {noformat}
> Works fine with 2.0.0.
> 2.0.1 error Message: 
> {noformat}
> Caused by: java.util.concurrent.ExecutionException: java.lang.Exception: 
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 206, Column 123: Unknown variable or type "value4"
> /* 001 */ public java.lang.Object generate(Object[] references) {
> /* 002 */   return new SpecificMutableProjection(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificMutableProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */   private MutableRow mutableRow;
> /* 009 */   private Object[] values;
> /* 010 */   private java.lang.String errMsg;
> /* 011 */   private java.lang.String errMsg1;
> /* 012 */   private boolean MapObjects_loopIsNull1;
> /* 013 */   private io.mistnet.analytics.lib.ConnLog MapObjects_loopValue0;
> /* 014 */   private java.lang.String errMsg2;
> /* 015 */   private Object[] values1;
> /* 016 */   private boolean MapObjects_loopIsNull3;
> /* 017 */   private java.lang.String MapObjects_loopValue2;
> /* 018 */   private boolean isNull_0;
> /* 019 */   private boolean value_0;
> /* 020 */   private boolean isNull_1;
> /* 021 */   private InternalRow value_1;
> /* 022 */
> /* 023 */   private void apply_4(InternalRow i) {
> /* 024 */
> /* 025 */ boolean isNull52 = MapObjects_loopIsNull1;
> /* 026 */ final double value52 = isNull52 ? -1.0 : 
> MapObjects_loopValue0.ts();
> /* 027 */ if (isNull52) {
> /* 028 */   values1[8] = null;
> /* 029 */ } else {
> /* 030 */   values1[8] = value52;
> /* 031 */ }
> /* 032 */ boolean isNull54 = MapObjects_loopIsNull1;
> /* 033 */ final java.lang.String value54 = isNull54 ? null : 
> (java.lang.String) MapObjects_loopValue0.uid();
> /* 034 */ isNull54 = value54 == null;
> /* 035 */ boolean isNull53 = isNull54;
> /* 036 */ final UTF8String value53 = isNull53 ? null : 
> org.apache.spark.unsafe.types.UTF8String.fromString(value54);
> /* 037 */ isNull53 = value53 == null;
> /* 038 */ if (isNull53) {
> /* 039 */   values1[9] = null;
> /* 040 */ } else {
> /* 041 */   values1[9] = value53;
> /* 042 */ }
> /* 043 */ boolean isNull56 = MapObjects_loopIsNull1;
> /* 044 */ final java.lang.String value56 = isNull56 ? null : 
> (java.lang.String) MapObjects_loopValue0.src();
> /* 045 */ isNull56 = value56 == null;
> /* 046 */ boolean isNull55 = isNull56;
> /* 047 */ final UTF8String value55 = isNull55 ? null : 
> org.apache.spark.unsafe.types.UTF8String.fromString(value56);
> /* 048 */ isNull55 = value55 == null;
> /* 049 */ if (isNull55) {
> /* 050 */   values1[10] = null;
> /* 051 */ } else {
> /* 052 */   values1[10] = value55;
> /* 053 */ }
> /* 054 */   }
> /* 055 */
> /* 056 */
> /* 057 */   private void apply_7(InternalRow i) {
> /* 058 */
> /* 059 */ boolean isNull69 = MapObjects_loopIsNull1;
> /* 060 */ final scala.Option value69 = isNull69 ? null : (scala.Option) 
> MapObjects_loopValue0.orig_bytes();
> /* 061 */ isNull69 = value69 == null;
> /* 062 */
> /* 063 */ final boolean isNull68 = isNull69 || value69.isEmpty();
> /* 064 */ long value68 = isNull68 ?
> /* 065 */ -1L : (Long) value69.get();
> /* 066 */ if (isNull68) {
> /* 067 */   values1[17] = null;
> /* 068 */ } else {
> /* 069 */   values1[17] = value68;
> /* 070 */ }
> /* 071 */ boolean isNull71 = MapObjects_loopIsNull1;
> /* 072 */ final scala.Option value71 = isNull71 ? null : (scala.Option) 
> MapObjects_loopValue0.resp_bytes();
> /* 073 */ isNull71 = value71 == null;
> /* 074 */
> /* 075 */ final boolean isNull70 = isNull71 || value71.isEmpty();
> /* 076 */ long value70 = isNull70 ?
> /* 077 */ -1L : (Long) value71.get();
> /* 078 */ if (isNull70) {
> /* 079 */   values1[18] = null;
> /* 080 */ } else {
> /* 081 */   values1[18] = value70;
> /* 082 */ }
> /* 083 */ boolean isNull74 = MapObjects_loopIsNull1;
> /* 084 */   

[jira] [Comment Edited] (SPARK-18125) Spark generated code causes CompileException when groupByKey, reduceGroups and map(_._2) are used

2016-10-27 Thread Ray Qiu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613682#comment-15613682
 ] 

Ray Qiu edited comment on SPARK-18125 at 10/27/16 11:54 PM:


Try this in spark-shell:

case class Route(src: String, dest: String, cost: Int)
case class GroupedRoutes(src: String, dest: String, routes: Seq[Route])

import spark.implicits._

val ds = sc.parallelize(Array(
Route("a", "b", 1),
Route("a", "b", 2),
Route("a", "c", 2),
Route("a", "d", 10),
Route("b", "a", 1),
Route("b", "a", 5),
Route("b", "c", 6))
  ).toDF.as[Route]

val grped = ds.map(r => GroupedRoutes(r.src, r.dest, Seq(r)))
  .groupByKey(r => (r.src, r.dest))
  .reduceGroups { (g1: GroupedRoutes, g2:  GroupedRoutes) =>
GroupedRoutes(g1.src, g1.dest, g1.routes ++ g2.routes)
  }.map(_._2)

Same thing works fine in 2.0.0

On Thu, Oct 27, 2016 at 4:06 PM, Herman van Hovell (JIRA) 




-- 
Regards,
Ray



was (Author: rayqiu):
Try this in spark-shell:

case class Route(src: String, dest: String, cost: Int)
case class GroupedRoutes(src: String, dest: String, routes: Seq[Route])

import spark.implicits._

val ds = sc.parallelize(Array(
Route("a", "b", 1),
Route("a", "b", 2),
Route("a", "c", 2),
Route("a", "d", 10),
Route("b", "a", 1),
Route("b", "a", 5),
Route("b", "c", 6))
  ).toDF.as[Route]

val grped = ds.map(r => GroupedRoutes(r.src, r.dest, Seq(r)))
  .groupByKey(r => (r.src, r.dest))
  .reduceGroups { (g1: GroupedRoutes, g2:  GroupedRoutes) =>
GroupedRoutes(g1.src, g1.dest, g1.routes ++ g2.routes)
  }.map(_._2)

On Thu, Oct 27, 2016 at 4:06 PM, Herman van Hovell (JIRA) 




-- 
Regards,
Ray


> Spark generated code causes CompileException when groupByKey, reduceGroups 
> and map(_._2) are used
> -
>
> Key: SPARK-18125
> URL: https://issues.apache.org/jira/browse/SPARK-18125
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
>Reporter: Ray Qiu
>Priority: Critical
>
> Code logic looks like this:
> {noformat}
> .groupByKey
> .reduceGroups
> .map(_._2)
> {noformat}
> Works fine with 2.0.0.
> 2.0.1 error Message: 
> {noformat}
> Caused by: java.util.concurrent.ExecutionException: java.lang.Exception: 
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 206, Column 123: Unknown variable or type "value4"
> /* 001 */ public java.lang.Object generate(Object[] references) {
> /* 002 */   return new SpecificMutableProjection(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificMutableProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */   private MutableRow mutableRow;
> /* 009 */   private Object[] values;
> /* 010 */   private java.lang.String errMsg;
> /* 011 */   private java.lang.String errMsg1;
> /* 012 */   private boolean MapObjects_loopIsNull1;
> /* 013 */   private io.mistnet.analytics.lib.ConnLog MapObjects_loopValue0;
> /* 014 */   private java.lang.String errMsg2;
> /* 015 */   private Object[] values1;
> /* 016 */   private boolean MapObjects_loopIsNull3;
> /* 017 */   private java.lang.String MapObjects_loopValue2;
> /* 018 */   private boolean isNull_0;
> /* 019 */   private boolean value_0;
> /* 020 */   private boolean isNull_1;
> /* 021 */   private InternalRow value_1;
> /* 022 */
> /* 023 */   private void apply_4(InternalRow i) {
> /* 024 */
> /* 025 */ boolean isNull52 = MapObjects_loopIsNull1;
> /* 026 */ final double value52 = isNull52 ? -1.0 : 
> MapObjects_loopValue0.ts();
> /* 027 */ if (isNull52) {
> /* 028 */   values1[8] = null;
> /* 029 */ } else {
> /* 030 */   values1[8] = value52;
> /* 031 */ }
> /* 032 */ boolean isNull54 = MapObjects_loopIsNull1;
> /* 033 */ final java.lang.String value54 = isNull54 ? null : 
> (java.lang.String) MapObjects_loopValue0.uid();
> /* 034 */ isNull54 = value54 == null;
> /* 035 */ boolean isNull53 = isNull54;
> /* 036 */ final UTF8String value53 = isNull53 ? null : 
> org.apache.spark.unsafe.types.UTF8String.fromString(value54);
> /* 037 */ isNull53 = value53 == null;
> /* 038 */ if (isNull53) {
> /* 039 */   values1[9] = null;
> /* 040 */ } else {
> /* 041 */   values1[9] = value53;
> /* 042 */ }
> /* 043 */ boolean isNull56 = MapObjects_loopIsNull1;
> /* 044 */ final java.lang.String value56 = isNull56 ? null : 
> (java.lang.String) MapObjects_loopValue0.src();
> /* 045 */ isNull56 = value56 == null;
> /* 046 */ boolean isNull55 = isNull56;
> /* 047 */ final UTF8String value55 = isNull55 ? null : 
> 

[jira] [Updated] (SPARK-17153) [Structured streams] readStream ignores partition columns

2016-10-27 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-17153:
-
Labels: release_notes  (was: )

> [Structured streams] readStream ignores partition columns
> -
>
> Key: SPARK-17153
> URL: https://issues.apache.org/jira/browse/SPARK-17153
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Dmitri Carpov
>Assignee: Liang-Chi Hsieh
>  Labels: release_notes
> Fix For: 2.0.2, 2.1.0
>
>
> When parquet files are persisted using partitions, spark's `readStream` 
> returns data with all `null`s for the partitioned columns.
> For example:
> {noformat}
> case class A(id: Int, value: Int)
> val data = spark.createDataset(Seq(
>   A(1, 1), 
>   A(2, 2), 
>   A(2, 3))
> )
> val url = "/mnt/databricks/test"
> data.write.partitionBy("id").parquet(url)
> {noformat}
> when the data is read back as a stream:
> {noformat}
> spark.readStream.schema(spark.read.load(url).schema).parquet(url)
> {noformat}
> it returns:
> {noformat}
> id, value
> null, 1
> null, 2
> null, 3
> {noformat}
> A possible reason is that `readStream` reads the parquet files directly, but 
> when those files are written, the columns they are partitioned by are encoded 
> in the directory structure rather than stored in the files themselves. In the 
> given example the parquet files contain only the `value` column, since `id` 
> is a partition column.
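For contrast, a minimal sketch (assuming a Spark 2.0.x session and the same 
directory layout as in the example) of how the batch reader recovers `id` from 
the partition directories while, as reported above, the streaming reader only 
sees what is stored inside the files:
{noformat}
// Hedged sketch, not part of the original report: the batch reader resolves
// partition columns from the .../id=1/, .../id=2/ directory layout, while
// readStream on affected versions reads only the columns physically stored
// in the parquet files.
val url = "/mnt/databricks/test"       // same path as in the example above
val batchDf = spark.read.parquet(url)  // schema: id (from directories), value
val streamDf = spark.readStream
  .schema(batchDf.schema)              // schema includes id, but id is reported to come back null
  .parquet(url)
{noformat}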






[jira] [Commented] (SPARK-18125) Spark generated code causes CompileException when groupByKey, reduceGroups and map(_._2) are used

2016-10-27 Thread Ray Qiu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613704#comment-15613704
 ] 

Ray Qiu commented on SPARK-18125:
-

Same thing works fine in 2.0.0





-- 
Regards,
Ray


> Spark generated code causes CompileException when groupByKey, reduceGroups 
> and map(_._2) are used
> -
>
> Key: SPARK-18125
> URL: https://issues.apache.org/jira/browse/SPARK-18125
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
>Reporter: Ray Qiu
>Priority: Critical
>
> Code logic looks like this:
> {noformat}
> .groupByKey
> .reduceGroups
> .map(_._2)
> {noformat}
> Works fine with 2.0.0.
> 2.0.1 error Message: 
> {noformat}
> Caused by: java.util.concurrent.ExecutionException: java.lang.Exception: 
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 206, Column 123: Unknown variable or type "value4"
> /* 001 */ public java.lang.Object generate(Object[] references) {
> /* 002 */   return new SpecificMutableProjection(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificMutableProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */   private MutableRow mutableRow;
> /* 009 */   private Object[] values;
> /* 010 */   private java.lang.String errMsg;
> /* 011 */   private java.lang.String errMsg1;
> /* 012 */   private boolean MapObjects_loopIsNull1;
> /* 013 */   private io.mistnet.analytics.lib.ConnLog MapObjects_loopValue0;
> /* 014 */   private java.lang.String errMsg2;
> /* 015 */   private Object[] values1;
> /* 016 */   private boolean MapObjects_loopIsNull3;
> /* 017 */   private java.lang.String MapObjects_loopValue2;
> /* 018 */   private boolean isNull_0;
> /* 019 */   private boolean value_0;
> /* 020 */   private boolean isNull_1;
> /* 021 */   private InternalRow value_1;
> /* 022 */
> /* 023 */   private void apply_4(InternalRow i) {
> /* 024 */
> /* 025 */ boolean isNull52 = MapObjects_loopIsNull1;
> /* 026 */ final double value52 = isNull52 ? -1.0 : 
> MapObjects_loopValue0.ts();
> /* 027 */ if (isNull52) {
> /* 028 */   values1[8] = null;
> /* 029 */ } else {
> /* 030 */   values1[8] = value52;
> /* 031 */ }
> /* 032 */ boolean isNull54 = MapObjects_loopIsNull1;
> /* 033 */ final java.lang.String value54 = isNull54 ? null : 
> (java.lang.String) MapObjects_loopValue0.uid();
> /* 034 */ isNull54 = value54 == null;
> /* 035 */ boolean isNull53 = isNull54;
> /* 036 */ final UTF8String value53 = isNull53 ? null : 
> org.apache.spark.unsafe.types.UTF8String.fromString(value54);
> /* 037 */ isNull53 = value53 == null;
> /* 038 */ if (isNull53) {
> /* 039 */   values1[9] = null;
> /* 040 */ } else {
> /* 041 */   values1[9] = value53;
> /* 042 */ }
> /* 043 */ boolean isNull56 = MapObjects_loopIsNull1;
> /* 044 */ final java.lang.String value56 = isNull56 ? null : 
> (java.lang.String) MapObjects_loopValue0.src();
> /* 045 */ isNull56 = value56 == null;
> /* 046 */ boolean isNull55 = isNull56;
> /* 047 */ final UTF8String value55 = isNull55 ? null : 
> org.apache.spark.unsafe.types.UTF8String.fromString(value56);
> /* 048 */ isNull55 = value55 == null;
> /* 049 */ if (isNull55) {
> /* 050 */   values1[10] = null;
> /* 051 */ } else {
> /* 052 */   values1[10] = value55;
> /* 053 */ }
> /* 054 */   }
> /* 055 */
> /* 056 */
> /* 057 */   private void apply_7(InternalRow i) {
> /* 058 */
> /* 059 */ boolean isNull69 = MapObjects_loopIsNull1;
> /* 060 */ final scala.Option value69 = isNull69 ? null : (scala.Option) 
> MapObjects_loopValue0.orig_bytes();
> /* 061 */ isNull69 = value69 == null;
> /* 062 */
> /* 063 */ final boolean isNull68 = isNull69 || value69.isEmpty();
> /* 064 */ long value68 = isNull68 ?
> /* 065 */ -1L : (Long) value69.get();
> /* 066 */ if (isNull68) {
> /* 067 */   values1[17] = null;
> /* 068 */ } else {
> /* 069 */   values1[17] = value68;
> /* 070 */ }
> /* 071 */ boolean isNull71 = MapObjects_loopIsNull1;
> /* 072 */ final scala.Option value71 = isNull71 ? null : (scala.Option) 
> MapObjects_loopValue0.resp_bytes();
> /* 073 */ isNull71 = value71 == null;
> /* 074 */
> /* 075 */ final boolean isNull70 = isNull71 || value71.isEmpty();
> /* 076 */ long value70 = isNull70 ?
> /* 077 */ -1L : (Long) value71.get();
> /* 078 */ if (isNull70) {
> /* 079 */   values1[18] = null;
> /* 080 */ } else {
> /* 081 */   values1[18] = value70;
> /* 082 */ }
> /* 083 */ boolean isNull74 = MapObjects_loopIsNull1;
> /* 

[jira] [Commented] (SPARK-18125) Spark generated code causes CompileException when groupByKey, reduceGroups and map(_._2) are used

2016-10-27 Thread Ray Qiu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613682#comment-15613682
 ] 

Ray Qiu commented on SPARK-18125:
-

Try this in spark-shell:

case class Route(src: String, dest: String, cost: Int)
case class GroupedRoutes(src: String, dest: String, routes: Seq[Route])

import spark.implicits._

val ds = sc.parallelize(Array(
Route("a", "b", 1),
Route("a", "b", 2),
Route("a", "c", 2),
Route("a", "d", 10),
Route("b", "a", 1),
Route("b", "a", 5),
Route("b", "c", 6))
  ).toDF.as[Route]

val grped = ds.map(r => GroupedRoutes(r.src, r.dest, Seq(r)))
  .groupByKey(r => (r.src, r.dest))
  .reduceGroups { (g1: GroupedRoutes, g2:  GroupedRoutes) =>
GroupedRoutes(g1.src, g1.dest, g1.routes ++ g2.routes)
  }.map(_._2)

On Thu, Oct 27, 2016 at 4:06 PM, Herman van Hovell (JIRA) 




-- 
Regards,
Ray
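
A hedged sketch of an equivalent formulation for the snippet above, reusing the 
same Route and GroupedRoutes case classes: mapGroups folds each (src, dest) 
group directly, avoiding the reduceGroups(...).map(_._2) chain shown in the 
error; whether that also sidesteps the codegen failure is an assumption, not 
something verified here.
{noformat}
// Hedged alternative, not a confirmed workaround: build each group's
// GroupedRoutes with mapGroups instead of reduceGroups(...).map(_._2).
// `ds` is the Dataset[Route] defined in the snippet above.
val grpedAlt = ds
  .map(r => GroupedRoutes(r.src, r.dest, Seq(r)))
  .groupByKey(r => (r.src, r.dest))
  .mapGroups((key, groups) =>
    groups.reduce((g1, g2) => GroupedRoutes(g1.src, g1.dest, g1.routes ++ g2.routes)))
{noformat}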


> Spark generated code causes CompileException when groupByKey, reduceGroups 
> and map(_._2) are used
> -
>
> Key: SPARK-18125
> URL: https://issues.apache.org/jira/browse/SPARK-18125
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
>Reporter: Ray Qiu
>Priority: Critical
>
> Code logic looks like this:
> {noformat}
> .groupByKey
> .reduceGroups
> .map(_._2)
> {noformat}
> Works fine with 2.0.0.
> 2.0.1 error Message: 
> {noformat}
> Caused by: java.util.concurrent.ExecutionException: java.lang.Exception: 
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 206, Column 123: Unknown variable or type "value4"
> /* 001 */ public java.lang.Object generate(Object[] references) {
> /* 002 */   return new SpecificMutableProjection(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificMutableProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */   private MutableRow mutableRow;
> /* 009 */   private Object[] values;
> /* 010 */   private java.lang.String errMsg;
> /* 011 */   private java.lang.String errMsg1;
> /* 012 */   private boolean MapObjects_loopIsNull1;
> /* 013 */   private io.mistnet.analytics.lib.ConnLog MapObjects_loopValue0;
> /* 014 */   private java.lang.String errMsg2;
> /* 015 */   private Object[] values1;
> /* 016 */   private boolean MapObjects_loopIsNull3;
> /* 017 */   private java.lang.String MapObjects_loopValue2;
> /* 018 */   private boolean isNull_0;
> /* 019 */   private boolean value_0;
> /* 020 */   private boolean isNull_1;
> /* 021 */   private InternalRow value_1;
> /* 022 */
> /* 023 */   private void apply_4(InternalRow i) {
> /* 024 */
> /* 025 */ boolean isNull52 = MapObjects_loopIsNull1;
> /* 026 */ final double value52 = isNull52 ? -1.0 : 
> MapObjects_loopValue0.ts();
> /* 027 */ if (isNull52) {
> /* 028 */   values1[8] = null;
> /* 029 */ } else {
> /* 030 */   values1[8] = value52;
> /* 031 */ }
> /* 032 */ boolean isNull54 = MapObjects_loopIsNull1;
> /* 033 */ final java.lang.String value54 = isNull54 ? null : 
> (java.lang.String) MapObjects_loopValue0.uid();
> /* 034 */ isNull54 = value54 == null;
> /* 035 */ boolean isNull53 = isNull54;
> /* 036 */ final UTF8String value53 = isNull53 ? null : 
> org.apache.spark.unsafe.types.UTF8String.fromString(value54);
> /* 037 */ isNull53 = value53 == null;
> /* 038 */ if (isNull53) {
> /* 039 */   values1[9] = null;
> /* 040 */ } else {
> /* 041 */   values1[9] = value53;
> /* 042 */ }
> /* 043 */ boolean isNull56 = MapObjects_loopIsNull1;
> /* 044 */ final java.lang.String value56 = isNull56 ? null : 
> (java.lang.String) MapObjects_loopValue0.src();
> /* 045 */ isNull56 = value56 == null;
> /* 046 */ boolean isNull55 = isNull56;
> /* 047 */ final UTF8String value55 = isNull55 ? null : 
> org.apache.spark.unsafe.types.UTF8String.fromString(value56);
> /* 048 */ isNull55 = value55 == null;
> /* 049 */ if (isNull55) {
> /* 050 */   values1[10] = null;
> /* 051 */ } else {
> /* 052 */   values1[10] = value55;
> /* 053 */ }
> /* 054 */   }
> /* 055 */
> /* 056 */
> /* 057 */   private void apply_7(InternalRow i) {
> /* 058 */
> /* 059 */ boolean isNull69 = MapObjects_loopIsNull1;
> /* 060 */ final scala.Option value69 = isNull69 ? null : (scala.Option) 
> MapObjects_loopValue0.orig_bytes();
> /* 061 */ isNull69 = value69 == null;
> /* 062 */
> /* 063 */ final boolean isNull68 = isNull69 || value69.isEmpty();
> /* 064 */ long value68 = isNull68 ?
> /* 065 */ -1L : (Long) value69.get();
> /* 066 */ if (isNull68) {
> /* 067 */   values1[17] = null;
> /* 068 */ 

[jira] [Comment Edited] (SPARK-16648) LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException

2016-10-27 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613656#comment-15613656
 ] 

Emlyn Corrin edited comment on SPARK-16648 at 10/27/16 11:33 PM:
-

Since Spark 2.0.1, the following pyspark snippet fails (I believe it worked 
under 2.0.0, so this issue seems like the most likely cause of the change in 
behaviour):
{code}
from pyspark.sql import functions as F
ds = spark.createDataFrame(sc.parallelize([[1, 1, 2], [1, 2, 3], [1, 3, 4]]))
ds.groupBy(ds._1).agg(F.first(ds._2), F.countDistinct(ds._2), 
F.countDistinct(ds._2, ds._3)).show()
{code}
It works if any of the three arguments to {{.agg}} is removed.
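
For reference, a hedged Scala sketch of the same aggregation (assuming, based 
on the Catalyst stack trace below, that the failure sits in the 
RewriteDistinctAggregates rule rather than anything PySpark-specific; the Scala 
form has not been separately verified here):
{code}
// Hedged Scala equivalent of the pyspark snippet above, for spark-shell:
// first() combined with one- and two-column countDistinct in a single agg.
import spark.implicits._
import org.apache.spark.sql.functions.{first, countDistinct}

val ds = Seq((1, 1, 2), (1, 2, 3), (1, 3, 4)).toDF("_1", "_2", "_3")
ds.groupBy($"_1")
  .agg(first($"_2"), countDistinct($"_2"), countDistinct($"_2", $"_3"))
  .show()
{code}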

The stack trace is:
{code}
Py4JJavaError Traceback (most recent call last)
 in ()
> 1 
ds.groupBy(ds._1).agg(F.first(ds._2),F.countDistinct(ds._2),F.countDistinct(ds._2,
 ds._3)).show()

/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/dataframe.py in 
show(self, n, truncate)
285 +---+-+
286 """
--> 287 print(self._jdf.showString(n, truncate))
288
289 def __repr__(self):

/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py
 in __call__(self, *args)
   1131 answer = self.gateway_client.send_command(command)
   1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
   1134
   1135 for temp_arg in temp_args:

/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py in 
deco(*a, **kw)
 61 def deco(*a, **kw):
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:
 65 s = e.java_exception.toString()

/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py
 in get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(

Py4JJavaError: An error occurred while calling o76.showString.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
tree: first(_2#1L)()
at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:387)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:256)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$patchAggregateFunctionChildren$1(RewriteDistinctAggregates.scala:140)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:182)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:180)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.rewrite(RewriteDistinctAggregates.scala:180)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:105)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:104)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)

[jira] [Commented] (SPARK-16648) LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException

2016-10-27 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613656#comment-15613656
 ] 

Emlyn Corrin commented on SPARK-16648:
--

Since Spark 2.0.1, the following snippet fails (I believe it worked under 
2.0.0, so this issue seems like the most likely cause of the change in behaviour):
{code}
from pyspark.sql import functions as F
ds = spark.createDataFrame(sc.parallelize([[1, 1, 2], [1, 2, 3], [1, 3, 4]]))
ds.groupBy(ds._1).agg(F.first(ds._2), F.countDistinct(ds._2), 
F.countDistinct(ds._2, ds._3)).show()
{code}
It works if any of the three arguments to {{.agg}} is removed.

The stack trace is:
{code}
Py4JJavaError Traceback (most recent call last)
 in ()
> 1 
ds.groupBy(ds._1).agg(F.first(ds._2),F.countDistinct(ds._2),F.countDistinct(ds._2,
 ds._3)).show()

/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/dataframe.py in 
show(self, n, truncate)
285 +---+-+
286 """
--> 287 print(self._jdf.showString(n, truncate))
288
289 def __repr__(self):

/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py
 in __call__(self, *args)
   1131 answer = self.gateway_client.send_command(command)
   1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
   1134
   1135 for temp_arg in temp_args:

/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py in 
deco(*a, **kw)
 61 def deco(*a, **kw):
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:
 65 s = e.java_exception.toString()

/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py
 in get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(

Py4JJavaError: An error occurred while calling o76.showString.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
tree: first(_2#1L)()
at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:387)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:256)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$patchAggregateFunctionChildren$1(RewriteDistinctAggregates.scala:140)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:182)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:180)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.rewrite(RewriteDistinctAggregates.scala:180)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:105)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:104)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at 

[jira] [Assigned] (SPARK-18146) Avoid using Union to chain together create table and repair partition commands

2016-10-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18146:


Assignee: Apache Spark

> Avoid using Union to chain together create table and repair partition commands
> --
>
> Key: SPARK-18146
> URL: https://issues.apache.org/jira/browse/SPARK-18146
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>Assignee: Apache Spark
>Priority: Minor
>
> The behavior of union is not well defined here. We should add an internal 
> command to execute these commands sequentially.






[jira] [Assigned] (SPARK-18146) Avoid using Union to chain together create table and repair partition commands

2016-10-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18146:


Assignee: (was: Apache Spark)

> Avoid using Union to chain together create table and repair partition commands
> --
>
> Key: SPARK-18146
> URL: https://issues.apache.org/jira/browse/SPARK-18146
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>Priority: Minor
>
> The behavior of union is not well defined here. We should add an internal 
> command to execute these commands sequentially.






[jira] [Commented] (SPARK-18146) Avoid using Union to chain together create table and repair partition commands

2016-10-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613584#comment-15613584
 ] 

Apache Spark commented on SPARK-18146:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/15665

> Avoid using Union to chain together create table and repair partition commands
> --
>
> Key: SPARK-18146
> URL: https://issues.apache.org/jira/browse/SPARK-18146
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>Priority: Minor
>
> The behavior of union is not well defined here. We should add an internal 
> command to execute these commands sequentially.
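
As a rough illustration of the proposal (a hypothetical sketch with invented 
names, not the implementation in the pull request above; it assumes Spark's 
internal RunnableCommand trait with run(SparkSession): Seq[Row], as in 2.x), an 
internal command could simply run its child commands in order:
{noformat}
// Hypothetical sketch only: execute a sequence of commands one after another
// instead of chaining them with Union. SequentialCommands is an invented name.
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.execution.command.RunnableCommand

case class SequentialCommands(commands: Seq[RunnableCommand]) extends RunnableCommand {
  override def run(sparkSession: SparkSession): Seq[Row] = {
    commands.foreach(_.run(sparkSession))  // e.g. create table, then repair partitions
    Seq.empty[Row]
  }
}
{noformat}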






[jira] [Commented] (SPARK-18125) Spark generated code causes CompileException when groupByKey, reduceGroups and map(_._2) are used

2016-10-27 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613580#comment-15613580
 ] 

Herman van Hovell commented on SPARK-18125:
---

I tried something like this on master and on branch-2.0:
{noformat}
val ds = spark.range(1).select($"id" % 100 as "grp_id", 
array($"id")).as[(Long, Seq[Long])] 
ds.groupByKey(_._1).reduceGroups((a, b) => (a._1, a._2 ++ b._2)).map(_._2)
{noformat}

> Spark generated code causes CompileException when groupByKey, reduceGroups 
> and map(_._2) are used
> -
>
> Key: SPARK-18125
> URL: https://issues.apache.org/jira/browse/SPARK-18125
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
>Reporter: Ray Qiu
>Priority: Critical
>
> Code logic looks like this:
> {noformat}
> .groupByKey
> .reduceGroups
> .map(_._2)
> {noformat}
> Works fine with 2.0.0.
> 2.0.1 error Message: 
> {noformat}
> Caused by: java.util.concurrent.ExecutionException: java.lang.Exception: 
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 206, Column 123: Unknown variable or type "value4"
> /* 001 */ public java.lang.Object generate(Object[] references) {
> /* 002 */   return new SpecificMutableProjection(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificMutableProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */   private MutableRow mutableRow;
> /* 009 */   private Object[] values;
> /* 010 */   private java.lang.String errMsg;
> /* 011 */   private java.lang.String errMsg1;
> /* 012 */   private boolean MapObjects_loopIsNull1;
> /* 013 */   private io.mistnet.analytics.lib.ConnLog MapObjects_loopValue0;
> /* 014 */   private java.lang.String errMsg2;
> /* 015 */   private Object[] values1;
> /* 016 */   private boolean MapObjects_loopIsNull3;
> /* 017 */   private java.lang.String MapObjects_loopValue2;
> /* 018 */   private boolean isNull_0;
> /* 019 */   private boolean value_0;
> /* 020 */   private boolean isNull_1;
> /* 021 */   private InternalRow value_1;
> /* 022 */
> /* 023 */   private void apply_4(InternalRow i) {
> /* 024 */
> /* 025 */ boolean isNull52 = MapObjects_loopIsNull1;
> /* 026 */ final double value52 = isNull52 ? -1.0 : 
> MapObjects_loopValue0.ts();
> /* 027 */ if (isNull52) {
> /* 028 */   values1[8] = null;
> /* 029 */ } else {
> /* 030 */   values1[8] = value52;
> /* 031 */ }
> /* 032 */ boolean isNull54 = MapObjects_loopIsNull1;
> /* 033 */ final java.lang.String value54 = isNull54 ? null : 
> (java.lang.String) MapObjects_loopValue0.uid();
> /* 034 */ isNull54 = value54 == null;
> /* 035 */ boolean isNull53 = isNull54;
> /* 036 */ final UTF8String value53 = isNull53 ? null : 
> org.apache.spark.unsafe.types.UTF8String.fromString(value54);
> /* 037 */ isNull53 = value53 == null;
> /* 038 */ if (isNull53) {
> /* 039 */   values1[9] = null;
> /* 040 */ } else {
> /* 041 */   values1[9] = value53;
> /* 042 */ }
> /* 043 */ boolean isNull56 = MapObjects_loopIsNull1;
> /* 044 */ final java.lang.String value56 = isNull56 ? null : 
> (java.lang.String) MapObjects_loopValue0.src();
> /* 045 */ isNull56 = value56 == null;
> /* 046 */ boolean isNull55 = isNull56;
> /* 047 */ final UTF8String value55 = isNull55 ? null : 
> org.apache.spark.unsafe.types.UTF8String.fromString(value56);
> /* 048 */ isNull55 = value55 == null;
> /* 049 */ if (isNull55) {
> /* 050 */   values1[10] = null;
> /* 051 */ } else {
> /* 052 */   values1[10] = value55;
> /* 053 */ }
> /* 054 */   }
> /* 055 */
> /* 056 */
> /* 057 */   private void apply_7(InternalRow i) {
> /* 058 */
> /* 059 */ boolean isNull69 = MapObjects_loopIsNull1;
> /* 060 */ final scala.Option value69 = isNull69 ? null : (scala.Option) 
> MapObjects_loopValue0.orig_bytes();
> /* 061 */ isNull69 = value69 == null;
> /* 062 */
> /* 063 */ final boolean isNull68 = isNull69 || value69.isEmpty();
> /* 064 */ long value68 = isNull68 ?
> /* 065 */ -1L : (Long) value69.get();
> /* 066 */ if (isNull68) {
> /* 067 */   values1[17] = null;
> /* 068 */ } else {
> /* 069 */   values1[17] = value68;
> /* 070 */ }
> /* 071 */ boolean isNull71 = MapObjects_loopIsNull1;
> /* 072 */ final scala.Option value71 = isNull71 ? null : (scala.Option) 
> MapObjects_loopValue0.resp_bytes();
> /* 073 */ isNull71 = value71 == null;
> /* 074 */
> /* 075 */ final boolean isNull70 = isNull71 || value71.isEmpty();
> /* 076 */ long value70 = isNull70 ?
> /* 077 */ -1L : (Long) value71.get();

[jira] [Updated] (SPARK-18147) Broken Spark SQL Codegen

2016-10-27 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18147:
-
Target Version/s: 2.1.0
Priority: Critical  (was: Minor)

> Broken Spark SQL Codegen
> 
>
> Key: SPARK-18147
> URL: https://issues.apache.org/jira/browse/SPARK-18147
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: koert kuipers
>Priority: Critical
>
> this is me deliberately trying to break Spark SQL codegen to uncover potential 
> issues, by creating arbitrarily complex data structures using primitives, 
> strings, basic collections (map, seq, option), tuples, and case classes.
> first example: nested case classes
> code:
> {noformat}
> class ComplexResultAgg[B: TypeTag, C: TypeTag](val zero: B, result: C) 
> extends Aggregator[Row, B, C] {
>   override def reduce(b: B, input: Row): B = b
>   override def merge(b1: B, b2: B): B = b1
>   override def finish(reduction: B): C = result
>   override def bufferEncoder: Encoder[B] = ExpressionEncoder[B]()
>   override def outputEncoder: Encoder[C] = ExpressionEncoder[C]()
> }
> case class Struct2(d: Double = 0.0, s1: Seq[Double] = Seq.empty, s2: 
> Seq[Long] = Seq.empty)
> case class Struct3(a: Struct2 = Struct2(), b: Struct2 = Struct2())
> val df1 = Seq(("a", "aa"), ("a", "aa"), ("b", "b"), ("b", null)).toDF("x", 
> "y").groupBy("x").agg(
>   new ComplexResultAgg("boo", Struct3()).toColumn
> )
> df1.printSchema
> df1.show
> {noformat}
> the result is:
> {noformat}
> [info]   Cause: java.util.concurrent.ExecutionException: java.lang.Exception: 
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 33, Column 12: Expression "isNull1" is not an rvalue
> [info] /* 001 */ public java.lang.Object generate(Object[] references) {
> [info] /* 002 */   return new SpecificMutableProjection(references);
> [info] /* 003 */ }
> [info] /* 004 */
> [info] /* 005 */ class SpecificMutableProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
> [info] /* 006 */
> [info] /* 007 */   private Object[] references;
> [info] /* 008 */   private MutableRow mutableRow;
> [info] /* 009 */   private Object[] values;
> [info] /* 010 */   private java.lang.String errMsg;
> [info] /* 011 */   private Object[] values1;
> [info] /* 012 */   private java.lang.String errMsg1;
> [info] /* 013 */   private boolean[] argIsNulls;
> [info] /* 014 */   private scala.collection.Seq argValue;
> [info] /* 015 */   private java.lang.String errMsg2;
> [info] /* 016 */   private boolean[] argIsNulls1;
> [info] /* 017 */   private scala.collection.Seq argValue1;
> [info] /* 018 */   private java.lang.String errMsg3;
> [info] /* 019 */   private java.lang.String errMsg4;
> [info] /* 020 */   private Object[] values2;
> [info] /* 021 */   private java.lang.String errMsg5;
> [info] /* 022 */   private boolean[] argIsNulls2;
> [info] /* 023 */   private scala.collection.Seq argValue2;
> [info] /* 024 */   private java.lang.String errMsg6;
> [info] /* 025 */   private boolean[] argIsNulls3;
> [info] /* 026 */   private scala.collection.Seq argValue3;
> [info] /* 027 */   private java.lang.String errMsg7;
> [info] /* 028 */   private boolean isNull_0;
> [info] /* 029 */   private InternalRow value_0;
> [info] /* 030 */
> [info] /* 031 */   private void apply_1(InternalRow i) {
> [info] /* 032 */
> [info] /* 033 */ if (isNull1) {
> [info] /* 034 */   throw new RuntimeException(errMsg3);
> [info] /* 035 */ }
> [info] /* 036 */
> [info] /* 037 */ boolean isNull24 = false;
> [info] /* 038 */ final com.tresata.spark.sql.Struct2 value24 = isNull24 ? 
> null : (com.tresata.spark.sql.Struct2) value1.a();
> [info] /* 039 */ isNull24 = value24 == null;
> [info] /* 040 */
> [info] /* 041 */ boolean isNull23 = isNull24;
> [info] /* 042 */ final scala.collection.Seq value23 = isNull23 ? null : 
> (scala.collection.Seq) value24.s2();
> [info] /* 043 */ isNull23 = value23 == null;
> [info] /* 044 */ argIsNulls1[0] = isNull23;
> [info] /* 045 */ argValue1 = value23;
> [info] /* 046 */
> [info] /* 047 */
> [info] /* 048 */
> [info] /* 049 */ boolean isNull22 = false;
> [info] /* 050 */ for (int idx = 0; idx < 1; idx++) {
> [info] /* 051 */   if (argIsNulls1[idx]) { isNull22 = true; break; }
> [info] /* 052 */ }
> [info] /* 053 */
> [info] /* 054 */ final ArrayData value22 = isNull22 ? null : new 
> org.apache.spark.sql.catalyst.util.GenericArrayData(argValue1);
> [info] /* 055 */ if (isNull22) {
> [info] /* 056 */   values1[2] = null;
> [info] /* 057 */ } else {
> [info] /* 058 */   values1[2] = value22;
> [info] /* 059 */ }
> [info] /* 060 */   }
> [info] /* 061 */
> [info] /* 062 */
> [info] /* 

[jira] [Commented] (SPARK-18147) Broken Spark SQL Codegen

2016-10-27 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613555#comment-15613555
 ] 

Michael Armbrust commented on SPARK-18147:
--

/cc [~cloud_fan]

> Broken Spark SQL Codegen
> 
>
> Key: SPARK-18147
> URL: https://issues.apache.org/jira/browse/SPARK-18147
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: koert kuipers
>Priority: Minor
>
> this is me deliberately trying to break Spark SQL codegen to uncover potential 
> issues, by creating arbitrarily complex data structures using primitives, 
> strings, basic collections (map, seq, option), tuples, and case classes.
> first example: nested case classes
> code:
> {noformat}
> class ComplexResultAgg[B: TypeTag, C: TypeTag](val zero: B, result: C) 
> extends Aggregator[Row, B, C] {
>   override def reduce(b: B, input: Row): B = b
>   override def merge(b1: B, b2: B): B = b1
>   override def finish(reduction: B): C = result
>   override def bufferEncoder: Encoder[B] = ExpressionEncoder[B]()
>   override def outputEncoder: Encoder[C] = ExpressionEncoder[C]()
> }
> case class Struct2(d: Double = 0.0, s1: Seq[Double] = Seq.empty, s2: 
> Seq[Long] = Seq.empty)
> case class Struct3(a: Struct2 = Struct2(), b: Struct2 = Struct2())
> val df1 = Seq(("a", "aa"), ("a", "aa"), ("b", "b"), ("b", null)).toDF("x", 
> "y").groupBy("x").agg(
>   new ComplexResultAgg("boo", Struct3()).toColumn
> )
> df1.printSchema
> df1.show
> {noformat}
> the result is:
> {noformat}
> [info]   Cause: java.util.concurrent.ExecutionException: java.lang.Exception: 
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 33, Column 12: Expression "isNull1" is not an rvalue
> [info] /* 001 */ public java.lang.Object generate(Object[] references) {
> [info] /* 002 */   return new SpecificMutableProjection(references);
> [info] /* 003 */ }
> [info] /* 004 */
> [info] /* 005 */ class SpecificMutableProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
> [info] /* 006 */
> [info] /* 007 */   private Object[] references;
> [info] /* 008 */   private MutableRow mutableRow;
> [info] /* 009 */   private Object[] values;
> [info] /* 010 */   private java.lang.String errMsg;
> [info] /* 011 */   private Object[] values1;
> [info] /* 012 */   private java.lang.String errMsg1;
> [info] /* 013 */   private boolean[] argIsNulls;
> [info] /* 014 */   private scala.collection.Seq argValue;
> [info] /* 015 */   private java.lang.String errMsg2;
> [info] /* 016 */   private boolean[] argIsNulls1;
> [info] /* 017 */   private scala.collection.Seq argValue1;
> [info] /* 018 */   private java.lang.String errMsg3;
> [info] /* 019 */   private java.lang.String errMsg4;
> [info] /* 020 */   private Object[] values2;
> [info] /* 021 */   private java.lang.String errMsg5;
> [info] /* 022 */   private boolean[] argIsNulls2;
> [info] /* 023 */   private scala.collection.Seq argValue2;
> [info] /* 024 */   private java.lang.String errMsg6;
> [info] /* 025 */   private boolean[] argIsNulls3;
> [info] /* 026 */   private scala.collection.Seq argValue3;
> [info] /* 027 */   private java.lang.String errMsg7;
> [info] /* 028 */   private boolean isNull_0;
> [info] /* 029 */   private InternalRow value_0;
> [info] /* 030 */
> [info] /* 031 */   private void apply_1(InternalRow i) {
> [info] /* 032 */
> [info] /* 033 */ if (isNull1) {
> [info] /* 034 */   throw new RuntimeException(errMsg3);
> [info] /* 035 */ }
> [info] /* 036 */
> [info] /* 037 */ boolean isNull24 = false;
> [info] /* 038 */ final com.tresata.spark.sql.Struct2 value24 = isNull24 ? 
> null : (com.tresata.spark.sql.Struct2) value1.a();
> [info] /* 039 */ isNull24 = value24 == null;
> [info] /* 040 */
> [info] /* 041 */ boolean isNull23 = isNull24;
> [info] /* 042 */ final scala.collection.Seq value23 = isNull23 ? null : 
> (scala.collection.Seq) value24.s2();
> [info] /* 043 */ isNull23 = value23 == null;
> [info] /* 044 */ argIsNulls1[0] = isNull23;
> [info] /* 045 */ argValue1 = value23;
> [info] /* 046 */
> [info] /* 047 */
> [info] /* 048 */
> [info] /* 049 */ boolean isNull22 = false;
> [info] /* 050 */ for (int idx = 0; idx < 1; idx++) {
> [info] /* 051 */   if (argIsNulls1[idx]) { isNull22 = true; break; }
> [info] /* 052 */ }
> [info] /* 053 */
> [info] /* 054 */ final ArrayData value22 = isNull22 ? null : new 
> org.apache.spark.sql.catalyst.util.GenericArrayData(argValue1);
> [info] /* 055 */ if (isNull22) {
> [info] /* 056 */   values1[2] = null;
> [info] /* 057 */ } else {
> [info] /* 058 */   values1[2] = value22;
> [info] /* 059 */ }
> [info] /* 060 */   }
> [info] /* 061 */
> [info] /* 062 */
> [info] /* 063 */   private 

[jira] [Commented] (SPARK-18125) Spark generated code causes CompileException when groupByKey, reduceGroups and map(_._2) are used

2016-10-27 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613497#comment-15613497
 ] 

Herman van Hovell commented on SPARK-18125:
---

Could one of you provide a reproducible example?

> Spark generated code causes CompileException when groupByKey, reduceGroups 
> and map(_._2) are used
> -
>
> Key: SPARK-18125
> URL: https://issues.apache.org/jira/browse/SPARK-18125
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
>Reporter: Ray Qiu
>Priority: Critical
>
> Code logic looks like this:
> {noformat}
> .groupByKey
> .reduceGroups
> .map(_._2)
> {noformat}
> Works fine with 2.0.0.
> 2.0.1 error Message: 
> {noformat}
> Caused by: java.util.concurrent.ExecutionException: java.lang.Exception: 
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 206, Column 123: Unknown variable or type "value4"
> /* 001 */ public java.lang.Object generate(Object[] references) {
> /* 002 */   return new SpecificMutableProjection(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificMutableProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */   private MutableRow mutableRow;
> /* 009 */   private Object[] values;
> /* 010 */   private java.lang.String errMsg;
> /* 011 */   private java.lang.String errMsg1;
> /* 012 */   private boolean MapObjects_loopIsNull1;
> /* 013 */   private io.mistnet.analytics.lib.ConnLog MapObjects_loopValue0;
> /* 014 */   private java.lang.String errMsg2;
> /* 015 */   private Object[] values1;
> /* 016 */   private boolean MapObjects_loopIsNull3;
> /* 017 */   private java.lang.String MapObjects_loopValue2;
> /* 018 */   private boolean isNull_0;
> /* 019 */   private boolean value_0;
> /* 020 */   private boolean isNull_1;
> /* 021 */   private InternalRow value_1;
> /* 022 */
> /* 023 */   private void apply_4(InternalRow i) {
> /* 024 */
> /* 025 */ boolean isNull52 = MapObjects_loopIsNull1;
> /* 026 */ final double value52 = isNull52 ? -1.0 : 
> MapObjects_loopValue0.ts();
> /* 027 */ if (isNull52) {
> /* 028 */   values1[8] = null;
> /* 029 */ } else {
> /* 030 */   values1[8] = value52;
> /* 031 */ }
> /* 032 */ boolean isNull54 = MapObjects_loopIsNull1;
> /* 033 */ final java.lang.String value54 = isNull54 ? null : 
> (java.lang.String) MapObjects_loopValue0.uid();
> /* 034 */ isNull54 = value54 == null;
> /* 035 */ boolean isNull53 = isNull54;
> /* 036 */ final UTF8String value53 = isNull53 ? null : 
> org.apache.spark.unsafe.types.UTF8String.fromString(value54);
> /* 037 */ isNull53 = value53 == null;
> /* 038 */ if (isNull53) {
> /* 039 */   values1[9] = null;
> /* 040 */ } else {
> /* 041 */   values1[9] = value53;
> /* 042 */ }
> /* 043 */ boolean isNull56 = MapObjects_loopIsNull1;
> /* 044 */ final java.lang.String value56 = isNull56 ? null : 
> (java.lang.String) MapObjects_loopValue0.src();
> /* 045 */ isNull56 = value56 == null;
> /* 046 */ boolean isNull55 = isNull56;
> /* 047 */ final UTF8String value55 = isNull55 ? null : 
> org.apache.spark.unsafe.types.UTF8String.fromString(value56);
> /* 048 */ isNull55 = value55 == null;
> /* 049 */ if (isNull55) {
> /* 050 */   values1[10] = null;
> /* 051 */ } else {
> /* 052 */   values1[10] = value55;
> /* 053 */ }
> /* 054 */   }
> /* 055 */
> /* 056 */
> /* 057 */   private void apply_7(InternalRow i) {
> /* 058 */
> /* 059 */ boolean isNull69 = MapObjects_loopIsNull1;
> /* 060 */ final scala.Option value69 = isNull69 ? null : (scala.Option) 
> MapObjects_loopValue0.orig_bytes();
> /* 061 */ isNull69 = value69 == null;
> /* 062 */
> /* 063 */ final boolean isNull68 = isNull69 || value69.isEmpty();
> /* 064 */ long value68 = isNull68 ?
> /* 065 */ -1L : (Long) value69.get();
> /* 066 */ if (isNull68) {
> /* 067 */   values1[17] = null;
> /* 068 */ } else {
> /* 069 */   values1[17] = value68;
> /* 070 */ }
> /* 071 */ boolean isNull71 = MapObjects_loopIsNull1;
> /* 072 */ final scala.Option value71 = isNull71 ? null : (scala.Option) 
> MapObjects_loopValue0.resp_bytes();
> /* 073 */ isNull71 = value71 == null;
> /* 074 */
> /* 075 */ final boolean isNull70 = isNull71 || value71.isEmpty();
> /* 076 */ long value70 = isNull70 ?
> /* 077 */ -1L : (Long) value71.get();
> /* 078 */ if (isNull70) {
> /* 079 */   values1[18] = null;
> /* 080 */ } else {
> /* 081 */   values1[18] = value70;
> /* 082 */ }
> /* 083 */ boolean isNull74 = 

[jira] [Updated] (SPARK-18125) Spark generated code causes CompileException when groupByKey, reduceGroups and map(_._2) are used

2016-10-27 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-18125:
--
Description: 
Code logic looks like this:
{noformat}
.groupByKey
.reduceGroups
.map(_._2)
{noformat}
Works fine with 2.0.0.

2.0.1 error Message: 
{noformat}
Caused by: java.util.concurrent.ExecutionException: java.lang.Exception: failed 
to compile: org.codehaus.commons.compiler.CompileException: File 
'generated.java', Line 206, Column 123: Unknown variable or type "value4"
/* 001 */ public java.lang.Object generate(Object[] references) {
/* 002 */   return new SpecificMutableProjection(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificMutableProjection extends 
org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */   private MutableRow mutableRow;
/* 009 */   private Object[] values;
/* 010 */   private java.lang.String errMsg;
/* 011 */   private java.lang.String errMsg1;
/* 012 */   private boolean MapObjects_loopIsNull1;
/* 013 */   private io.mistnet.analytics.lib.ConnLog MapObjects_loopValue0;
/* 014 */   private java.lang.String errMsg2;
/* 015 */   private Object[] values1;
/* 016 */   private boolean MapObjects_loopIsNull3;
/* 017 */   private java.lang.String MapObjects_loopValue2;
/* 018 */   private boolean isNull_0;
/* 019 */   private boolean value_0;
/* 020 */   private boolean isNull_1;
/* 021 */   private InternalRow value_1;
/* 022 */
/* 023 */   private void apply_4(InternalRow i) {
/* 024 */
/* 025 */ boolean isNull52 = MapObjects_loopIsNull1;
/* 026 */ final double value52 = isNull52 ? -1.0 : 
MapObjects_loopValue0.ts();
/* 027 */ if (isNull52) {
/* 028 */   values1[8] = null;
/* 029 */ } else {
/* 030 */   values1[8] = value52;
/* 031 */ }
/* 032 */ boolean isNull54 = MapObjects_loopIsNull1;
/* 033 */ final java.lang.String value54 = isNull54 ? null : 
(java.lang.String) MapObjects_loopValue0.uid();
/* 034 */ isNull54 = value54 == null;
/* 035 */ boolean isNull53 = isNull54;
/* 036 */ final UTF8String value53 = isNull53 ? null : 
org.apache.spark.unsafe.types.UTF8String.fromString(value54);
/* 037 */ isNull53 = value53 == null;
/* 038 */ if (isNull53) {
/* 039 */   values1[9] = null;
/* 040 */ } else {
/* 041 */   values1[9] = value53;
/* 042 */ }
/* 043 */ boolean isNull56 = MapObjects_loopIsNull1;
/* 044 */ final java.lang.String value56 = isNull56 ? null : 
(java.lang.String) MapObjects_loopValue0.src();
/* 045 */ isNull56 = value56 == null;
/* 046 */ boolean isNull55 = isNull56;
/* 047 */ final UTF8String value55 = isNull55 ? null : 
org.apache.spark.unsafe.types.UTF8String.fromString(value56);
/* 048 */ isNull55 = value55 == null;
/* 049 */ if (isNull55) {
/* 050 */   values1[10] = null;
/* 051 */ } else {
/* 052 */   values1[10] = value55;
/* 053 */ }
/* 054 */   }
/* 055 */
/* 056 */
/* 057 */   private void apply_7(InternalRow i) {
/* 058 */
/* 059 */ boolean isNull69 = MapObjects_loopIsNull1;
/* 060 */ final scala.Option value69 = isNull69 ? null : (scala.Option) 
MapObjects_loopValue0.orig_bytes();
/* 061 */ isNull69 = value69 == null;
/* 062 */
/* 063 */ final boolean isNull68 = isNull69 || value69.isEmpty();
/* 064 */ long value68 = isNull68 ?
/* 065 */ -1L : (Long) value69.get();
/* 066 */ if (isNull68) {
/* 067 */   values1[17] = null;
/* 068 */ } else {
/* 069 */   values1[17] = value68;
/* 070 */ }
/* 071 */ boolean isNull71 = MapObjects_loopIsNull1;
/* 072 */ final scala.Option value71 = isNull71 ? null : (scala.Option) 
MapObjects_loopValue0.resp_bytes();
/* 073 */ isNull71 = value71 == null;
/* 074 */
/* 075 */ final boolean isNull70 = isNull71 || value71.isEmpty();
/* 076 */ long value70 = isNull70 ?
/* 077 */ -1L : (Long) value71.get();
/* 078 */ if (isNull70) {
/* 079 */   values1[18] = null;
/* 080 */ } else {
/* 081 */   values1[18] = value70;
/* 082 */ }
/* 083 */ boolean isNull74 = MapObjects_loopIsNull1;
/* 084 */ final scala.Option value74 = isNull74 ? null : (scala.Option) 
MapObjects_loopValue0.conn_state();
/* 085 */ isNull74 = value74 == null;
/* 086 */
/* 087 */ final boolean isNull73 = isNull74 || value74.isEmpty();
/* 088 */ java.lang.String value73 = isNull73 ?
/* 089 */ null : (java.lang.String) value74.get();
/* 090 */ boolean isNull72 = isNull73;
/* 091 */ final UTF8String value72 = isNull72 ? null : 
org.apache.spark.unsafe.types.UTF8String.fromString(value73);
/* 092 */ isNull72 = value72 == null;
/* 093 */ if (isNull72) {
/* 094 */   values1[19] = null;
/* 095 */ } else {
/* 096 */   values1[19] = value72;
/* 097 */ }
/* 098 */   }
/* 099 */
/* 100 */
/* 

[jira] [Commented] (SPARK-18147) Broken Spark SQL Codegen

2016-10-27 Thread koert kuipers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613472#comment-15613472
 ] 

koert kuipers commented on SPARK-18147:
---

it also breaks with an Option of a case class, like this:
{noformat}
 val df1 = Seq(("a", "aa"), ("a", "aa"), ("b", "b"), ("b", null)).toDF("x", 
"y").groupBy("x").agg(
   new ComplexResultAgg("boo", Option(Struct2())).toColumn
 )
{noformat}

> Broken Spark SQL Codegen
> 
>
> Key: SPARK-18147
> URL: https://issues.apache.org/jira/browse/SPARK-18147
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: koert kuipers
>Priority: Minor
>
> this is me deliberately trying to break Spark SQL codegen to uncover potential 
> issues, by creating arbitrarily complex data structures using primitives, 
> strings, basic collections (map, seq, option), tuples, and case classes.
> first example: nested case classes
> code:
> {noformat}
> class ComplexResultAgg[B: TypeTag, C: TypeTag](val zero: B, result: C) 
> extends Aggregator[Row, B, C] {
>   override def reduce(b: B, input: Row): B = b
>   override def merge(b1: B, b2: B): B = b1
>   override def finish(reduction: B): C = result
>   override def bufferEncoder: Encoder[B] = ExpressionEncoder[B]()
>   override def outputEncoder: Encoder[C] = ExpressionEncoder[C]()
> }
> case class Struct2(d: Double = 0.0, s1: Seq[Double] = Seq.empty, s2: 
> Seq[Long] = Seq.empty)
> case class Struct3(a: Struct2 = Struct2(), b: Struct2 = Struct2())
> val df1 = Seq(("a", "aa"), ("a", "aa"), ("b", "b"), ("b", null)).toDF("x", 
> "y").groupBy("x").agg(
>   new ComplexResultAgg("boo", Struct3()).toColumn
> )
> df1.printSchema
> df1.show
> {noformat}
> the result is:
> {noformat}
> [info]   Cause: java.util.concurrent.ExecutionException: java.lang.Exception: 
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 33, Column 12: Expression "isNull1" is not an rvalue
> [info] /* 001 */ public java.lang.Object generate(Object[] references) {
> [info] /* 002 */   return new SpecificMutableProjection(references);
> [info] /* 003 */ }
> [info] /* 004 */
> [info] /* 005 */ class SpecificMutableProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
> [info] /* 006 */
> [info] /* 007 */   private Object[] references;
> [info] /* 008 */   private MutableRow mutableRow;
> [info] /* 009 */   private Object[] values;
> [info] /* 010 */   private java.lang.String errMsg;
> [info] /* 011 */   private Object[] values1;
> [info] /* 012 */   private java.lang.String errMsg1;
> [info] /* 013 */   private boolean[] argIsNulls;
> [info] /* 014 */   private scala.collection.Seq argValue;
> [info] /* 015 */   private java.lang.String errMsg2;
> [info] /* 016 */   private boolean[] argIsNulls1;
> [info] /* 017 */   private scala.collection.Seq argValue1;
> [info] /* 018 */   private java.lang.String errMsg3;
> [info] /* 019 */   private java.lang.String errMsg4;
> [info] /* 020 */   private Object[] values2;
> [info] /* 021 */   private java.lang.String errMsg5;
> [info] /* 022 */   private boolean[] argIsNulls2;
> [info] /* 023 */   private scala.collection.Seq argValue2;
> [info] /* 024 */   private java.lang.String errMsg6;
> [info] /* 025 */   private boolean[] argIsNulls3;
> [info] /* 026 */   private scala.collection.Seq argValue3;
> [info] /* 027 */   private java.lang.String errMsg7;
> [info] /* 028 */   private boolean isNull_0;
> [info] /* 029 */   private InternalRow value_0;
> [info] /* 030 */
> [info] /* 031 */   private void apply_1(InternalRow i) {
> [info] /* 032 */
> [info] /* 033 */ if (isNull1) {
> [info] /* 034 */   throw new RuntimeException(errMsg3);
> [info] /* 035 */ }
> [info] /* 036 */
> [info] /* 037 */ boolean isNull24 = false;
> [info] /* 038 */ final com.tresata.spark.sql.Struct2 value24 = isNull24 ? 
> null : (com.tresata.spark.sql.Struct2) value1.a();
> [info] /* 039 */ isNull24 = value24 == null;
> [info] /* 040 */
> [info] /* 041 */ boolean isNull23 = isNull24;
> [info] /* 042 */ final scala.collection.Seq value23 = isNull23 ? null : 
> (scala.collection.Seq) value24.s2();
> [info] /* 043 */ isNull23 = value23 == null;
> [info] /* 044 */ argIsNulls1[0] = isNull23;
> [info] /* 045 */ argValue1 = value23;
> [info] /* 046 */
> [info] /* 047 */
> [info] /* 048 */
> [info] /* 049 */ boolean isNull22 = false;
> [info] /* 050 */ for (int idx = 0; idx < 1; idx++) {
> [info] /* 051 */   if (argIsNulls1[idx]) { isNull22 = true; break; }
> [info] /* 052 */ }
> [info] /* 053 */
> [info] /* 054 */ final ArrayData value22 = isNull22 ? null : new 
> org.apache.spark.sql.catalyst.util.GenericArrayData(argValue1);
> [info] /* 055 */ if (isNull22) {
> [info] /* 056 */  

[jira] [Updated] (SPARK-18147) Broken Spark SQL Codegen

2016-10-27 Thread koert kuipers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

koert kuipers updated SPARK-18147:
--
Description: 
This is me deliberately trying to break Spark SQL codegen to uncover potential 
issues, by creating arbitrarily complex data structures using primitives, 
strings, basic collections (map, seq, option), tuples, and case classes.

first example: nested case classes
code:
{noformat}
class ComplexResultAgg[B: TypeTag, C: TypeTag](val zero: B, result: C) extends 
Aggregator[Row, B, C] {
  override def reduce(b: B, input: Row): B = b

  override def merge(b1: B, b2: B): B = b1

  override def finish(reduction: B): C = result

  override def bufferEncoder: Encoder[B] = ExpressionEncoder[B]()
  override def outputEncoder: Encoder[C] = ExpressionEncoder[C]()
}

case class Struct2(d: Double = 0.0, s1: Seq[Double] = Seq.empty, s2: Seq[Long] 
= Seq.empty)

case class Struct3(a: Struct2 = Struct2(), b: Struct2 = Struct2())

val df1 = Seq(("a", "aa"), ("a", "aa"), ("b", "b"), ("b", null)).toDF("x", 
"y").groupBy("x").agg(
  new ComplexResultAgg("boo", Struct3()).toColumn
)
df1.printSchema
df1.show
{noformat}

the result is:
{noformat}
[info]   Cause: java.util.concurrent.ExecutionException: java.lang.Exception: 
failed to compile: org.codehaus.commons.compiler.CompileException: File 
'generated.java', Line 33, Column 12: Expression "isNull1" is not an rvalue
[info] /* 001 */ public java.lang.Object generate(Object[] references) {
[info] /* 002 */   return new SpecificMutableProjection(references);
[info] /* 003 */ }
[info] /* 004 */
[info] /* 005 */ class SpecificMutableProjection extends 
org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
[info] /* 006 */
[info] /* 007 */   private Object[] references;
[info] /* 008 */   private MutableRow mutableRow;
[info] /* 009 */   private Object[] values;
[info] /* 010 */   private java.lang.String errMsg;
[info] /* 011 */   private Object[] values1;
[info] /* 012 */   private java.lang.String errMsg1;
[info] /* 013 */   private boolean[] argIsNulls;
[info] /* 014 */   private scala.collection.Seq argValue;
[info] /* 015 */   private java.lang.String errMsg2;
[info] /* 016 */   private boolean[] argIsNulls1;
[info] /* 017 */   private scala.collection.Seq argValue1;
[info] /* 018 */   private java.lang.String errMsg3;
[info] /* 019 */   private java.lang.String errMsg4;
[info] /* 020 */   private Object[] values2;
[info] /* 021 */   private java.lang.String errMsg5;
[info] /* 022 */   private boolean[] argIsNulls2;
[info] /* 023 */   private scala.collection.Seq argValue2;
[info] /* 024 */   private java.lang.String errMsg6;
[info] /* 025 */   private boolean[] argIsNulls3;
[info] /* 026 */   private scala.collection.Seq argValue3;
[info] /* 027 */   private java.lang.String errMsg7;
[info] /* 028 */   private boolean isNull_0;
[info] /* 029 */   private InternalRow value_0;
[info] /* 030 */
[info] /* 031 */   private void apply_1(InternalRow i) {
[info] /* 032 */
[info] /* 033 */ if (isNull1) {
[info] /* 034 */   throw new RuntimeException(errMsg3);
[info] /* 035 */ }
[info] /* 036 */
[info] /* 037 */ boolean isNull24 = false;
[info] /* 038 */ final com.tresata.spark.sql.Struct2 value24 = isNull24 ? 
null : (com.tresata.spark.sql.Struct2) value1.a();
[info] /* 039 */ isNull24 = value24 == null;
[info] /* 040 */
[info] /* 041 */ boolean isNull23 = isNull24;
[info] /* 042 */ final scala.collection.Seq value23 = isNull23 ? null : 
(scala.collection.Seq) value24.s2();
[info] /* 043 */ isNull23 = value23 == null;
[info] /* 044 */ argIsNulls1[0] = isNull23;
[info] /* 045 */ argValue1 = value23;
[info] /* 046 */
[info] /* 047 */
[info] /* 048 */
[info] /* 049 */ boolean isNull22 = false;
[info] /* 050 */ for (int idx = 0; idx < 1; idx++) {
[info] /* 051 */   if (argIsNulls1[idx]) { isNull22 = true; break; }
[info] /* 052 */ }
[info] /* 053 */
[info] /* 054 */ final ArrayData value22 = isNull22 ? null : new 
org.apache.spark.sql.catalyst.util.GenericArrayData(argValue1);
[info] /* 055 */ if (isNull22) {
[info] /* 056 */   values1[2] = null;
[info] /* 057 */ } else {
[info] /* 058 */   values1[2] = value22;
[info] /* 059 */ }
[info] /* 060 */   }
[info] /* 061 */
[info] /* 062 */
[info] /* 063 */   private void apply1_1(InternalRow i) {
[info] /* 064 */
[info] /* 065 */ if (isNull1) {
[info] /* 066 */   throw new RuntimeException(errMsg7);
[info] /* 067 */ }
[info] /* 068 */
[info] /* 069 */ boolean isNull41 = false;
[info] /* 070 */ final com.tresata.spark.sql.Struct2 value41 = isNull41 ? 
null : (com.tresata.spark.sql.Struct2) value1.b();
[info] /* 071 */ isNull41 = value41 == null;
[info] /* 072 */
[info] /* 073 */ boolean isNull40 = isNull41;
[info] /* 074 */ final scala.collection.Seq value40 = isNull40 ? null : 

[jira] [Commented] (SPARK-11046) Pass schema from R to JVM using JSON format

2016-10-27 Thread Sammie Durugo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613459#comment-15613459
 ] 

Sammie Durugo commented on SPARK-11046:
---

I'm not sure anyone has noticed that a nested schema cannot be passed to dapply, 
and that an array type cannot be declared the way "integer", "string", and 
"double" can when defining a schema using structType. I think it would be useful 
to be able to declare an array type when using dapply, as most R outputs take 
the form of an R list object. For example, suppose the R output takes the 
following form:

output = list(bd = array(..., dim = c(d1, d2, d3)),
              dd = matrix(..., nr, nc),
              cp = list(a = matrix(..., nr, nc),
                        b = vector(...))),

In order to define a schema to pass to dapply in the above context, one should 
have the liberty to define the schema in the following form (if possible):

schema = structType(structField("bd", "array"),
                    structField("dd", "array"),
                    structField("cp",
                                structType(structField("a", "array"),
                                           structField("b", "double")))),

which may look like this (if possible):

StructType
|-name = "bd", type = "ArrayType", nullable = TRUE
|-name = "dd", type = "ArrayType", nullable = TRUE
|-name = "cp", type = "ArrayType", nullable = TRUE
   |-name = "a", type = "ArrayType", nullable = TRUE
   |-name = "b", type = "double", nullable = TRUE

At the moment, only a character type is allowed for the data type parameter 
within structField. But by relaxing this condition and allowing a structType to 
be passed inside an existing structType, the above structure can be accommodated 
easily. Also, R list objects, which are very close to R array objects by design, 
should be allowed to map onto Spark's ArrayType.
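
For comparison, the nesting described above corresponds roughly to the following 
in the Scala API (a sketch only; the element types and nullability below are 
illustrative, not something SparkR exposes today):
{code}
import org.apache.spark.sql.types._

// Illustrative only: the nested schema sketched above, expressed with the
// Scala StructType/ArrayType API. dapply would need a way to express the same
// nesting from R.
val schema = StructType(Seq(
  StructField("bd", ArrayType(DoubleType), nullable = true),
  StructField("dd", ArrayType(DoubleType), nullable = true),
  StructField("cp", StructType(Seq(
    StructField("a", ArrayType(DoubleType), nullable = true),
    StructField("b", DoubleType, nullable = true)
  )), nullable = true)
))
{code}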

Having to use the default setting of 'schema = NULL' in dapply, which leaves the 
output as bytes, should be the very last resort. Thank you for your help with 
this.

Sammie.

> Pass schema from R to JVM using JSON format
> ---
>
> Key: SPARK-11046
> URL: https://issues.apache.org/jira/browse/SPARK-11046
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Sun Rui
>Priority: Minor
>
> Currently, SparkR passes a DataFrame schema from R to the JVM backend using a 
> regular expression. However, Spark now supports schemas in JSON format, so 
> SparkR should be enhanced to use the JSON schema format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18123) org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable the case sensitivity issue

2016-10-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18123:


Assignee: (was: Apache Spark)

> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable  the case 
> sensitivity issue
> --
>
> Key: SPARK-18123
> URL: https://issues.apache.org/jira/browse/SPARK-18123
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Paul Wu
>
> Blindly quoting every field name for inserting is the issue (Line 110-119, 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala).
> /**
>* Returns a PreparedStatement that inserts a row into table via conn.
>*/
>   def insertStatement(conn: Connection, table: String, rddSchema: StructType, 
> dialect: JdbcDialect)
>   : PreparedStatement = {
> val columns = rddSchema.fields.map(x => 
> dialect.quoteIdentifier(x.name)).mkString(",")
> val placeholders = rddSchema.fields.map(_ => "?").mkString(",")
> val sql = s"INSERT INTO $table ($columns) VALUES ($placeholders)"
> conn.prepareStatement(sql)
>   }
> This code causes the following issue (it does not happen in 1.6.x):
> I have an issue with the saveTable method in Spark 2.0/2.0.1. I tried to save a 
> dataset to an Oracle database, but the field names must be uppercase to succeed. 
> This is not expected behavior: if only the table name were quoted, this utility 
> would not run into the case sensitivity problem. The code below throws the 
> exception: Caused by: java.sql.SQLSyntaxErrorException: ORA-00904: 
> "DATETIME_gmt": invalid identifier. 
> String detailSQL ="select CAST('2016-09-25 17:00:00' AS TIMESTAMP) 
> DATETIME_gmt, '1' NODEB";
> hc.sql("set spark.sql.caseSensitive=false");
> Dataset ds = hc.sql(detailSQL);
> ds.show();
> 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable(ds, url, 
> detailTable, p);
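
As a rough illustration of why the quoting matters on Oracle (a sketch only, not 
the actual JdbcUtils code; the table name DETAIL is made up): unquoted 
identifiers are folded to upper case by Oracle, while quoted identifiers stay 
case sensitive, so a quoted mixed-case column name no longer matches the 
upper-cased column of an existing table.
{code}
import org.apache.spark.sql.types._

// Sketch only: mimic the per-column quoting that insertStatement applies.
val rddSchema = StructType(Seq(
  StructField("DATETIME_gmt", TimestampType),
  StructField("NODEB", StringType)))

val quotedCols   = rddSchema.fields.map(f => "\"" + f.name + "\"").mkString(",")
val unquotedCols = rddSchema.fields.map(_.name).mkString(",")

// Quoted:   INSERT INTO DETAIL ("DATETIME_gmt","NODEB") VALUES (?,?) -> ORA-00904
// Unquoted: INSERT INTO DETAIL (DATETIME_gmt,NODEB) VALUES (?,?)     -> folded to
//           upper case by Oracle and matched against the existing column
val quotedSql   = s"INSERT INTO DETAIL ($quotedCols) VALUES (?,?)"
val unquotedSql = s"INSERT INTO DETAIL ($unquotedCols) VALUES (?,?)"
{code}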



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18123) org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable the case sensitivity issue

2016-10-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613453#comment-15613453
 ] 

Apache Spark commented on SPARK-18123:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/15664

> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable  the case 
> sensitivity issue
> --
>
> Key: SPARK-18123
> URL: https://issues.apache.org/jira/browse/SPARK-18123
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Paul Wu
>
> Blindly quoting every field name for inserting is the issue (Line 110-119, 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala).
> /**
>* Returns a PreparedStatement that inserts a row into table via conn.
>*/
>   def insertStatement(conn: Connection, table: String, rddSchema: StructType, 
> dialect: JdbcDialect)
>   : PreparedStatement = {
> val columns = rddSchema.fields.map(x => 
> dialect.quoteIdentifier(x.name)).mkString(",")
> val placeholders = rddSchema.fields.map(_ => "?").mkString(",")
> val sql = s"INSERT INTO $table ($columns) VALUES ($placeholders)"
> conn.prepareStatement(sql)
>   }
> This code causes the following issue (it does not happen in 1.6.x):
> I have an issue with the saveTable method in Spark 2.0/2.0.1. I tried to save a 
> dataset to an Oracle database, but the field names must be uppercase to succeed. 
> This is not expected behavior: if only the table name were quoted, this utility 
> would not run into the case sensitivity problem. The code below throws the 
> exception: Caused by: java.sql.SQLSyntaxErrorException: ORA-00904: 
> "DATETIME_gmt": invalid identifier. 
> String detailSQL ="select CAST('2016-09-25 17:00:00' AS TIMESTAMP) 
> DATETIME_gmt, '1' NODEB";
> hc.sql("set spark.sql.caseSensitive=false");
> Dataset ds = hc.sql(detailSQL);
> ds.show();
> 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable(ds, url, 
> detailTable, p);



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18123) org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable the case sensitivity issue

2016-10-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18123:


Assignee: Apache Spark

> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable  the case 
> sensitivity issue
> --
>
> Key: SPARK-18123
> URL: https://issues.apache.org/jira/browse/SPARK-18123
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Paul Wu
>Assignee: Apache Spark
>
> Blindly quoting every field name for inserting is the issue (Line 110-119, 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala).
> /**
>* Returns a PreparedStatement that inserts a row into table via conn.
>*/
>   def insertStatement(conn: Connection, table: String, rddSchema: StructType, 
> dialect: JdbcDialect)
>   : PreparedStatement = {
> val columns = rddSchema.fields.map(x => 
> dialect.quoteIdentifier(x.name)).mkString(",")
> val placeholders = rddSchema.fields.map(_ => "?").mkString(",")
> val sql = s"INSERT INTO $table ($columns) VALUES ($placeholders)"
> conn.prepareStatement(sql)
>   }
> This code causes the following issue (it does not happen in 1.6.x):
> I have an issue with the saveTable method in Spark 2.0/2.0.1. I tried to save a 
> dataset to an Oracle database, but the field names must be uppercase to succeed. 
> This is not expected behavior: if only the table name were quoted, this utility 
> would not run into the case sensitivity problem. The code below throws the 
> exception: Caused by: java.sql.SQLSyntaxErrorException: ORA-00904: 
> "DATETIME_gmt": invalid identifier. 
> String detailSQL ="select CAST('2016-09-25 17:00:00' AS TIMESTAMP) 
> DATETIME_gmt, '1' NODEB";
> hc.sql("set spark.sql.caseSensitive=false");
> Dataset ds = hc.sql(detailSQL);
> ds.show();
> 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable(ds, url, 
> detailTable, p);



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18147) Broken Spark SQL Codegen

2016-10-27 Thread koert kuipers (JIRA)
koert kuipers created SPARK-18147:
-

 Summary: Broken Spark SQL Codegen
 Key: SPARK-18147
 URL: https://issues.apache.org/jira/browse/SPARK-18147
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1
Reporter: koert kuipers
Priority: Minor


This is me deliberately trying to break Spark SQL codegen to uncover potential 
issues, by creating arbitrarily complex data structures using primitives, 
strings, basic collections, tuples, and case classes.

first example: nested case classes
code:
{noformat}
class ComplexResultAgg[B: TypeTag, C: TypeTag](val zero: B, result: C) extends 
Aggregator[Row, B, C] {
  override def reduce(b: B, input: Row): B = b

  override def merge(b1: B, b2: B): B = b1

  override def finish(reduction: B): C = result

  override def bufferEncoder: Encoder[B] = ExpressionEncoder[B]()
  override def outputEncoder: Encoder[C] = ExpressionEncoder[C]()
}

case class Struct2(d: Double = 0.0, s1: Seq[Double] = Seq.empty, s2: Seq[Long] 
= Seq.empty)

case class Struct3(a: Struct2 = Struct2(), b: Struct2 = Struct2())

val df1 = Seq(("a", "aa"), ("a", "aa"), ("b", "b"), ("b", null)).toDF("x", 
"y").groupBy("x").agg(
  new ComplexResultAgg("boo", Struct3()).toColumn
)
df1.printSchema
df1.show
{noformat}

the result is:
{noformat}
[info]   Cause: java.util.concurrent.ExecutionException: java.lang.Exception: 
failed to compile: org.codehaus.commons.compiler.CompileException: File 
'generated.java', Line 33, Column 12: Expression "isNull1" is not an rvalue
[info] /* 001 */ public java.lang.Object generate(Object[] references) {
[info] /* 002 */   return new SpecificMutableProjection(references);
[info] /* 003 */ }
[info] /* 004 */
[info] /* 005 */ class SpecificMutableProjection extends 
org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
[info] /* 006 */
[info] /* 007 */   private Object[] references;
[info] /* 008 */   private MutableRow mutableRow;
[info] /* 009 */   private Object[] values;
[info] /* 010 */   private java.lang.String errMsg;
[info] /* 011 */   private Object[] values1;
[info] /* 012 */   private java.lang.String errMsg1;
[info] /* 013 */   private boolean[] argIsNulls;
[info] /* 014 */   private scala.collection.Seq argValue;
[info] /* 015 */   private java.lang.String errMsg2;
[info] /* 016 */   private boolean[] argIsNulls1;
[info] /* 017 */   private scala.collection.Seq argValue1;
[info] /* 018 */   private java.lang.String errMsg3;
[info] /* 019 */   private java.lang.String errMsg4;
[info] /* 020 */   private Object[] values2;
[info] /* 021 */   private java.lang.String errMsg5;
[info] /* 022 */   private boolean[] argIsNulls2;
[info] /* 023 */   private scala.collection.Seq argValue2;
[info] /* 024 */   private java.lang.String errMsg6;
[info] /* 025 */   private boolean[] argIsNulls3;
[info] /* 026 */   private scala.collection.Seq argValue3;
[info] /* 027 */   private java.lang.String errMsg7;
[info] /* 028 */   private boolean isNull_0;
[info] /* 029 */   private InternalRow value_0;
[info] /* 030 */
[info] /* 031 */   private void apply_1(InternalRow i) {
[info] /* 032 */
[info] /* 033 */ if (isNull1) {
[info] /* 034 */   throw new RuntimeException(errMsg3);
[info] /* 035 */ }
[info] /* 036 */
[info] /* 037 */ boolean isNull24 = false;
[info] /* 038 */ final com.tresata.spark.sql.Struct2 value24 = isNull24 ? 
null : (com.tresata.spark.sql.Struct2) value1.a();
[info] /* 039 */ isNull24 = value24 == null;
[info] /* 040 */
[info] /* 041 */ boolean isNull23 = isNull24;
[info] /* 042 */ final scala.collection.Seq value23 = isNull23 ? null : 
(scala.collection.Seq) value24.s2();
[info] /* 043 */ isNull23 = value23 == null;
[info] /* 044 */ argIsNulls1[0] = isNull23;
[info] /* 045 */ argValue1 = value23;
[info] /* 046 */
[info] /* 047 */
[info] /* 048 */
[info] /* 049 */ boolean isNull22 = false;
[info] /* 050 */ for (int idx = 0; idx < 1; idx++) {
[info] /* 051 */   if (argIsNulls1[idx]) { isNull22 = true; break; }
[info] /* 052 */ }
[info] /* 053 */
[info] /* 054 */ final ArrayData value22 = isNull22 ? null : new 
org.apache.spark.sql.catalyst.util.GenericArrayData(argValue1);
[info] /* 055 */ if (isNull22) {
[info] /* 056 */   values1[2] = null;
[info] /* 057 */ } else {
[info] /* 058 */   values1[2] = value22;
[info] /* 059 */ }
[info] /* 060 */   }
[info] /* 061 */
[info] /* 062 */
[info] /* 063 */   private void apply1_1(InternalRow i) {
[info] /* 064 */
[info] /* 065 */ if (isNull1) {
[info] /* 066 */   throw new RuntimeException(errMsg7);
[info] /* 067 */ }
[info] /* 068 */
[info] /* 069 */ boolean isNull41 = false;
[info] /* 070 */ final com.tresata.spark.sql.Struct2 value41 = isNull41 ? 
null : (com.tresata.spark.sql.Struct2) value1.b();
[info] /* 071 */ isNull41 = value41 == null;

[jira] [Created] (SPARK-18146) Avoid using Union to chain together create table and repair partition commands

2016-10-27 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18146:
--

 Summary: Avoid using Union to chain together create table and 
repair partition commands
 Key: SPARK-18146
 URL: https://issues.apache.org/jira/browse/SPARK-18146
 Project: Spark
  Issue Type: Sub-task
Reporter: Eric Liang
Priority: Minor


The behavior of union is not well defined here. We should add an internal 
command to execute these commands sequentially.
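
A minimal sketch of the idea (the class name and shape below are hypothetical, 
not an actual Spark API): a single internal command that runs its children 
strictly one after another, instead of relying on Union over side-effecting 
commands.
{code}
import org.apache.spark.sql.SparkSession

// Hypothetical sketch only: execute the child actions sequentially.
case class RunSequentially(children: Seq[SparkSession => Unit]) {
  def run(spark: SparkSession): Unit = children.foreach(_(spark))
}

// e.g. RunSequentially(Seq(
//   s => s.sql("CREATE TABLE t (a INT) PARTITIONED BY (p INT)"),
//   s => s.sql("MSCK REPAIR TABLE t"))).run(spark)
{code}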



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18145) Update documentation

2016-10-27 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18145:
---
Issue Type: Sub-task  (was: Documentation)
Parent: SPARK-17861

> Update documentation
> 
>
> Key: SPARK-18145
> URL: https://issues.apache.org/jira/browse/SPARK-18145
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18145) Update documentation for hive partition management in 2.1

2016-10-27 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18145:
---
Component/s: SQL

> Update documentation for hive partition management in 2.1
> -
>
> Key: SPARK-18145
> URL: https://issues.apache.org/jira/browse/SPARK-18145
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18145) Update documentation for hive partition management in 2.1

2016-10-27 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18145:
---
Summary: Update documentation for hive partition management in 2.1  (was: 
Update documentation)

> Update documentation for hive partition management in 2.1
> -
>
> Key: SPARK-18145
> URL: https://issues.apache.org/jira/browse/SPARK-18145
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18145) Update documentation

2016-10-27 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18145:
--

 Summary: Update documentation
 Key: SPARK-18145
 URL: https://issues.apache.org/jira/browse/SPARK-18145
 Project: Spark
  Issue Type: Documentation
Reporter: Eric Liang






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17970) Use metastore for managing filesource table partitions as well

2016-10-27 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-17970.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15515
[https://github.com/apache/spark/pull/15515]

> Use metastore for managing filesource table partitions as well
> --
>
> Key: SPARK-17970
> URL: https://issues.apache.org/jira/browse/SPARK-17970
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17829) Stable format for offset log

2016-10-27 Thread Tyson Condie (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613253#comment-15613253
 ] 

Tyson Condie commented on SPARK-17829:
--

Thanks Code for the clarification. My background is mostly in Java and I 
appreciate you pointing out this alternative solution.

> Stable format for offset log
> 
>
> Key: SPARK-17829
> URL: https://issues.apache.org/jira/browse/SPARK-17829
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Tyson Condie
>
> Currently we use Java serialization for the WAL that stores the offsets 
> contained in each batch.  This has two main issues:
>  - It can break across Spark releases (though this is not the only thing 
> preventing us from upgrading a running query)
>  - It is unnecessarily opaque to the user.
> I'd propose we require offsets to provide a user-readable serialization and 
> use that instead.  JSON is probably a good option.
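
A minimal sketch of the proposal (the trait and class names here are 
hypothetical, not the actual Spark API): each offset exposes a human-readable 
JSON string, and the WAL stores that string instead of a java-serialized object.
{code}
// Hypothetical sketch only: a JSON-serializable offset for the offset WAL.
trait JsonSerializableOffset {
  def json: String
}

case class PartitionedOffset(offsets: Map[Int, Long]) extends JsonSerializableOffset {
  // e.g. {"0":42,"1":17}
  def json: String =
    offsets.map { case (p, o) => "\"" + p + "\":" + o }.mkString("{", ",", "}")
}
{code}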



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-17891) SQL-based three column join loses first column

2016-10-27 Thread Yuming Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-17891:

Comment: was deleted

(was: *Workaround:*
# Disable BroadcastHashJoin  by setting 
{{spark.sql.autoBroadcastJoinThreshold=-1}}
# Convert join keys to StringType )

> SQL-based three column join loses first column
> --
>
> Key: SPARK-17891
> URL: https://issues.apache.org/jira/browse/SPARK-17891
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
>Reporter: Eli Miller
> Attachments: test.tgz
>
>
> Hi all,
> I hope that this is not a known issue (I haven't had any luck finding 
> anything similar in Jira or the mailing lists, but I could be searching with 
> the wrong terms). I just started to experiment with Spark SQL and am seeing 
> what appears to be a bug. When using Spark SQL to join two tables with a 
> three-column inner join, the first join column is ignored. The example code 
> that I have starts with two tables, *T1*:
> {noformat}
> +---+---+---+---+
> |  a|  b|  c|  d|
> +---+---+---+---+
> |  1|  2|  3|  4|
> +---+---+---+---+
> {noformat}
> and *T2*:
> {noformat}
> +---+---+---+---+
> |  b|  c|  d|  e|
> +---+---+---+---+
> |  2|  3|  4|  5|
> | -2|  3|  4|  6|
> |  2| -3|  4|  7|
> +---+---+---+---+
> {noformat}
> Joining *T1* to *T2* on *b*, *c* and *d* (in that order):
> {code:sql}
> SELECT t1.a, t1.b, t2.b, t1.c,t2.c, t1.d, t2.d, t2.e
> FROM t1, t2
> WHERE t1.b = t2.b AND t1.c = t2.c AND t1.d = t2.d
> {code}
> results in the following (note that *T1.b* != *T2.b* in the first row):
> {noformat}
> +---+---+---+---+---+---+---+---+
> |  a|  b|  b|  c|  c|  d|  d|  e|
> +---+---+---+---+---+---+---+---+
> |  1|  2| -2|  3|  3|  4|  4|  6|
> |  1|  2|  2|  3|  3|  4|  4|  5|
> +---+---+---+---+---+---+---+---+
> {noformat}
> Switching the predicate order to *c*, *b* and *d*:
> {code:sql}
> SELECT t1.a, t1.b, t2.b, t1.c,t2.c, t1.d, t2.d, t2.e
> FROM t1, t2
> WHERE t1.c = t2.c AND t1.b = t2.b AND t1.d = t2.d
> {code}
> yields different results (now *T1.c* != *T2.c* in the first row):
> {noformat}
> +---+---+---+---+---+---+---+---+
> |  a|  b|  b|  c|  c|  d|  d|  e|
> +---+---+---+---+---+---+---+---+
> |  1|  2|  2|  3| -3|  4|  4|  7|
> |  1|  2|  2|  3|  3|  4|  4|  5|
> +---+---+---+---+---+---+---+---+
> {noformat}
> Is this expected?
> I started to research this a bit and one thing that jumped out at me was the 
> ordering of the HashedRelationBroadcastMode concatenation in the plan (this 
> is from the *b*, *c*, *d* ordering):
> {noformat}
> ...
> *Project [a#0, b#1, b#9, c#2, c#10, d#3, d#11, e#12]
> +- *BroadcastHashJoin [b#1, c#2, d#3], [b#9, c#10, d#11], Inner, BuildRight
>:- *Project [a#0, b#1, c#2, d#3]
>:  +- *Filter ((isnotnull(b#1) && isnotnull(c#2)) && isnotnull(d#3))
>: +- *Scan csv [a#0,b#1,c#2,d#3] Format: CSV, InputPaths: 
> file:/home/eli/git/IENG/what/target/classes/t1.csv, PartitionFilters: [], 
> PushedFilters: [IsNotNull(b), IsNotNull(c), IsNotNull(d)], ReadSchema: 
> struct
>+- BroadcastExchange 
> HashedRelationBroadcastMode(List((shiftleft((shiftleft(cast(input[0, int, 
> true] as bigint), 32) | (cast(input[1, int, true] as bigint) & 4294967295)), 
> 32) | (cast(input[2, int, true] as bigint) & 4294967295
>   +- *Project [b#9, c#10, d#11, e#12]
>  +- *Filter ((isnotnull(c#10) && isnotnull(b#9)) && isnotnull(d#11))
> +- *Scan csv [b#9,c#10,d#11,e#12] Format: CSV, InputPaths: 
> file:/home/eli/git/IENG/what/target/classes/t2.csv, PartitionFilters: [], 
> PushedFilters: [IsNotNull(c), IsNotNull(b), IsNotNull(d)], ReadSchema: 
> struct]
> {noformat}
> If this concatenated byte array is ever truncated to 64 bits in a comparison, 
> the leading column will be lost and could result in this behavior.
> I will attach my example code and data. Please let me know if I can provide 
> any other details.
> Best regards,
> Eli
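
As a rough illustration of the truncation concern above (a sketch only, not 
Spark's actual code): packing three 32-bit keys with two 32-bit shifts needs 96 
bits, so if the packed value is ever held in a 64-bit long, the leading column 
is shifted out and only the last two keys take part in the comparison.
{code}
// Sketch only: pack (b, c, d) the way the broadcast-mode expression above does,
// but into a 64-bit long. The leading key b is shifted out of range.
def pack(b: Int, c: Int, d: Int): Long =
  (((b.toLong << 32) | (c.toLong & 0xFFFFFFFFL)) << 32) | (d.toLong & 0xFFFFFFFFL)

// b = 2 and b = -2 collide, matching the unexpected first row in the result:
pack(2, 3, 4) == pack(-2, 3, 4)   // true
{code}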



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10561) Provide tooling for auto-generating Spark SQL reference manual

2016-10-27 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated SPARK-10561:
---
Labels: tool  (was: )

> Provide tooling for auto-generating Spark SQL reference manual
> --
>
> Key: SPARK-10561
> URL: https://issues.apache.org/jira/browse/SPARK-10561
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Reporter: Ted Yu
>  Labels: tool
>
> Here is the discussion thread:
> http://search-hadoop.com/m/q3RTtcD20F1o62xE
> Richard Hillegas made the following suggestion:
> A machine-generated BNF, however, is easy to imagine. But perhaps not so easy 
> to implement. Spark's SQL grammar is implemented in Scala, extending the DSL 
> support provided by the Scala language. I am new to programming in Scala, so 
> I don't know whether the Scala ecosystem provides any good tools for 
> reverse-engineering a BNF from a class which extends 
> scala.util.parsing.combinator.syntactical.StandardTokenParsers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16137) Random Forest wrapper in SparkR

2016-10-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-16137:
--
Shepherd:   (was: Joseph K. Bradley)

> Random Forest wrapper in SparkR
> ---
>
> Key: SPARK-16137
> URL: https://issues.apache.org/jira/browse/SPARK-16137
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Affects Versions: 2.1.0
>Reporter: Kai Jiang
>
> Implement a wrapper in SparkR to support Random Forest.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18109) Log instrumentation in GMM

2016-10-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18109:
--
Shepherd: Joseph K. Bradley

> Log instrumentation in GMM
> --
>
> Key: SPARK-18109
> URL: https://issues.apache.org/jira/browse/SPARK-18109
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: zhengruifeng
>
> Add log instrumentation in GMM



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18109) Log instrumentation in GMM

2016-10-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18109:
--
Assignee: zhengruifeng

> Log instrumentation in GMM
> --
>
> Key: SPARK-18109
> URL: https://issues.apache.org/jira/browse/SPARK-18109
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>
> Add log instrumentation in GMM



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18125) Spark generated code causes CompileException when groupByKey, reduceGroups and map(_._2) are used

2016-10-27 Thread Ray Qiu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613099#comment-15613099
 ] 

Ray Qiu commented on SPARK-18125:
-

Moving this to Priority Critical unless a workaround is identified. This is very 
basic functionality.
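
For reference, the shape of the failing pattern is roughly the following (the 
case class and names are illustrative and assume a SparkSession named spark; 
they are not taken from the report):
{code}
// Illustrative only: groupByKey + reduceGroups + map(_._2) on a typed Dataset.
import spark.implicits._

case class Event(key: String, count: Long)

val ds = Seq(Event("a", 1L), Event("a", 2L), Event("b", 3L)).toDS()

val reduced = ds
  .groupByKey(_.key)
  .reduceGroups((x, y) => if (x.count >= y.count) x else y)
  .map(_._2)   // keep only the reduced value, dropping the key
{code}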

> Spark generated code causes CompileException when groupByKey, reduceGroups 
> and map(_._2) are used
> -
>
> Key: SPARK-18125
> URL: https://issues.apache.org/jira/browse/SPARK-18125
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
>Reporter: Ray Qiu
>Priority: Critical
>
> Code logic looks like this:
> .groupByKey
> .reduceGroups
> .map(_._2)
> Works fine with 2.0.0.
> 2.0.1 error Message: 
> Caused by: java.util.concurrent.ExecutionException: java.lang.Exception: 
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 206, Column 123: Unknown variable or type "value4"
> /* 001 */ public java.lang.Object generate(Object[] references) {
> /* 002 */   return new SpecificMutableProjection(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificMutableProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */   private MutableRow mutableRow;
> /* 009 */   private Object[] values;
> /* 010 */   private java.lang.String errMsg;
> /* 011 */   private java.lang.String errMsg1;
> /* 012 */   private boolean MapObjects_loopIsNull1;
> /* 013 */   private io.mistnet.analytics.lib.ConnLog MapObjects_loopValue0;
> /* 014 */   private java.lang.String errMsg2;
> /* 015 */   private Object[] values1;
> /* 016 */   private boolean MapObjects_loopIsNull3;
> /* 017 */   private java.lang.String MapObjects_loopValue2;
> /* 018 */   private boolean isNull_0;
> /* 019 */   private boolean value_0;
> /* 020 */   private boolean isNull_1;
> /* 021 */   private InternalRow value_1;
> /* 022 */
> /* 023 */   private void apply_4(InternalRow i) {
> /* 024 */
> /* 025 */ boolean isNull52 = MapObjects_loopIsNull1;
> /* 026 */ final double value52 = isNull52 ? -1.0 : 
> MapObjects_loopValue0.ts();
> /* 027 */ if (isNull52) {
> /* 028 */   values1[8] = null;
> /* 029 */ } else {
> /* 030 */   values1[8] = value52;
> /* 031 */ }
> /* 032 */ boolean isNull54 = MapObjects_loopIsNull1;
> /* 033 */ final java.lang.String value54 = isNull54 ? null : 
> (java.lang.String) MapObjects_loopValue0.uid();
> /* 034 */ isNull54 = value54 == null;
> /* 035 */ boolean isNull53 = isNull54;
> /* 036 */ final UTF8String value53 = isNull53 ? null : 
> org.apache.spark.unsafe.types.UTF8String.fromString(value54);
> /* 037 */ isNull53 = value53 == null;
> /* 038 */ if (isNull53) {
> /* 039 */   values1[9] = null;
> /* 040 */ } else {
> /* 041 */   values1[9] = value53;
> /* 042 */ }
> /* 043 */ boolean isNull56 = MapObjects_loopIsNull1;
> /* 044 */ final java.lang.String value56 = isNull56 ? null : 
> (java.lang.String) MapObjects_loopValue0.src();
> /* 045 */ isNull56 = value56 == null;
> /* 046 */ boolean isNull55 = isNull56;
> /* 047 */ final UTF8String value55 = isNull55 ? null : 
> org.apache.spark.unsafe.types.UTF8String.fromString(value56);
> /* 048 */ isNull55 = value55 == null;
> /* 049 */ if (isNull55) {
> /* 050 */   values1[10] = null;
> /* 051 */ } else {
> /* 052 */   values1[10] = value55;
> /* 053 */ }
> /* 054 */   }
> /* 055 */
> /* 056 */
> /* 057 */   private void apply_7(InternalRow i) {
> /* 058 */
> /* 059 */ boolean isNull69 = MapObjects_loopIsNull1;
> /* 060 */ final scala.Option value69 = isNull69 ? null : (scala.Option) 
> MapObjects_loopValue0.orig_bytes();
> /* 061 */ isNull69 = value69 == null;
> /* 062 */
> /* 063 */ final boolean isNull68 = isNull69 || value69.isEmpty();
> /* 064 */ long value68 = isNull68 ?
> /* 065 */ -1L : (Long) value69.get();
> /* 066 */ if (isNull68) {
> /* 067 */   values1[17] = null;
> /* 068 */ } else {
> /* 069 */   values1[17] = value68;
> /* 070 */ }
> /* 071 */ boolean isNull71 = MapObjects_loopIsNull1;
> /* 072 */ final scala.Option value71 = isNull71 ? null : (scala.Option) 
> MapObjects_loopValue0.resp_bytes();
> /* 073 */ isNull71 = value71 == null;
> /* 074 */
> /* 075 */ final boolean isNull70 = isNull71 || value71.isEmpty();
> /* 076 */ long value70 = isNull70 ?
> /* 077 */ -1L : (Long) value71.get();
> /* 078 */ if (isNull70) {
> /* 079 */   values1[18] = null;
> /* 080 */ } else {
> /* 081 */   values1[18] = value70;
> /* 082 */ }
> /* 083 */ boolean isNull74 = MapObjects_loopIsNull1;

[jira] [Updated] (SPARK-18125) Spark generated code causes CompileException when groupByKey, reduceGroups and map(_._2) are used

2016-10-27 Thread Ray Qiu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Qiu updated SPARK-18125:

Priority: Critical  (was: Major)

> Spark generated code causes CompileException when groupByKey, reduceGroups 
> and map(_._2) are used
> -
>
> Key: SPARK-18125
> URL: https://issues.apache.org/jira/browse/SPARK-18125
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
>Reporter: Ray Qiu
>Priority: Critical
>
> Code logic looks like this:
> .groupByKey
> .reduceGroups
> .map(_._2)
> Works fine with 2.0.0.
> 2.0.1 error Message: 
> Caused by: java.util.concurrent.ExecutionException: java.lang.Exception: 
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 206, Column 123: Unknown variable or type "value4"
> /* 001 */ public java.lang.Object generate(Object[] references) {
> /* 002 */   return new SpecificMutableProjection(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificMutableProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */   private MutableRow mutableRow;
> /* 009 */   private Object[] values;
> /* 010 */   private java.lang.String errMsg;
> /* 011 */   private java.lang.String errMsg1;
> /* 012 */   private boolean MapObjects_loopIsNull1;
> /* 013 */   private io.mistnet.analytics.lib.ConnLog MapObjects_loopValue0;
> /* 014 */   private java.lang.String errMsg2;
> /* 015 */   private Object[] values1;
> /* 016 */   private boolean MapObjects_loopIsNull3;
> /* 017 */   private java.lang.String MapObjects_loopValue2;
> /* 018 */   private boolean isNull_0;
> /* 019 */   private boolean value_0;
> /* 020 */   private boolean isNull_1;
> /* 021 */   private InternalRow value_1;
> /* 022 */
> /* 023 */   private void apply_4(InternalRow i) {
> /* 024 */
> /* 025 */ boolean isNull52 = MapObjects_loopIsNull1;
> /* 026 */ final double value52 = isNull52 ? -1.0 : 
> MapObjects_loopValue0.ts();
> /* 027 */ if (isNull52) {
> /* 028 */   values1[8] = null;
> /* 029 */ } else {
> /* 030 */   values1[8] = value52;
> /* 031 */ }
> /* 032 */ boolean isNull54 = MapObjects_loopIsNull1;
> /* 033 */ final java.lang.String value54 = isNull54 ? null : 
> (java.lang.String) MapObjects_loopValue0.uid();
> /* 034 */ isNull54 = value54 == null;
> /* 035 */ boolean isNull53 = isNull54;
> /* 036 */ final UTF8String value53 = isNull53 ? null : 
> org.apache.spark.unsafe.types.UTF8String.fromString(value54);
> /* 037 */ isNull53 = value53 == null;
> /* 038 */ if (isNull53) {
> /* 039 */   values1[9] = null;
> /* 040 */ } else {
> /* 041 */   values1[9] = value53;
> /* 042 */ }
> /* 043 */ boolean isNull56 = MapObjects_loopIsNull1;
> /* 044 */ final java.lang.String value56 = isNull56 ? null : 
> (java.lang.String) MapObjects_loopValue0.src();
> /* 045 */ isNull56 = value56 == null;
> /* 046 */ boolean isNull55 = isNull56;
> /* 047 */ final UTF8String value55 = isNull55 ? null : 
> org.apache.spark.unsafe.types.UTF8String.fromString(value56);
> /* 048 */ isNull55 = value55 == null;
> /* 049 */ if (isNull55) {
> /* 050 */   values1[10] = null;
> /* 051 */ } else {
> /* 052 */   values1[10] = value55;
> /* 053 */ }
> /* 054 */   }
> /* 055 */
> /* 056 */
> /* 057 */   private void apply_7(InternalRow i) {
> /* 058 */
> /* 059 */ boolean isNull69 = MapObjects_loopIsNull1;
> /* 060 */ final scala.Option value69 = isNull69 ? null : (scala.Option) 
> MapObjects_loopValue0.orig_bytes();
> /* 061 */ isNull69 = value69 == null;
> /* 062 */
> /* 063 */ final boolean isNull68 = isNull69 || value69.isEmpty();
> /* 064 */ long value68 = isNull68 ?
> /* 065 */ -1L : (Long) value69.get();
> /* 066 */ if (isNull68) {
> /* 067 */   values1[17] = null;
> /* 068 */ } else {
> /* 069 */   values1[17] = value68;
> /* 070 */ }
> /* 071 */ boolean isNull71 = MapObjects_loopIsNull1;
> /* 072 */ final scala.Option value71 = isNull71 ? null : (scala.Option) 
> MapObjects_loopValue0.resp_bytes();
> /* 073 */ isNull71 = value71 == null;
> /* 074 */
> /* 075 */ final boolean isNull70 = isNull71 || value71.isEmpty();
> /* 076 */ long value70 = isNull70 ?
> /* 077 */ -1L : (Long) value71.get();
> /* 078 */ if (isNull70) {
> /* 079 */   values1[18] = null;
> /* 080 */ } else {
> /* 081 */   values1[18] = value70;
> /* 082 */ }
> /* 083 */ boolean isNull74 = MapObjects_loopIsNull1;
> /* 084 */ final scala.Option value74 = isNull74 ? null : (scala.Option) 
> 

[jira] [Comment Edited] (SPARK-18123) org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable the case sensitivity issue

2016-10-27 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613088#comment-15613088
 ] 

Dongjoon Hyun edited comment on SPARK-18123 at 10/27/16 8:20 PM:
-

Hi, [~zwu@gmail.com].

There are two related things about this. Actually, the current implementation 
already supports the following:
- Allow a reserved keyword as a field name, e.g., 'order'.
- Allow a mixed-case data frame like the following, e.g., `[a: int, A: int]`.
{code}
scala> val df = sql("select 1 a, 1 A")
df: org.apache.spark.sql.DataFrame = [a: int, A: int]
scala> val option = Map("url" -> "jdbc:postgresql:postgres", "dbtable" -> 
"mixed", "user" -> "postgres", "password" -> "test")
scala> df.write.mode("overwrite").format("jdbc").options(option).save()
scala> df.write.mode("append").format("jdbc").options(option).save()
{code}

Your case should be supported without any regression in the above cases. Your 
case might be the following:
{code}
val df1 = sql("select 1 a")
val df2 = sql("select 1 A")

val option = Map("url" -> "jdbc:postgresql:postgres", "dbtable" -> "tx", "user" 
-> "postgres", "password" -> "test")

df1.write.mode("overwrite").format("jdbc").options(option).save()
df2.write.mode("append").format("jdbc").options(option).save()
{code}

I'm testing more cases to solve all three cases including yours.


was (Author: dongjoon):
Hi, [~zwu@gmail.com].

There is two related things about this. Actually, current implementation 
supports the followings.
- Allow reserved keyword as a field name, e.g., 'order'.
- Allow mixed-case data frame like the following.
{code}
scala> val df = sql("select 1 a, 1 A")
df: org.apache.spark.sql.DataFrame = [a: int, A: int]
scala> val option = Map("url" -> "jdbc:postgresql:postgres", "dbtable" -> 
"mixed", "user" -> "postgres", "password" -> "test")
scala> df.write.mode("overwrite").format("jdbc").options(option).save()
scala> df.write.mode("append").format("jdbc").options(option).save()
{code}

Your case should be supported without any regression the above cases. Your case 
might be the following.
{code}
val df1 = sql("select 1 a")
val df2 = sql("select 1 A")

val option = Map("url" -> "jdbc:postgresql:postgres", "dbtable" -> "tx", "user" 
-> "postgres", "password" -> "test")

df1.write.mode("overwrite").format("jdbc").options(option).save()
df2.write.mode("append").format("jdbc").options(option).save()
{code}

I'm testing more cases to solve all three cases including yours.

> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable  the case 
> sensitivity issue
> --
>
> Key: SPARK-18123
> URL: https://issues.apache.org/jira/browse/SPARK-18123
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Paul Wu
>
> Blindly quoting every field name for inserting is the issue (Line 110-119, 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala).
> /**
>* Returns a PreparedStatement that inserts a row into table via conn.
>*/
>   def insertStatement(conn: Connection, table: String, rddSchema: StructType, 
> dialect: JdbcDialect)
>   : PreparedStatement = {
> val columns = rddSchema.fields.map(x => 
> dialect.quoteIdentifier(x.name)).mkString(",")
> val placeholders = rddSchema.fields.map(_ => "?").mkString(",")
> val sql = s"INSERT INTO $table ($columns) VALUES ($placeholders)"
> conn.prepareStatement(sql)
>   }
> This code causes the following issue (it does not happen in 1.6.x):
> I have an issue with the saveTable method in Spark 2.0/2.0.1. I tried to save a 
> dataset to an Oracle database, but the field names must be uppercase to succeed. 
> This is not expected behavior: if only the table name were quoted, this utility 
> would not run into the case sensitivity problem. The code below throws the 
> exception: Caused by: java.sql.SQLSyntaxErrorException: ORA-00904: 
> "DATETIME_gmt": invalid identifier. 
> String detailSQL ="select CAST('2016-09-25 17:00:00' AS TIMESTAMP) 
> DATETIME_gmt, '1' NODEB";
> hc.sql("set spark.sql.caseSensitive=false");
> Dataset ds = hc.sql(detailSQL);
> ds.show();
> 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable(ds, url, 
> detailTable, p);



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18123) org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable the case sensitivity issue

2016-10-27 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613088#comment-15613088
 ] 

Dongjoon Hyun commented on SPARK-18123:
---

Hi, [~zwu@gmail.com].

There are two related things about this. Actually, the current implementation 
already supports the following:
- Allow a reserved keyword as a field name, e.g., 'order'.
- Allow a mixed-case data frame like the following.
{code}
scala> val df = sql("select 1 a, 1 A")
df: org.apache.spark.sql.DataFrame = [a: int, A: int]
scala> val option = Map("url" -> "jdbc:postgresql:postgres", "dbtable" -> 
"mixed", "user" -> "postgres", "password" -> "test")
scala> df.write.mode("overwrite").format("jdbc").options(option).save()
scala> df.write.mode("append").format("jdbc").options(option).save()
{code}

Your case should be supported without any regression in the above cases. Your 
case might be the following:
{code}
val df1 = sql("select 1 a")
val df2 = sql("select 1 A")

val option = Map("url" -> "jdbc:postgresql:postgres", "dbtable" -> "tx", "user" 
-> "postgres", "password" -> "test")

df1.write.mode("overwrite").format("jdbc").options(option).save()
df2.write.mode("append").format("jdbc").options(option).save()
{code}

I'm testing more cases to solve all three cases including yours.

> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable  the case 
> sensitivity issue
> --
>
> Key: SPARK-18123
> URL: https://issues.apache.org/jira/browse/SPARK-18123
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Paul Wu
>
> Blindly quoting every field name for inserting is the issue (Line 110-119, 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala).
> /**
>* Returns a PreparedStatement that inserts a row into table via conn.
>*/
>   def insertStatement(conn: Connection, table: String, rddSchema: StructType, 
> dialect: JdbcDialect)
>   : PreparedStatement = {
> val columns = rddSchema.fields.map(x => 
> dialect.quoteIdentifier(x.name)).mkString(",")
> val placeholders = rddSchema.fields.map(_ => "?").mkString(",")
> val sql = s"INSERT INTO $table ($columns) VALUES ($placeholders)"
> conn.prepareStatement(sql)
>   }
> This code causes the following issue (it does not happen in 1.6.x):
> I have an issue with the saveTable method in Spark 2.0/2.0.1. I tried to save a 
> dataset to an Oracle database, but the field names must be uppercase to succeed. 
> This is not expected behavior: if only the table name were quoted, this utility 
> would not run into the case sensitivity problem. The code below throws the 
> exception: Caused by: java.sql.SQLSyntaxErrorException: ORA-00904: 
> "DATETIME_gmt": invalid identifier. 
> String detailSQL ="select CAST('2016-09-25 17:00:00' AS TIMESTAMP) 
> DATETIME_gmt, '1' NODEB";
> hc.sql("set spark.sql.caseSensitive=false");
> Dataset ds = hc.sql(detailSQL);
> ds.show();
> 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable(ds, url, 
> detailTable, p);



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18144) StreamingQueryListener.QueryStartedEvent is not written to event log

2016-10-27 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-18144:
-
Affects Version/s: 2.0.0
   2.0.1

> StreamingQueryListener.QueryStartedEvent is not written to event log
> 
>
> Key: SPARK-18144
> URL: https://issues.apache.org/jira/browse/SPARK-18144
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18144) StreamingQueryListener.QueryStartedEvent is not written to event log

2016-10-27 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-18144:


 Summary: StreamingQueryListener.QueryStartedEvent is not written 
to event log
 Key: SPARK-18144
 URL: https://issues.apache.org/jira/browse/SPARK-18144
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Shixiong Zhu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18143) History Server is broken because of the refactoring work in Structured Streaming

2016-10-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18143:


Assignee: (was: Apache Spark)

> History Server is broken because of the refactoring work in Structured 
> Streaming
> 
>
> Key: SPARK-18143
> URL: https://issues.apache.org/jira/browse/SPARK-18143
> Project: Spark
>  Issue Type: Sub-task
>Affects Versions: 2.0.2
>Reporter: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18143) History Server is broken because of the refactoring work in Structured Streaming

2016-10-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15612998#comment-15612998
 ] 

Apache Spark commented on SPARK-18143:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/15663

> History Server is broken because of the refactoring work in Structured 
> Streaming
> 
>
> Key: SPARK-18143
> URL: https://issues.apache.org/jira/browse/SPARK-18143
> Project: Spark
>  Issue Type: Sub-task
>Affects Versions: 2.0.2
>Reporter: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18143) History Server is broken because of the refactoring work in Structured Streaming

2016-10-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18143:


Assignee: Apache Spark

> History Server is broken because of the refactoring work in Structured 
> Streaming
> 
>
> Key: SPARK-18143
> URL: https://issues.apache.org/jira/browse/SPARK-18143
> Project: Spark
>  Issue Type: Sub-task
>Affects Versions: 2.0.2
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18143) History Server is broken because of the refactoring work in Structured Streaming

2016-10-27 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-18143:
-
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-8360

> History Server is broken because of the refactoring work in Structured 
> Streaming
> 
>
> Key: SPARK-18143
> URL: https://issues.apache.org/jira/browse/SPARK-18143
> Project: Spark
>  Issue Type: Sub-task
>Affects Versions: 2.0.2
>Reporter: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18143) History Server is broken because of the refactoring work in Structured Streaming

2016-10-27 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-18143:


 Summary: History Server is broken because of the refactoring work 
in Structured Streaming
 Key: SPARK-18143
 URL: https://issues.apache.org/jira/browse/SPARK-18143
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.2
Reporter: Shixiong Zhu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.

2016-10-27 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15612937#comment-15612937
 ] 

Franck Tago edited comment on SPARK-15616 at 10/27/16 7:35 PM:
---

Hi,

In my case the filter is on a partition key, so would I need your changes in 
order to see a broadcast join?


was (Author: tafra...@gmail.com):
Hi 

In my case  the filter is on a partition key , so  i would  need these changes 
in order to see a  broadcast join ?

> Metastore relation should fallback to HDFS size of partitions that are 
> involved in Query if statistics are not available.
> -
>
> Key: SPARK-15616
> URL: https://issues.apache.org/jira/browse/SPARK-15616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> Currently, if some partitions of a partitioned table are used in a join 
> operation, we rely on the table size returned by the Metastore to decide 
> whether the operation can be converted to a broadcast join. 
> If a filter can prune some partitions, Hive prunes them before deciding on a 
> broadcast join, based on the HDFS size of only the partitions involved in 
> the query. Spark SQL needs the same behavior, which would improve join 
> performance for partitioned tables.
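
For illustration only, a rough sketch of the fallback described above: sum the 
on-disk size of just the partitions a query touches and compare it to the 
broadcast threshold (10 MB mirrors the default spark.sql.autoBroadcastJoinThreshold). 
The object and method names are made up, not the actual Spark change.

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical helper: estimate the HDFS size of the partitions involved in a
// query and decide whether a broadcast join is feasible.
object PartitionSizeSketch {
  def totalSizeBytes(partitionPaths: Seq[String], conf: Configuration): Long =
    partitionPaths.map { p =>
      val path = new Path(p)
      path.getFileSystem(conf).getContentSummary(path).getLength
    }.sum

  def canBroadcast(partitionPaths: Seq[String],
                   conf: Configuration,
                   thresholdBytes: Long = 10L * 1024 * 1024): Boolean =
    totalSizeBytes(partitionPaths, conf) <= thresholdBytes
}
{code}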



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.

2016-10-27 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15612937#comment-15612937
 ] 

Franck Tago commented on SPARK-15616:
-

Hi,

In my case the filter is on a partition key, so would I need these changes in 
order to see a broadcast join?

> Metastore relation should fallback to HDFS size of partitions that are 
> involved in Query if statistics are not available.
> -
>
> Key: SPARK-15616
> URL: https://issues.apache.org/jira/browse/SPARK-15616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> Currently, if some partitions of a partitioned table are used in a join 
> operation, we rely on the table size returned by the Metastore to decide 
> whether the operation can be converted to a broadcast join. 
> If a filter can prune some partitions, Hive prunes them before deciding on a 
> broadcast join, based on the HDFS size of only the partitions involved in 
> the query. Spark SQL needs the same behavior, which would improve join 
> performance for partitioned tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18142) Spark Master tries to launch workers 145 times within 1 minute

2016-10-27 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-18142:
---

 Summary: Spark Master tries to launch workers 145 times within 1 
minute
 Key: SPARK-18142
 URL: https://issues.apache.org/jira/browse/SPARK-18142
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.1
Reporter: Burak Yavuz


I observed a case where an instance running a worker was killed. The Spark 
Master kept trying to launch new executors on that instance even though it no 
longer existed, failed 145 times within 1 minute, and then killed the 
application.

The instance takes ~10 minutes to be replaced. The Master should at least apply 
an exponential backoff when performing these retries, so that the 
infrastructure has time to recover.

{code}
16/10/27 17:31:18 INFO Master: Removing executor app-20161027124929-/3 
because it is EXITED
16/10/27 17:31:18 INFO Master: Launching executor app-20161027124929-/4 on 
worker worker-20161027124917-10.0.43.232-60886
16/10/27 17:31:18 WARN Master: Got status update for unknown executor 
app-20161027124929-/3
16/10/27 17:31:18 INFO Master: Removing executor app-20161027124929-/4 
because it is FAILED
16/10/27 17:31:18 INFO Master: Launching executor app-20161027124929-/5 on 
worker worker-20161027124917-10.0.43.232-60886
16/10/27 17:31:18 INFO Master: Removing executor app-20161027124929-/5 
because it is FAILED

...

16/10/27 17:31:37 INFO Master: 10.0.70.32:32829 got disassociated, removing it.
16/10/27 17:31:37 INFO Master: 10.0.70.32:40523 got disassociated, removing it.
16/10/27 17:31:37 INFO Master: Removing worker 
worker-20161027124917-10.0.70.32-40523 on 10.0.70.32:40523
16/10/27 17:31:37 INFO Master: Telling app of lost executor: 147
16/10/27 17:32:30 INFO Master: Removing executor app-20161027124929-/0 
because it is FAILED
16/10/27 17:32:30 ERROR Master: Application  with ID 
app-20161027124929- failed 145 times; removing it
{code}
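
A minimal sketch of the exponential backoff suggested above, with illustrative 
names and constants; this is not the Master's actual retry code.

{code}
// Hypothetical retry policy: the delay doubles with every failed launch
// attempt and is capped, instead of retrying 145 times within one minute.
object BackoffSketch {
  def backoffMillis(attempt: Int,
                    baseMillis: Long = 1000L,
                    maxMillis: Long = 60000L): Long =
    math.min(maxMillis, baseMillis * (1L << math.min(attempt, 16)))

  def main(args: Array[String]): Unit =
    (0 to 8).foreach { attempt =>
      println(s"attempt $attempt -> wait ${backoffMillis(attempt)} ms")
    }
}
{code}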



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18142) Spark Master tries to launch workers 145 times within 1 minute

2016-10-27 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-18142:

Component/s: Spark Core

> Spark Master tries to launch workers 145 times within 1 minute
> --
>
> Key: SPARK-18142
> URL: https://issues.apache.org/jira/browse/SPARK-18142
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Burak Yavuz
>
> I observed a case where an instance running a worker was killed. The Spark 
> Master tried to launch new executors at that instance, even though the 
> instance didn't exist anymore and failed 145 times within 1 minute, and then 
> killed the application.
> The instance takes ~10 minutes to be replaced. The master should at least 
> have an exponential backoff mechanism when performing these retries so that 
> it gives the infrastructure time to recover.
> {code}
> 16/10/27 17:31:18 INFO Master: Removing executor app-20161027124929-/3 
> because it is EXITED
> 16/10/27 17:31:18 INFO Master: Launching executor app-20161027124929-/4 
> on worker worker-20161027124917-10.0.43.232-60886
> 16/10/27 17:31:18 WARN Master: Got status update for unknown executor 
> app-20161027124929-/3
> 16/10/27 17:31:18 INFO Master: Removing executor app-20161027124929-/4 
> because it is FAILED
> 16/10/27 17:31:18 INFO Master: Launching executor app-20161027124929-/5 
> on worker worker-20161027124917-10.0.43.232-60886
> 16/10/27 17:31:18 INFO Master: Removing executor app-20161027124929-/5 
> because it is FAILED
> ...
> 16/10/27 17:31:37 INFO Master: 10.0.70.32:32829 got disassociated, removing 
> it.
> 16/10/27 17:31:37 INFO Master: 10.0.70.32:40523 got disassociated, removing 
> it.
> 16/10/27 17:31:37 INFO Master: Removing worker 
> worker-20161027124917-10.0.70.32-40523 on 10.0.70.32:40523
> 16/10/27 17:31:37 INFO Master: Telling app of lost executor: 147
> 16/10/27 17:32:30 INFO Master: Removing executor app-20161027124929-/0 
> because it is FAILED
> 16/10/27 17:32:30 ERROR Master: Application  with ID 
> app-20161027124929- failed 145 times; removing it
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17219) QuantileDiscretizer should handle NaN values gracefully

2016-10-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-17219.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15428
[https://github.com/apache/spark/pull/15428]

> QuantileDiscretizer should handle NaN values gracefully
> ---
>
> Key: SPARK-17219
> URL: https://issues.apache.org/jira/browse/SPARK-17219
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Barry Becker
>Assignee: Vincent
> Fix For: 2.1.0
>
>
> How is the QuantileDiscretizer supposed to handle null values?
> Actual nulls are not allowed, so I replace them with Double.NaN.
> However, when you try to run the QuantileDiscretizer on a column that 
> contains NaNs, it will create (possibly more than one) NaN split(s) before 
> the final PositiveInfinity value.
> I am using the attached titanic csv data and trying to bin the "age" column 
> using the QuantileDiscretizer with 10 bins specified. The age column has a 
> lot of null values.
> These are the splits that I get:
> {code}
> -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity
> {code}
> Is that expected? It seems to imply that NaN is larger than any positive 
> number and less than infinity.
> I'm not sure of the best way to handle nulls, but I think they need a bucket 
> all their own. My suggestion would be to include an initial NaN split value 
> that is always there, just like the sentinel Infinities are. If that were the 
> case, then the splits for the example above might look like this:
> {code}
> NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity
> {code}
> This does not seem great either because a bucket that is [NaN, -Inf] doesn't 
> make much sense. Not sure if the NaN bucket counts toward numBins or not. I 
> do think it should always be there though in case future data has null even 
> though the fit data did not. Thoughts?
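
Not the fix from PR 15428; just a minimal workaround sketch against the Spark 
2.x ML API: fit the splits only on non-NaN rows so no NaN split points appear, 
and route missing values to a dedicated bucket separately. The column and app 
names are made up.

{code}
import org.apache.spark.ml.feature.QuantileDiscretizer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, isnan}

object NaNSafeDiscretizer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("nan-safe-discretizer").master("local[*]").getOrCreate()
    import spark.implicits._

    // "age" stands in for the titanic column above; NaN marks missing values.
    val df = Seq(15.0, 20.5, Double.NaN, 28.0, Double.NaN, 48.0).toDF("age")

    // Fit the splits only on rows with real values so no NaN splits appear.
    val clean = df.filter(!isnan(col("age")))
    val bucketizer = new QuantileDiscretizer()
      .setInputCol("age")
      .setOutputCol("ageBucket")
      .setNumBuckets(3)
      .fit(clean)

    // Rows that were NaN can then be assigned a dedicated "missing" bucket.
    bucketizer.transform(clean).show()
    spark.stop()
  }
}
{code}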



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18141) jdbc datasource read fails when quoted columns (eg:mixed case, reserved words) in source table are used in the filter.

2016-10-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18141:


Assignee: (was: Apache Spark)

> jdbc datasource read fails when  quoted  columns (eg:mixed case, reserved 
> words) in source table are used  in the filter.
> -
>
> Key: SPARK-18141
> URL: https://issues.apache.org/jira/browse/SPARK-18141
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Suresh Thalamati
>
> create table t1("Name" text, "Id" integer)
> insert into t1 values('Mike', 1)
> val df = sqlContext.read.jdbc(jdbcUrl, "t1", new Properties)

> df.filter("Id = 1").show()

> df.filter("`Id` = 1").show()
> Error :
> Cause: org.postgresql.util.PSQLException: ERROR: column "id" does not exist
>   Position: 35
>   at 
> org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2182)
>   at 
> org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1911)
>   at 
> org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:173)
>   at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:622)
>   at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:472)
>   at org.postgresql.jdbc.PgStatement.executeQuery(PgStatement.java:386)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:295)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> I am working on fix for this issue, will submit PR soon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18141) jdbc datasource read fails when quoted columns (eg:mixed case, reserved words) in source table are used in the filter.

2016-10-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18141:


Assignee: Apache Spark

> jdbc datasource read fails when  quoted  columns (eg:mixed case, reserved 
> words) in source table are used  in the filter.
> -
>
> Key: SPARK-18141
> URL: https://issues.apache.org/jira/browse/SPARK-18141
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Suresh Thalamati
>Assignee: Apache Spark
>
> create table t1("Name" text, "Id" integer)
> insert into t1 values('Mike', 1)
> val df = sqlContext.read.jdbc(jdbcUrl, "t1", new Properties)

> df.filter("Id = 1").show()

> df.filter("`Id` = 1").show()
> Error :
> Cause: org.postgresql.util.PSQLException: ERROR: column "id" does not exist
>   Position: 35
>   at 
> org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2182)
>   at 
> org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1911)
>   at 
> org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:173)
>   at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:622)
>   at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:472)
>   at org.postgresql.jdbc.PgStatement.executeQuery(PgStatement.java:386)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:295)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> I am working on fix for this issue, will submit PR soon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18141) jdbc datasource read fails when quoted columns (eg:mixed case, reserved words) in source table are used in the filter.

2016-10-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15612762#comment-15612762
 ] 

Apache Spark commented on SPARK-18141:
--

User 'sureshthalamati' has created a pull request for this issue:
https://github.com/apache/spark/pull/15662

> jdbc datasource read fails when  quoted  columns (eg:mixed case, reserved 
> words) in source table are used  in the filter.
> -
>
> Key: SPARK-18141
> URL: https://issues.apache.org/jira/browse/SPARK-18141
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Suresh Thalamati
>
> create table t1("Name" text, "Id" integer)
> insert into t1 values('Mike', 1)
> val df = sqlContext.read.jdbc(jdbcUrl, "t1", new Properties)

> df.filter("Id = 1").show()

> df.filter("`Id` = 1").show()
> Error :
> Cause: org.postgresql.util.PSQLException: ERROR: column "id" does not exist
>   Position: 35
>   at 
> org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2182)
>   at 
> org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1911)
>   at 
> org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:173)
>   at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:622)
>   at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:472)
>   at org.postgresql.jdbc.PgStatement.executeQuery(PgStatement.java:386)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:295)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> I am working on fix for this issue, will submit PR soon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data

2016-10-27 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15612686#comment-15612686
 ] 

Davies Liu commented on SPARK-18105:


It turned out that the suspected bug in LZ4 was a false alarm, so I closed the 
upstream issue.

I can't reproduce the behavior now.

> LZ4 failed to decompress a stream of shuffled data
> --
>
> Key: SPARK-18105
> URL: https://issues.apache.org/jira/browse/SPARK-18105
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Blocker
>
> When lz4 is used to compress the shuffle files, decompression may fail with 
> "Stream is corrupted":
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 92 in stage 5.0 failed 4 times, most recent failure: Lost task 92.3 in 
> stage 5.0 (TID 16616, 10.0.27.18): java.io.IOException: Stream is corrupted
>   at 
> org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:220)
>   at 
> org.apache.spark.io.LZ4BlockInputStream.available(LZ4BlockInputStream.java:109)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:353)
>   at java.io.DataInputStream.read(DataInputStream.java:149)
>   at com.google.common.io.ByteStreams.read(ByteStreams.java:828)
>   at com.google.common.io.ByteStreams.readFully(ByteStreams.java:695)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
>   at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
>   at 
> org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:397)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> https://github.com/jpountz/lz4-java/issues/89



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data

2016-10-27 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-18105:
---
Priority: Major  (was: Blocker)

> LZ4 failed to decompress a stream of shuffled data
> --
>
> Key: SPARK-18105
> URL: https://issues.apache.org/jira/browse/SPARK-18105
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> When lz4 is used to compress the shuffle files, decompression may fail with 
> "Stream is corrupted":
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 92 in stage 5.0 failed 4 times, most recent failure: Lost task 92.3 in 
> stage 5.0 (TID 16616, 10.0.27.18): java.io.IOException: Stream is corrupted
>   at 
> org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:220)
>   at 
> org.apache.spark.io.LZ4BlockInputStream.available(LZ4BlockInputStream.java:109)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:353)
>   at java.io.DataInputStream.read(DataInputStream.java:149)
>   at com.google.common.io.ByteStreams.read(ByteStreams.java:828)
>   at com.google.common.io.ByteStreams.readFully(ByteStreams.java:695)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
>   at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
>   at 
> org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:397)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> https://github.com/jpountz/lz4-java/issues/89



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18141) jdbc datasource read fails when quoted columns (eg:mixed case, reserved words) in source table are used in the filter.

2016-10-27 Thread Suresh Thalamati (JIRA)
Suresh Thalamati created SPARK-18141:


 Summary: jdbc datasource read fails when  quoted  columns 
(eg:mixed case, reserved words) in source table are used  in the filter.
 Key: SPARK-18141
 URL: https://issues.apache.org/jira/browse/SPARK-18141
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1, 2.0.0
Reporter: Suresh Thalamati


create table t1("Name" text, "Id" integer)
insert into t1 values('Mike', 1)

val df = sqlContext.read.jdbc(jdbcUrl, "t1", new Properties)

df.filter("Id = 1").show()

df.filter("`Id` = 1").show()


Error :
Cause: org.postgresql.util.PSQLException: ERROR: column "id" does not exist
  Position: 35
  at 
org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2182)
  at 
org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1911)
  at 
org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:173)
  at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:622)
  at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:472)
  at org.postgresql.jdbc.PgStatement.executeQuery(PgStatement.java:386)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:295)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

I am working on a fix for this issue and will submit a PR soon.
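
The pending PR may well take a different route; the sketch below only 
illustrates that Spark already has dialect-aware quoting (JdbcDialects / 
quoteIdentifier) that a pushed-down filter could reuse. The helper and object 
names are hypothetical.

{code}
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Hypothetical rendering of a pushed-down equality filter with dialect-aware
// quoting, so a mixed-case column such as "Id" survives on PostgreSQL.
object QuotedFilterSketch {
  def renderEqualTo(column: String, value: Any, url: String): String = {
    val dialect: JdbcDialect = JdbcDialects.get(url)
    s"${dialect.quoteIdentifier(column)} = $value"
  }

  def main(args: Array[String]): Unit =
    // Prints: "Id" = 1 (the default dialect quotes with double quotes).
    println(renderEqualTo("Id", 1, "jdbc:postgresql://localhost/test"))
}
{code}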



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module

2016-10-27 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15612663#comment-15612663
 ] 

Cody Koeninger commented on SPARK-17935:


So the main thing to point out is that Kafka producers currently aren't 
idempotent, so this sink can't be fault-tolerant.

Regarding the design doc, a couple of comments:

- KafkaSinkRDD: why is this necessary? It seems like KafkaSink should do 
basically the same thing as the existing ForeachSink class.

- CachedKafkaProducer: why is this necessary? A singleton producer per JVM is 
generally what the Kafka docs recommend.
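
A minimal sketch of the singleton-producer approach described above, assuming 
the Structured Streaming ForeachWriter API and the plain kafka-clients 
producer; the broker address, serializers, and class names are placeholders.

{code}
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.ForeachWriter

// One producer per JVM, shared by every writer instance on that executor.
object JvmKafkaProducer {
  lazy val producer: KafkaProducer[String, String] = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // placeholder address
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    new KafkaProducer[String, String](props)
  }
}

class KafkaForeachWriterSketch(topic: String) extends ForeachWriter[String] {
  override def open(partitionId: Long, version: Long): Boolean = true

  override def process(value: String): Unit =
    // Plain producers are not idempotent, so this sink is at-least-once at best.
    JvmKafkaProducer.producer.send(new ProducerRecord[String, String](topic, value))

  override def close(errorOrNull: Throwable): Unit = ()
}
{code}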




> Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
> --
>
> Key: SPARK-17935
> URL: https://issues.apache.org/jira/browse/SPARK-17935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: zhangxinyu
>
> Spark already supports a Kafka input stream. It would be useful to add a 
> `KafkaForeachWriter` that outputs results to Kafka in the Structured 
> Streaming module.
> `KafkaForeachWriter.scala` would be placed in external kafka-0.8.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16963) Change Source API so that sources do not need to keep unbounded state

2016-10-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15612609#comment-15612609
 ] 

Apache Spark commented on SPARK-16963:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/15661

> Change Source API so that sources do not need to keep unbounded state
> -
>
> Key: SPARK-16963
> URL: https://issues.apache.org/jira/browse/SPARK-16963
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Frederick Reiss
>Assignee: Frederick Reiss
> Fix For: 2.0.3, 2.1.0
>
>
> The version of the Source API in Spark 2.0.0 defines a single getBatch() 
> method for fetching records from the source, with the following Scaladoc 
> comments defining the semantics:
> {noformat}
> /**
>  * Returns the data that is between the offsets (`start`, `end`]. When 
> `start` is `None` then
>  * the batch should begin with the first available record. This method must 
> always return the
>  * same data for a particular `start` and `end` pair.
>  */
> def getBatch(start: Option[Offset], end: Offset): DataFrame
> {noformat}
> These semantics mean that a Source must retain all past history for the 
> stream that it backs. Further, a Source is also required to retain this data 
> across restarts of the process where the Source is instantiated, even when 
> the Source is restarted on a different machine.
> These restrictions make it difficult to implement the Source API, as any 
> implementation requires potentially unbounded amounts of distributed storage.
> See the mailing list thread at 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/Source-API-requires-unbounded-distributed-storage-td18551.html]
>  for more information.
> This JIRA will cover augmenting the Source API with an additional callback 
> that will allow Structured Streaming scheduler to notify the source when it 
> is safe to discard buffered data.
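
A self-contained sketch of the shape such a callback could take; the trait and 
method names here are hypothetical and the batch type is simplified, so this is 
not Spark's actual Source API.

{code}
// Hypothetical offset marker; Spark's real Offset type lives in the
// execution.streaming package.
trait OffsetSketch { def json: String }

trait BufferedSourceSketch {
  /** Returns the data between (start, end], as in the existing getBatch contract. */
  def getBatch(start: Option[OffsetSketch], end: OffsetSketch): Seq[String]

  /**
   * Proposed callback: the scheduler tells the source that everything up to
   * and including `end` has been durably processed, so buffered data for
   * offsets at or below `end` may be discarded.
   */
  def commit(end: OffsetSketch): Unit
}
{code}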



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17813) Maximum data per trigger

2016-10-27 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-17813.
--
   Resolution: Fixed
 Assignee: Cody Koeninger
Fix Version/s: 2.1.0
   2.0.3

> Maximum data per trigger
> 
>
> Key: SPARK-17813
> URL: https://issues.apache.org/jira/browse/SPARK-17813
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Cody Koeninger
> Fix For: 2.0.3, 2.1.0
>
>
> At any given point in a streaming query execution, we process all available 
> data.  This maximizes throughput at the cost of latency.  We should add 
> something similar to the {{maxFilesPerTrigger}} option available for files.
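
For reference, a sketch of how the two knobs fit together, assuming the 
spark-sql-kafka-0-10 source is on the classpath; the paths, topic, and broker 
address are placeholders, and maxOffsetsPerTrigger is the option name this work 
appears to introduce for Kafka.

{code}
import org.apache.spark.sql.SparkSession

object RateLimitedStreams {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rate-limit-sketch").master("local[*]").getOrCreate()

    // Existing file-source knob mentioned above.
    val files = spark.readStream
      .format("text")
      .option("maxFilesPerTrigger", 10L)   // at most 10 new files per batch
      .load("/tmp/input")                  // placeholder path

    // Kafka analogue (requires the Kafka source package on the classpath).
    val kafka = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // placeholder address
      .option("subscribe", "events")                       // placeholder topic
      .option("maxOffsetsPerTrigger", 10000L)               // cap records per trigger
      .load()

    spark.stop()
  }
}
{code}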



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16078) from_utc_timestamp/to_utc_timestamp may give different result in different timezone

2016-10-27 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-16078:
---
Fix Version/s: 1.6.3

> from_utc_timestamp/to_utc_timestamp may give different result in different 
> timezone
> ---
>
> Key: SPARK-16078
> URL: https://issues.apache.org/jira/browse/SPARK-16078
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 1.6.3, 2.0.0
>
>
> from_utc_timestamp/to_utc_timestamp should return a deterministic result in 
> any (system default) timezone.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18085) Better History Server scalability for many / large applications

2016-10-27 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-18085:
---
Summary: Better History Server scalability for many / large applications  
(was: Scalability enhancements for the History Server)

> Better History Server scalability for many / large applications
> ---
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17822) JVMObjectTracker.objMap may leak JVM objects

2016-10-27 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-17822:
-
Target Version/s: 2.0.3, 2.1.0  (was: 2.0.2, 2.1.0)

> JVMObjectTracker.objMap may leak JVM objects
> 
>
> Key: SPARK-17822
> URL: https://issues.apache.org/jira/browse/SPARK-17822
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Yin Huai
>
> Seems it is pretty easy to remove objects from JVMObjectTracker.objMap. So, 
> seems it makes sense to use weak reference (like persistentRdds in 
> SparkContext). 
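
A sketch of the weak-reference idea, assuming Guava's MapMaker is available on 
the classpath; the object name and key type are illustrative.

{code}
import java.util.concurrent.ConcurrentMap
import com.google.common.collect.MapMaker // assumes Guava on the classpath

// Values that are no longer referenced elsewhere become eligible for GC and
// silently drop out of the map, so entries cannot accumulate forever.
object WeakValueTracker {
  private val objMap: ConcurrentMap[String, Object] =
    new MapMaker().weakValues().makeMap[String, Object]()

  def put(id: String, obj: Object): Unit = objMap.put(id, obj)
  def get(id: String): Option[Object] = Option(objMap.get(id))
}
{code}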



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17823) Make JVMObjectTracker.objMap thread-safe

2016-10-27 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-17823:
-
Target Version/s: 2.0.3, 2.1.0  (was: 2.0.2, 2.1.0)

> Make JVMObjectTracker.objMap thread-safe
> 
>
> Key: SPARK-17823
> URL: https://issues.apache.org/jira/browse/SPARK-17823
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Yin Huai
>
> Since JVMObjectTracker.objMap is a global map, it makes sense to make it 
> thread safe.
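
A minimal sketch of what thread-safety could look like here: a 
ConcurrentHashMap plus an atomic counter for handing out ids. Names are 
illustrative, not the actual SparkR tracker.

{code}
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicLong

object ThreadSafeObjectTracker {
  private val counter = new AtomicLong(0L)
  private val objMap = new ConcurrentHashMap[String, Object]()

  // Registers an object and returns an id that can be handed to callers.
  def track(obj: Object): String = {
    val id = "obj_" + counter.incrementAndGet()
    objMap.put(id, obj)
    id
  }

  def release(id: String): Unit = objMap.remove(id)
}
{code}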



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18123) org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable the case senstivity issue

2016-10-27 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15612167#comment-15612167
 ] 

Dongjoon Hyun commented on SPARK-18123:
---

Thank you. I'll make a PR to fix this.

> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable  the case 
> senstivity issue
> --
>
> Key: SPARK-18123
> URL: https://issues.apache.org/jira/browse/SPARK-18123
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Paul Wu
>
> Blindly quoting every field name for inserting is the issue (Line 110-119, 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala).
> /**
>* Returns a PreparedStatement that inserts a row into table via conn.
>*/
>   def insertStatement(conn: Connection, table: String, rddSchema: StructType, 
> dialect: JdbcDialect)
>   : PreparedStatement = {
> val columns = rddSchema.fields.map(x => 
> dialect.quoteIdentifier(x.name)).mkString(",")
> val placeholders = rddSchema.fields.map(_ => "?").mkString(",")
> val sql = s"INSERT INTO $table ($columns) VALUES ($placeholders)"
> conn.prepareStatement(sql)
>   }
> This code causes the following issue (it does not happen in 1.6.x):
> I have an issue with the saveTable method in Spark 2.0/2.0.1. I tried to save 
> a dataset to an Oracle database, but the field names must be uppercase for it 
> to succeed. This is not expected behavior: if field names are quoted (rather 
> than only the table name), this utility needs to handle case sensitivity. 
> The code below throws the exception: Caused by: 
> java.sql.SQLSyntaxErrorException: ORA-00904: "DATETIME_gmt": invalid 
> identifier.
> String detailSQL ="select CAST('2016-09-25 17:00:00' AS TIMESTAMP) 
> DATETIME_gmt, '1' NODEB";
> hc.sql("set spark.sql.caseSensitive=false");
> Dataset ds = hc.sql(detailSQL);
> ds.show();
> 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable(ds, url, 
> detailTable, p);
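
One possible direction, sketched only for illustration (the real fix may 
differ): quote a column name only when it actually needs quoting, instead of 
quoting every field unconditionally. The helper name and the identifier rule 
are assumptions; real dialects differ on which unquoted names are safe.

{code}
import org.apache.spark.sql.jdbc.JdbcDialect

object ColumnQuotingSketch {
  // Very rough notion of a "plain" identifier; databases disagree on details.
  private val plainIdentifier = "[A-Za-z_][A-Za-z0-9_]*".r

  def maybeQuote(name: String, dialect: JdbcDialect): String =
    if (plainIdentifier.pattern.matcher(name).matches()) name
    else dialect.quoteIdentifier(name)
}
{code}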



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18134) SQL: MapType in Group BY and Joins not working

2016-10-27 Thread Christian Zorneck (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15612136#comment-15612136
 ] 

Christian Zorneck commented on SPARK-18134:
---

Spark was compatible in this case up to Spark 1.4. How did this work in those 
versions?
Maps can always be compared, at least if they are serialized first.
I think the user should be able to decide to use maps and accept the 
performance drawback, because the feature is needed. The user could be warned 
about the performance cost, but the feature is important to have.
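
As a workaround sketch under the current restriction (Spark 2.x assumed, names 
made up): serialize the map to a canonical, key-sorted string and group on that 
string instead of the map itself.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object MapGroupByWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("map-groupby-workaround").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(
      (Map("a" -> 1, "b" -> 2), 10),
      (Map("b" -> 2, "a" -> 1), 20)
    ).toDF("m", "v")

    // Serialize the map into a canonical, key-sorted string for grouping.
    val mapKey = udf((m: scala.collection.Map[String, Int]) =>
      m.toSeq.sortBy(_._1).map { case (k, v) => s"$k=$v" }.mkString(","))

    // Both rows land in the same group, keyed by "a=1,b=2".
    df.groupBy(mapKey($"m").as("mapKey")).sum("v").show()

    spark.stop()
  }
}
{code}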

> SQL: MapType in Group BY and Joins not working
> --
>
> Key: SPARK-18134
> URL: https://issues.apache.org/jira/browse/SPARK-18134
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 2.0.0, 2.0.1
>Reporter: Christian Zorneck
>
> Since version 1.5 and issue SPARK-9415, MapTypes can no longer be used in 
> GROUP BY and join clauses. A Hive feature was thereby removed from Spark, 
> which makes Spark incompatible with various HiveQL statements.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18085) Scalability enhancements for the History Server

2016-10-27 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15612086#comment-15612086
 ] 

Thomas Graves commented on SPARK-18085:
---

Perhaps we can clarify the title of this JIRA to something like "History Server 
cache storage scalability enhancements" if it's not going to address all of the 
scalability issues, like app listing, streaming, etc.

> Scalability enhancements for the History Server
> ---
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18107) Insert overwrite statement runs much slower in spark-sql than it does in hive-client

2016-10-27 Thread J.P Feng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15612047#comment-15612047
 ] 

J.P Feng commented on SPARK-18107:
--

Here are the execution logs of Hive 1.2.1 [insert into]:

0: jdbc:hive2://master.mydata.com:23250> insert into table login4game 
partition(pt='mix_en',dt='2016-10-21')select distinct 
account_name,role_id,server,'1476979200' as recdate, 'mix' as platform, 'mix' 
as pid, 'mix' as dev from tbllog_login  where pt='mix_en' and  dt='2016-10-21' ;
INFO  : Number of reduce tasks not specified. Estimated from input data size: 1
INFO  : In order to change the average load for a reducer (in bytes):
INFO  :   set hive.exec.reducers.bytes.per.reducer=
INFO  : In order to limit the maximum number of reducers:
INFO  :   set hive.exec.reducers.max=
INFO  : In order to set a constant number of reducers:
INFO  :   set mapreduce.job.reduces=
INFO  : number of splits:3
INFO  : Submitting tokens for job: job_1472611548204_72608
INFO  : The url to track the job: 
http://master.mydata.com:9378/proxy/application_1472611548204_72608/
INFO  : Starting Job = job_1472611548204_72608, Tracking URL = 
http://master.mydata.com:9378/proxy/application_1472611548204_72608/
INFO  : Kill Command = /usr/local/hadoop/bin/hadoop job  -kill 
job_1472611548204_72608
INFO  : Hadoop job information for Stage-1: number of mappers: 3; number of 
reducers: 1
INFO  : 2016-10-27 21:51:37,717 Stage-1 map = 0%,  reduce = 0%
INFO  : 2016-10-27 21:51:46,455 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 
3.17 sec
INFO  : 2016-10-27 21:51:48,576 Stage-1 map = 100%,  reduce = 0%, Cumulative 
CPU 16.16 sec
INFO  : 2016-10-27 21:51:56,945 Stage-1 map = 100%,  reduce = 100%, Cumulative 
CPU 22.7 sec
INFO  : MapReduce Total cumulative CPU time: 22 seconds 700 msec
INFO  : Ended Job = job_1472611548204_72608
INFO  : Loading data to table my_log.login4game partition (pt=mix_en, 
dt=2016-10-21) from 
hdfs://master.mydata.com:45660/data/warehouse/staging/.hive-staging_hive_2016-10-27_21-51-26_264_2085348807080462789-1/-ext-1
INFO  : Partition my_log.login4game{pt=mix_en, dt=2016-10-21} stats: 
[numFiles=2, numRows=132276, totalSize=971551, rawDataSize=82804776]

No rows affected (32.183 seconds)


> Insert overwrite statement runs much slower in spark-sql than it does in 
> hive-client
> 
>
> Key: SPARK-18107
> URL: https://issues.apache.org/jira/browse/SPARK-18107
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: spark 2.0.0
> hive 2.0.1
>Reporter: J.P Feng
>
> I find insert overwrite statement running in spark-sql or spark-shell spends 
> much more time than it does in  hive-client (i start it in 
> apache-hive-2.0.1-bin/bin/hive ), where spark costs about ten minutes but 
> hive-client just costs less than 20 seconds.
> These are the steps I took.
> Test sql is :
> insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21')
> select distinct account_name,role_id,server,'1476979200' as recdate, 'mix' as 
> platform, 'mix' as pid, 'mix' as dev from tbllog_login  where pt='mix_en' and 
>  dt='2016-10-21' ;
> there are 257128 lines of data in tbllog_login with 
> partition(pt='mix_en',dt='2016-10-21')
> ps:
> I'm sure it must be "insert overwrite" costing a lot of time in spark, may be 
> when doing overwrite, it need to spend a lot of time in io or in something 
> else.
> I also compare the executing time between insert overwrite statement and 
> insert into statement.
> 1. insert overwrite statement and insert into statement in spark:
> insert overwrite statement costs about 10 minutes
> insert into statement costs about 30 seconds
> 2. insert into statement in spark and insert into statement in hive-client:
> spark costs about 30 seconds
> hive-client costs about 20 seconds
> the difference is little that we can ignore
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18107) Insert overwrite statement runs much slower in spark-sql than it does in hive-client

2016-10-27 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15612044#comment-15612044
 ] 

Liang-Chi Hsieh commented on SPARK-18107:
-

Looks like HIVE-11940 largely improves insert overwrite performance. I have the 
patch ready and will submit a PR tomorrow. It would be good if you could then 
test it to see whether it improves the performance.

> Insert overwrite statement runs much slower in spark-sql than it does in 
> hive-client
> 
>
> Key: SPARK-18107
> URL: https://issues.apache.org/jira/browse/SPARK-18107
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: spark 2.0.0
> hive 2.0.1
>Reporter: J.P Feng
>
> I find insert overwrite statement running in spark-sql or spark-shell spends 
> much more time than it does in  hive-client (i start it in 
> apache-hive-2.0.1-bin/bin/hive ), where spark costs about ten minutes but 
> hive-client just costs less than 20 seconds.
> These are the steps I took.
> Test sql is :
> insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21')
> select distinct account_name,role_id,server,'1476979200' as recdate, 'mix' as 
> platform, 'mix' as pid, 'mix' as dev from tbllog_login  where pt='mix_en' and 
>  dt='2016-10-21' ;
> there are 257128 lines of data in tbllog_login with 
> partition(pt='mix_en',dt='2016-10-21')
> ps:
> I'm sure it must be "insert overwrite" costing a lot of time in spark, may be 
> when doing overwrite, it need to spend a lot of time in io or in something 
> else.
> I also compare the executing time between insert overwrite statement and 
> insert into statement.
> 1. insert overwrite statement and insert into statement in spark:
> insert overwrite statement costs about 10 minutes
> insert into statement costs about 30 seconds
> 2. insert into statement in spark and insert into statement in hive-client:
> spark costs about 30 seconds
> hive-client costs about 20 seconds
> the difference is little that we can ignore
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18107) Insert overwrite statement runs much slower in spark-sql than it does in hive-client

2016-10-27 Thread J.P Feng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15612050#comment-15612050
 ] 

J.P Feng commented on SPARK-18107:
--

Here are the execution logs of Hive 2.0.1 [insert overwrite]:

0: jdbc:hive2://master.mydata.com:23250> insert overwrite table login4game 
partition(pt='mix_en',dt='2016-10-21')select distinct 
account_name,role_id,server,'1476979200' as recdate, 'mix' as platform, 'mix' 
as pid, 'mix' as dev from tbllog_login  where pt='mix_en' and  dt='2016-10-21' ;
INFO  : Compiling 
command(queryId=hadoop_20161027215659_8a71044b-364c-4e41-a168-10213add0d5b): 
insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21')
select distinct account_name,role_id,server,'1476979200' as recdate, 'mix' as 
platform, 'mix' as pid, 'mix' as dev from tbllog_login  where pt='mix_en' and  
dt='2016-10-21'
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:_col0, 
type:string, comment:null), FieldSchema(name:_col1, type:string, comment:null), 
FieldSchema(name:_col2, type:string, comment:null), FieldSchema(name:_col3, 
type:string, comment:null), FieldSchema(name:_col4, type:string, comment:null), 
FieldSchema(name:_col5, type:string, comment:null), FieldSchema(name:_col6, 
type:string, comment:null)], properties:null)
INFO  : Completed compiling 
command(queryId=hadoop_20161027215659_8a71044b-364c-4e41-a168-10213add0d5b); 
Time taken: 1.142 seconds
DEBUG : Encoding valid txns info 
109561:56926:56928:56930:56932:56934:56936:56938:56940:56942:56944:56946:56948:56950:56952:56954:56956:56958:56960:56962:56964:56966:56968:56970:56972:56978:56980:56982:56984:57076:57077:57078:57079:57080:57081:57096:57102:57106:57119:57121:57123:57124:57125:57126:57127:57128:57129:57130:57131:57132:57133:57134:57135:57136:57137:57138:57139:57140:57141:80435:80465:80715:80785:93705
 txnid:0
INFO  : Executing 
command(queryId=hadoop_20161027215659_8a71044b-364c-4e41-a168-10213add0d5b): 
insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21')
select distinct account_name,role_id,server,'1476979200' as recdate, 'mix' as 
platform, 'mix' as pid, 'mix' as dev from tbllog_login  where pt='mix_en' and  
dt='2016-10-21'
WARN  : Hive-on-MR is deprecated in Hive 2 and may not be available in the 
future versions. Consider using a different execution engine (i.e. tez, spark) 
or using Hive 1.X releases.
INFO  : WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in 
the future versions. Consider using a different execution engine (i.e. tez, 
spark) or using Hive 1.X releases.
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the 
future versions. Consider using a different execution engine (i.e. tez, spark) 
or using Hive 1.X releases.
INFO  : Query ID = hadoop_20161027215659_8a71044b-364c-4e41-a168-10213add0d5b
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : Number of reduce tasks not specified. Estimated from input data size: 1
INFO  : In order to change the average load for a reducer (in bytes):
INFO  :   set hive.exec.reducers.bytes.per.reducer=
INFO  : In order to limit the maximum number of reducers:
INFO  :   set hive.exec.reducers.max=
INFO  : In order to set a constant number of reducers:
INFO  :   set mapreduce.job.reduces=
DEBUG : Configuring job job_1472611548204_72609 with 
/tmp/hadoop-yarn/staging/hadoop/.staging/job_1472611548204_72609 as the submit 
dir
DEBUG : adding the following namenodes' delegation 
tokens:[hdfs://master.mydata.com:45660]
DEBUG : Creating splits at 
hdfs://master.mydata.com:45660/tmp/hadoop-yarn/staging/hadoop/.staging/job_1472611548204_72609
INFO  : number of splits:3
INFO  : Submitting tokens for job: job_1472611548204_72609
INFO  : The url to track the job: 
http://master.mydata.com:9378/proxy/application_1472611548204_72609/
INFO  : Starting Job = job_1472611548204_72609, Tracking URL = 
http://master.mydata.com:9378/proxy/application_1472611548204_72609/
INFO  : Kill Command = /usr/local/hadoop/bin/hadoop job  -kill 
job_1472611548204_72609
INFO  : Hadoop job information for Stage-1: number of mappers: 3; number of 
reducers: 1
INFO  : 2016-10-27 21:57:14,183 Stage-1 map = 0%,  reduce = 0%
INFO  : 2016-10-27 21:57:22,745 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 
3.19 sec
INFO  : 2016-10-27 21:57:24,910 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU 
11.16 sec
INFO  : 2016-10-27 21:57:25,991 Stage-1 map = 100%,  reduce = 0%, Cumulative 
CPU 19.65 sec
INFO  : 2016-10-27 21:57:32,455 Stage-1 map = 100%,  reduce = 100%, Cumulative 
CPU 26.72 sec
INFO  : MapReduce Total cumulative CPU time: 26 seconds 720 msec
INFO  : Ended Job = job_1472611548204_72609
INFO  : Starting task [Stage-7:CONDITIONAL] in serial mode
INFO  : Stage-4 is selected by condition resolver.
INFO  : Stage-3 is filtered out by 

[jira] [Commented] (SPARK-18107) Insert overwrite statement runs much slower in spark-sql than it does in hive-client

2016-10-27 Thread J.P Feng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15612043#comment-15612043
 ] 

J.P Feng commented on SPARK-18107:
--

Here are the execution logs of Hive 1.2.1 [insert overwrite]:


0: jdbc:hive2://master.mydata.com:23250> insert overwrite table login4game 
partition(pt='mix_en',dt='2016-10-21')select distinct 
account_name,role_id,server,'1476979200' as recdate, 'mix' as platform, 'mix' 
as pid, 'mix' as dev from tbllog_login  where pt='mix_en' and  dt='2016-10-21' ;
INFO  : Number of reduce tasks not specified. Estimated from input data size: 1
INFO  : In order to change the average load for a reducer (in bytes):
INFO  :   set hive.exec.reducers.bytes.per.reducer=
INFO  : In order to limit the maximum number of reducers:
INFO  :   set hive.exec.reducers.max=
INFO  : In order to set a constant number of reducers:
INFO  :   set mapreduce.job.reduces=
INFO  : number of splits:3
INFO  : Submitting tokens for job: job_1472611548204_72607
INFO  : The url to track the job: 
http://master.mydata.com:9378/proxy/application_1472611548204_72607/
INFO  : Starting Job = job_1472611548204_72607, Tracking URL = 
http://master.mydata.com:9378/proxy/application_1472611548204_72607/
INFO  : Kill Command = /usr/local/hadoop/bin/hadoop job  -kill 
job_1472611548204_72607
INFO  : Hadoop job information for Stage-1: number of mappers: 3; number of 
reducers: 1
INFO  : 2016-10-27 21:40:59,879 Stage-1 map = 0%,  reduce = 0%
INFO  : 2016-10-27 21:41:08,391 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 
2.82 sec
INFO  : 2016-10-27 21:41:09,452 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU 
9.79 sec
INFO  : 2016-10-27 21:41:11,610 Stage-1 map = 100%,  reduce = 0%, Cumulative 
CPU 17.62 sec
INFO  : 2016-10-27 21:41:21,171 Stage-1 map = 100%,  reduce = 100%, Cumulative 
CPU 25.56 sec
INFO  : MapReduce Total cumulative CPU time: 25 seconds 560 msec
INFO  : Ended Job = job_1472611548204_72607
INFO  : Loading data to table my_log.login4game partition (pt=mix_en, 
dt=2016-10-21) from 
hdfs://master.mydata.com:45660/data/warehouse/staging/.hive-staging_hive_2016-10-27_21-40-48_927_5041661215303190236-1/-ext-1
INFO  : Partition my_log.login4game{pt=mix_en, dt=2016-10-21} stats: 
[numFiles=1, numRows=66138, totalSize=485769, rawDataSize=41402388]
No rows affected (520.037 seconds)

> Insert overwrite statement runs much slower in spark-sql than it does in 
> hive-client
> 
>
> Key: SPARK-18107
> URL: https://issues.apache.org/jira/browse/SPARK-18107
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: spark 2.0.0
> hive 2.0.1
>Reporter: J.P Feng
>
> I find insert overwrite statement running in spark-sql or spark-shell spends 
> much more time than it does in  hive-client (i start it in 
> apache-hive-2.0.1-bin/bin/hive ), where spark costs about ten minutes but 
> hive-client just costs less than 20 seconds.
> These are the steps I took.
> Test sql is :
> insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21')
> select distinct account_name,role_id,server,'1476979200' as recdate, 'mix' as 
> platform, 'mix' as pid, 'mix' as dev from tbllog_login  where pt='mix_en' and 
>  dt='2016-10-21' ;
> there are 257128 lines of data in tbllog_login with 
> partition(pt='mix_en',dt='2016-10-21')
> ps:
> I'm sure it must be "insert overwrite" costing a lot of time in spark, may be 
> when doing overwrite, it need to spend a lot of time in io or in something 
> else.
> I also compare the executing time between insert overwrite statement and 
> insert into statement.
> 1. insert overwrite statement and insert into statement in spark:
> insert overwrite statement costs about 10 minutes
> insert into statement costs about 30 seconds
> 2. insert into statement in spark and insert into statement in hive-client:
> spark costs about 30 seconds
> hive-client costs about 20 seconds
> the difference is little that we can ignore
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18107) Insert overwrite statement runs much slower in spark-sql than it does in hive-client

2016-10-27 Thread J.P Feng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15612033#comment-15612033
 ] 

J.P Feng commented on SPARK-18107:
--

Thanks for your reply. I have compared the performance of Hive 2.0.1 and 
Hive 1.2.1, and the results suggest you may be right.

Version 1.2.1 was downloaded from 
http://mirror.bit.edu.cn/apache/hive/hive-1.2.1/; 
the patch [HIVE-11940] is only available in Hive 2.0.0 and later.

I ran the same SQL on Hive 1.2.1 and Hive 2.0.1 with the same data.

On Hive 1.2.1 the insert overwrite statement takes 520.037 seconds, which is 
similar to what Spark does (built against Hive 1.2.1), while the insert into 
statement takes 32.183 seconds.

On Hive 2.0.1 the insert overwrite statement takes 35.975 seconds.

I will paste the execution logs in a new comment to show the runs on 
Hive 1.2.1 and Hive 2.0.1.
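
For anyone who wants to reproduce this comparison from spark-shell, here is a 
minimal timing sketch; it assumes a SparkSession named spark and the 
login4game / tbllog_login tables from the report, and the timed helper is only 
illustrative:

{code}
// Rough wall-clock timing of the two statements (a sketch, not a benchmark harness).
def timed[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"$label took ${(System.nanoTime() - start) / 1e9} s")
  result
}

val selectPart =
  """select distinct account_name, role_id, server, '1476979200' as recdate,
    |       'mix' as platform, 'mix' as pid, 'mix' as dev
    |  from tbllog_login where pt = 'mix_en' and dt = '2016-10-21'""".stripMargin

timed("insert overwrite") {
  spark.sql(s"insert overwrite table login4game partition(pt='mix_en', dt='2016-10-21') $selectPart")
}
timed("insert into") {
  spark.sql(s"insert into table login4game partition(pt='mix_en', dt='2016-10-21') $selectPart")
}
{code}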








[jira] [Commented] (SPARK-16857) CrossValidator and KMeans throws IllegalArgumentException

2016-10-27 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15611926#comment-15611926
 ] 

Benjamin Fradet commented on SPARK-16857:
-

I was wondering why a KMeansEvaluator computing the WSSSE (within-set sum of 
squared errors) hasn't been implemented yet.

Any ideas why not?
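
For what it's worth, the WSSSE can already be computed from a fitted model via 
KMeansModel.computeCost, so a manual model-selection loop is possible today. A 
minimal sketch, assuming a DataFrame named trainingData with an indexedFeatures 
column as in the report below:

{code}
// Manual k selection by WSSSE, outside of CrossValidator (a sketch only).
import org.apache.spark.ml.clustering.KMeans

val costs = Seq(2, 3, 4, 5).map { k =>
  val model = new KMeans()
    .setK(k)
    .setFeaturesCol("indexedFeatures")
    .fit(trainingData)
  k -> model.computeCost(trainingData)  // within-set sum of squared errors
}
costs.foreach { case (k, wssse) => println(s"k=$k WSSSE=$wssse") }
{code}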

> CrossValidator and KMeans throws IllegalArgumentException
> -
>
> Key: SPARK-16857
> URL: https://issues.apache.org/jira/browse/SPARK-16857
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.1
> Environment: spark-jobserver docker image.  Spark 1.6.1 on ubuntu, 
> Hadoop 2.4
>Reporter: Ryan Claussen
>
> I am attempting to use CrossValidator to train a KMeans model. When I attempt 
> to fit the data, Spark throws the IllegalArgumentException below, since the 
> KMeans algorithm writes an Integer to the prediction column instead of a 
> Double. Before I go too far: is using CrossValidator with KMeans supported?
> Here's the exception:
> {quote}
> java.lang.IllegalArgumentException: requirement failed: Column prediction 
> must be of type DoubleType but was actually IntegerType.
>  at scala.Predef$.require(Predef.scala:233)
>  at 
> org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
>  at 
> org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator.evaluate(MulticlassClassificationEvaluator.scala:74)
>  at 
> org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:109)
>  at 
> org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:99)
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>  at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:99)
>  at 
> com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.generateKMeans(SparkModelJob.scala:202)
>  at 
> com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:62)
>  at 
> com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:39)
>  at 
> spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$4.apply(JobManagerActor.scala:301)
>  at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>  at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  at java.lang.Thread.run(Thread.java:745)
> {quote}
> Here is the code I'm using to set up my cross validator. As the stack trace 
> above indicates, it fails at the fit step:
> {quote}
> ...
> val mpc = new KMeans().setK(2).setFeaturesCol("indexedFeatures")
> val labelConverter = new 
> IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)
> val pipeline = new Pipeline().setStages(Array(labelIndexer, 
> featureIndexer, mpc, labelConverter))
> val evaluator = new 
> MulticlassClassificationEvaluator().setLabelCol("approvedIndex").setPredictionCol("prediction")
> val paramGrid = new ParamGridBuilder().addGrid(mpc.maxIter, Array(100, 
> 200, 500)).build()
> val cv = new 
> CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(3)
> val cvModel = cv.fit(trainingData)
> {quote}
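
One possible workaround for the type mismatch, until a proper clustering 
evaluator exists, is to cast the prediction column inside the pipeline and 
point the evaluator at the cast column. A sketch of that idea follows; the 
SQLTransformer stage and the prediction_d column name are hypothetical 
additions, the other objects (labelIndexer, featureIndexer, labelConverter, 
trainingData) are the reporter's, and whether a multiclass metric is meaningful 
for cluster IDs is a separate question:

{code}
// Cast KMeans' integer predictions to double inside the pipeline so the
// evaluator's column-type check passes (a sketch, not a recommended metric).
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.SQLTransformer
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val mpc = new KMeans().setK(2).setFeaturesCol("indexedFeatures")
val castPrediction = new SQLTransformer()
  .setStatement("SELECT *, CAST(prediction AS DOUBLE) AS prediction_d FROM __THIS__")
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("approvedIndex")
  .setPredictionCol("prediction_d")

val pipeline = new Pipeline().setStages(
  Array(labelIndexer, featureIndexer, mpc, castPrediction, labelConverter))
val paramGrid = new ParamGridBuilder().addGrid(mpc.maxIter, Array(100, 200, 500)).build()
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)
val cvModel = cv.fit(trainingData)
{code}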






[jira] [Updated] (SPARK-18128) Add support for publishing to PyPI

2016-10-27 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-18128:

Description: 
After SPARK-1267 is done we should add support for publishing to PyPI similar 
to how we publish to maven central.

Note: one of the open questions is what to do about the package name, since 
someone has registered the name PySpark on PyPI - we could use ApachePySpark, 
or we could try to find who registered PySpark and ask them to transfer it to 
us (since they haven't published anything, that may be fine).

  was:After SPARK-1267 is done we should add support for publishing to PyPI 
similar to how we publish to maven central.


> Add support for publishing to PyPI
> --
>
> Key: SPARK-18128
> URL: https://issues.apache.org/jira/browse/SPARK-18128
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
>
> After SPARK-1267 is done we should add support for publishing to PyPI similar 
> to how we publish to maven central.
> Note: one of the open questions is what to do about the package name, since 
> someone has registered the name PySpark on PyPI - we could use ApachePySpark, 
> or we could try to find who registered PySpark and ask them to transfer it to 
> us (since they haven't published anything, that may be fine).






[jira] [Updated] (SPARK-18137) RewriteDistinctAggregates UnresolvedException when a UDAF has a foldable TypeCheck

2016-10-27 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-18137:
--
Description: 
When running a SQL query with distinct aggregates (on the Spark GitHub master 
branch), it throws an UnresolvedException.

For example, running a test case on Spark (branch master) with the SQL:
{noformat}
SELECT percentile_approx(key, 0.9), count(distinct key),sum(distinct key) 
FROM src LIMIT 1
{noformat}
it throws this exception:
{noformat}
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
dataType on unresolved object, tree: 'percentile_approx(CAST(src.`key` AS 
DOUBLE), CAST(0.9BD AS DOUBLE), 1)
at 
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:92)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$nullify(RewriteDistinctAggregates.scala:261)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$evalWithinGroup$1(RewriteDistinctAggregates.scala:136)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:187)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:180)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.rewrite(RewriteDistinctAggregates.scala:180)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:105)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:104)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
...
{noformat}

[jira] [Commented] (SPARK-18134) SQL: MapType in Group BY and Joins not working

2016-10-27 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15611712#comment-15611712
 ] 

Herman van Hovell commented on SPARK-18134:
---

Maps are not comparable, which makes them unusable as join and grouping keys. 
We generally try to be as compatible with Hive as we can, but in this case I am 
not sure that we should.
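
In the meantime, a user-side workaround is to derive a comparable key from the 
map and group or join on that instead. A minimal sketch; the DataFrame df, its 
map column m, and the map_key column are hypothetical, and string keys and 
values are assumed:

{code}
// Render the map as a canonical, sorted string and group on that string
// instead of on the map itself (a sketch of a workaround, not a fix).
import org.apache.spark.sql.functions.udf

val mapAsKey = udf { m: Map[String, String] =>
  Option(m).getOrElse(Map.empty[String, String])
    .toSeq.sortBy(_._1)
    .map { case (k, v) => s"$k=$v" }
    .mkString(",")
}

df.withColumn("map_key", mapAsKey(df("m")))
  .groupBy("map_key")
  .count()
{code}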

> SQL: MapType in Group BY and Joins not working
> --
>
> Key: SPARK-18134
> URL: https://issues.apache.org/jira/browse/SPARK-18134
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 2.0.0, 2.0.1
>Reporter: Christian Zorneck
>
> Since version 1.5 and issue SPARK-9415, MapTypes can no longer be used in 
> GROUP BY and join clauses. This is incompatible with HiveQL: a Hive feature 
> was removed from Spark, which makes Spark incompatible with various HiveQL 
> statements.






[jira] [Commented] (SPARK-18140) Parquet NPE / Update to 1.9

2016-10-27 Thread dori waldman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15611691#comment-15611691
 ] 

dori waldman commented on SPARK-18140:
--

Is there any suggestion for how to solve this issue now? I can't really wait 
for the next Spark version; this issue is in our production environment.

Should I recompile Spark, or is there any other way to swap in the Parquet jars?

> Parquet NPE / Update to 1.9
> ---
>
> Key: SPARK-18140
> URL: https://issues.apache.org/jira/browse/SPARK-18140
> Project: Spark
>  Issue Type: Bug
>Reporter: dori waldman
>
> We are using Spark to write Parquet files and are getting a 
> NullPointerException caused by this issue: 
> https://issues.apache.org/jira/browse/PARQUET-544
> Trying to switch to the new, fixed Parquet version (1.9.0) did not work. 
> In submit.sh we added the following parameters:
> --jars "../lib/parquet-hadoop-1.9.0.jar"
> --conf "spark.driver.userClassPathFirst=true"
> --conf "spark.executor.userClassPathFirst=true"
> Any advice? This is in our production environment.
> Thanks






[jira] [Commented] (SPARK-18139) Dataset mapGroups with return type Seq[Product] produces scala.ScalaReflectionException: object $line262.$read not found

2016-10-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15611674#comment-15611674
 ] 

Sean Owen commented on SPARK-18139:
---

I'm pretty sure this is just another instance of "case classes don't quite work 
with the shell in Spark"; see the related JIRAs.
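
For reference, the usual workaround for these REPL reflection failures is to 
define the case class and the job in compiled code rather than in the shell. A 
minimal sketch, assuming Spark 2.0's SparkSession; the object and method names 
are only illustrative:

{code}
import org.apache.spark.sql.SparkSession

// A top-level, compiled case class avoids the $line... REPL wrapper classes
// that the encoder lookup trips over in the shell.
case class A(b: Int, c: Int)

object MapGroupsRepro {
  def run(spark: SparkSession): Unit = {
    import spark.implicits._
    val ds = spark.createDataset(Seq(A(1, 2), A(2, 2)))
    // Aggregating to Dataset[Seq[A]] works when A is a compiled class.
    val grouped = ds.groupByKey(_.b).mapGroups { case (_, rows) => rows.toSeq }
    grouped.show()
  }
}
{code}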

> Dataset mapGroups with return type Seq[Product] produces 
> scala.ScalaReflectionException: object $line262.$read not found
> ---
>
> Key: SPARK-18139
> URL: https://issues.apache.org/jira/browse/SPARK-18139
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Zach Kull
>
> mapGroups fails on a Dataset if the return type is a plain Seq[Product]. It 
> succeeds if the return type is more complex, like Seq[(Int, Product)]. See the 
> following code sample:
> {code}
> case class A(b:Int, c:Int)
> // Sample Dataset[A]
> val ds = ss.createDataset(Seq(A(1,2),A(2,2)))
> // The following aggregation should produce a Dataset[Seq[A]], but FAILS with 
> scala.ScalaReflectionException: object $line262.$read not found.
> val ds2 = ds.groupByKey(_.b).mapGroups{ case (g,i) => (i.toSeq) }
> // Produces Dataset[(Int, Seq[A])] -> OK
> val ds1 = ds.groupByKey(_.b).mapGroups{ case (g,i) => (g,i.toSeq) }
> // reproducible when trying to manually create the following Encoder
> val e = newProductSeqEncoder[A]
> {code}
> Full Exception:
> scala.ScalaReflectionException: object $line262.$read not found.
>   at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:162)
>   at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:22)
>   at $typecreator4$1.apply(:116)
>   at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
>   at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
>   at 
> org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125)
>   at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
>   at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49)
>   at 
> org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125)
>   ... 75 elided





