[jira] [Commented] (SPARK-6319) Should throw analysis exception when using binary type in groupby/join

2016-01-31 Thread Low Chin Wei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15125710#comment-15125710
 ] 

Low Chin Wei commented on SPARK-6319:
-------------------------------------

Hi guys,

Is there any resolution for grouping by a binary column in Spark?

> Should throw analysis exception when using binary type in groupby/join
> ----------------------------------------------------------------------
>
> Key: SPARK-6319
> URL: https://issues.apache.org/jira/browse/SPARK-6319
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0
>Reporter: Cheng Lian
>Assignee: Liang-Chi Hsieh
>Priority: Critical
> Fix For: 1.5.0
>
>
> Spark shell session for reproduction:
> {noformat}
> scala> import sqlContext.implicits._
> scala> import org.apache.spark.sql.types._
> scala> Seq(1, 1, 2, 2).map(i => Tuple1(i.toString)).toDF("c").select($"c" 
> cast BinaryType).distinct.show()
> ...
> CAST(c, BinaryType)
> [B@43f13160
> [B@5018b648
> [B@3be22500
> [B@476fc8a1
> {noformat}
> Spark SQL uses plain byte arrays to represent binary values. However, arrays 
> are compared by reference rather than by value. Meanwhile, the DISTINCT 
> operator uses a {{HashSet}} and its {{.contains}} method to check for 
> duplicate values. These two facts together cause the problem.
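The reference-equality behaviour described above can be reproduced with plain Scala collections, independent of Spark. This is a minimal sketch, not Spark's code:

```scala
import scala.collection.mutable

object ByteArrayEquality {
  def main(args: Array[String]): Unit = {
    // Two byte arrays with identical contents are still distinct objects,
    // so == (reference equality for Array) and hashCode disagree with
    // element-wise equality.
    val a: Array[Byte] = "1".getBytes("UTF-8")
    val b: Array[Byte] = "1".getBytes("UTF-8")

    println(a == b)            // false: compared by reference
    println(a.sameElements(b)) // true: compared element by element

    // A HashSet keyed on such arrays keeps both copies, which is why
    // DISTINCT over a binary column produced duplicate rows.
    val set = mutable.HashSet(a, b)
    println(set.size)          // 2, not 1
  }
}
```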



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2016-10-14 Thread Low Chin Wei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15574502#comment-15574502
 ] 

Low Chin Wei commented on SPARK-13747:
--------------------------------------

I encountered this in 2.0.1. Is there any workaround? For example, would 
having a separate SparkSession per job help?

> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> ----------------------------------------------------------------
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Shixiong Zhu
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> Running the following code may fail:
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a 
> ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global), as 
> it calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, the ForkJoinPool will run another 
> task in the same thread; however, that thread's local properties have 
> already been polluted by the suspended job.
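The failure mode can be sketched without Spark. The names below are illustrative, not Spark's actual implementation; they only mirror the guard in SQLExecution.withNewExecutionId, which throws once an execution id is already present on the current thread:

```scala
object ExecutionIdSketch {
  // Illustrative stand-in for the spark.sql.execution.id local property.
  private val executionId = new ThreadLocal[String]

  // Mirrors the guard: refuse to run if the current thread already
  // carries an execution id.
  def withNewExecutionId[T](id: String)(body: => T): T = {
    if (executionId.get() != null)
      throw new IllegalArgumentException("spark.sql.execution.id is already set")
    executionId.set(id)
    try body finally executionId.remove()
  }

  def main(args: Array[String]): Unit = {
    // On a ForkJoinPool, task B can be scheduled onto task A's thread while
    // A is suspended in Await.ready, so B observes A's execution id. Nesting
    // the two calls on one thread reproduces that interleaving
    // deterministically:
    val failed =
      try {
        withNewExecutionId("job-A") {
          withNewExecutionId("job-B") { false }
        }
      } catch {
        case _: IllegalArgumentException => true
      }
    println(failed) // true: the second job sees the first job's id
  }
}
```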






[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2016-10-16 Thread Low Chin Wei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15580989#comment-15580989
 ] 

Low Chin Wei commented on SPARK-13747:
--------------------------------------

java.lang.IllegalArgumentException: spark.sql.execution.id is already set
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:81)
 ~[spark-sql_2.11-2.0.1.jar:2.0.1]
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2546) 
~[spark-sql_2.11-2.0.1.jar:2.0.1]
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2192)
 ~[spark-sql_2.11-2.0.1.jar:2.0.1]
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2199)
 ~[spark-sql_2.11-2.0.1.jar:2.0.1]
at 
org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2227) 
~[spark-sql_2.11-2.0.1.jar:2.0.1]
at 
org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2226) 
~[spark-sql_2.11-2.0.1.jar:2.0.1]
at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2559) 
~[spark-sql_2.11-2.0.1.jar:2.0.1]
at org.apache.spark.sql.Dataset.count(Dataset.scala:2226) 
~[spark-sql_2.11-2.0.1.jar:2.0.1]








[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2016-10-16 Thread Low Chin Wei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581113#comment-15581113
 ] 

Low Chin Wei commented on SPARK-13747:
--------------------------------------

It is running on Akka with a fork-join dispatcher. There are two actors 
running concurrently, each doing a different Spark job with the same 
SparkSession. I can't give the full stack trace, but here is the outline:

java.lang.IllegalArgumentException: spark.sql.execution.id is already set
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:81)
 ~[spark-sql_2.11-2.0.1.jar:2.0.1]
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2546) 
~[spark-sql_2.11-2.0.1.jar:2.0.1]
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2192)
 ~[spark-sql_2.11-2.0.1.jar:2.0.1]
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2199)
 ~[spark-sql_2.11-2.0.1.jar:2.0.1]
at 
org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2227) 
~[spark-sql_2.11-2.0.1.jar:2.0.1]
at 
org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2226) 
~[spark-sql_2.11-2.0.1.jar:2.0.1]
at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2559) 
~[spark-sql_2.11-2.0.1.jar:2.0.1]
at org.apache.spark.sql.Dataset.count(Dataset.scala:2226) 
~[spark-sql_2.11-2.0.1.jar:2.0.1]

<-- Here is the code that calls df.count -->

at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) 
[akka-actor_2.11-2.4.8.jar:na]
at akka.actor.ActorCell.invoke(ActorCell.scala:495) 
[akka-actor_2.11-2.4.8.jar:na]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) 
[akka-actor_2.11-2.4.8.jar:na]
at akka.dispatch.Mailbox.run(Mailbox.scala:224) 
[akka-actor_2.11-2.4.8.jar:na]
at akka.dispatch.Mailbox.exec(Mailbox.scala:234) 
[akka-actor_2.11-2.4.8.jar:na]
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) 
[scala-library-2.11.8.jar:na]
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 [scala-library-2.11.8.jar:na]
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) 
[scala-library-2.11.8.jar:na]
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 [scala-library-2.11.8.jar:na]






