[jira] [Resolved] (SPARK-20567) Failure to bind when using explode and collect_set in streaming

2017-05-02 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-20567.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

> Failure to bind when using explode and collect_set in streaming
> ---
>
> Key: SPARK-20567
> URL: https://issues.apache.org/jira/browse/SPARK-20567
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Critical
> Fix For: 2.2.0
>
>
> Here is a small test case:
> {code}
>   test("count distinct") {
> val inputData = MemoryStream[(Int, Seq[Int])]
> val aggregated =
>   inputData.toDF()
> .select($"*", explode($"_2") as 'value)
> .groupBy($"_1")
> .agg(size(collect_set($"value")))
> .as[(Int, Int)]
> testStream(aggregated, Update)(
>   AddData(inputData, (1, Seq(1, 2))),
>   CheckLastBatch((1, 2))
> )
>   }
> {code}






[jira] [Created] (SPARK-20573) --packages fails when transitive dependency can only be resolved from a repository specified in the POM's <repositories> tag

2017-05-02 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-20573:
--

 Summary: --packages fails when transitive dependency can only be 
resolved from a repository specified in the POM's <repositories> tag
 Key: SPARK-20573
 URL: https://issues.apache.org/jira/browse/SPARK-20573
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.1.0, 2.0.0
Reporter: Josh Rosen


With a clean Ivy cache, run the following command:

{code}
./bin/spark-shell --packages com.twitter.elephantbird:elephant-bird-core:4.4
{code}

This will fail with {{unresolved dependency: 
com.hadoop.gplcompression#hadoop-lzo;0.4.16: not found}}.

 If you look at the elephant-bird-core POM (at 
http://central.maven.org/maven2/com/twitter/elephantbird/elephant-bird-core/4.4/elephant-bird-core-4.4.pom)
 you'll see a direct dependency on hadoop-lzo. This library is only present in 
Twitter's public Maven repository, hosted at http://maven.twttr.com. The 
elephant-bird-core POM does not directly declare Twitter's external repository. 
Instead, that external repository is inherited from elephant-bird-core's parent 
POM (at 
http://central.maven.org/maven2/com/twitter/elephantbird/elephant-bird/4.4/elephant-bird-4.4.pom).

From the Ivy output it looks like it didn't even attempt to resolve from the 
Twitter repo:

{code}
:: problems summary ::
 WARNINGS
module not found: com.hadoop.gplcompression#hadoop-lzo;0.4.16

 local-m2-cache: tried

  
file:/Users/joshrosen/.m2/repository/com/hadoop/gplcompression/hadoop-lzo/0.4.16/hadoop-lzo-0.4.16.pom

  -- artifact 
com.hadoop.gplcompression#hadoop-lzo;0.4.16!hadoop-lzo.jar:

  
file:/Users/joshrosen/.m2/repository/com/hadoop/gplcompression/hadoop-lzo/0.4.16/hadoop-lzo-0.4.16.jar

 local-ivy-cache: tried

  
/Users/joshrosen/.ivy2/local/com.hadoop.gplcompression/hadoop-lzo/0.4.16/ivys/ivy.xml

  -- artifact 
com.hadoop.gplcompression#hadoop-lzo;0.4.16!hadoop-lzo.jar:

  
/Users/joshrosen/.ivy2/local/com.hadoop.gplcompression/hadoop-lzo/0.4.16/jars/hadoop-lzo.jar

 central: tried

  
https://repo1.maven.org/maven2/com/hadoop/gplcompression/hadoop-lzo/0.4.16/hadoop-lzo-0.4.16.pom

  -- artifact 
com.hadoop.gplcompression#hadoop-lzo;0.4.16!hadoop-lzo.jar:

  
https://repo1.maven.org/maven2/com/hadoop/gplcompression/hadoop-lzo/0.4.16/hadoop-lzo-0.4.16.jar

 spark-packages: tried

  
http://dl.bintray.com/spark-packages/maven/com/hadoop/gplcompression/hadoop-lzo/0.4.16/hadoop-lzo-0.4.16.pom

  -- artifact 
com.hadoop.gplcompression#hadoop-lzo;0.4.16!hadoop-lzo.jar:

  
http://dl.bintray.com/spark-packages/maven/com/hadoop/gplcompression/hadoop-lzo/0.4.16/hadoop-lzo-0.4.16.jar

::

::  UNRESOLVED DEPENDENCIES ::

::

:: com.hadoop.gplcompression#hadoop-lzo;0.4.16: not found

::
{code}

If you manually specify the Twitter repository as an additional external 
repository then everything works fine.
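For example, this is the workaround form of the command above, using the existing 
{{--repositories}} option of {{spark-shell}}/{{spark-submit}}:

{code}
./bin/spark-shell \
  --packages com.twitter.elephantbird:elephant-bird-core:4.4 \
  --repositories http://maven.twttr.com
{code}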

This is a somewhat frustrating behavior from an end-user's point of view 
because unless they dig through the POMs themselves it is not obvious why 
things are broken or how to fix them. When Maven resolves this coordinate it 
properly fetches the transitive dependencies from the additional repositories 
specified in the referencing POMs. My hunch is that this behavior is caused by 
either a bug in Ivy itself or a bug in Spark's usage / configuration of the 
embedded Ivy resolver.

It would be great to see if we can find other test-cases to narrow down the 
scope of the bug. I'm wondering whether POM-specified repositories will work if 
they're specified in the POM of the top-level dependency being resolved. It 
would also be useful to determine whether Ivy handles additional repositories 
in the top-level of transitive dependencies' POMs: maybe the problem is the 
specific combination of transitive dep + repository inherited from that dep's 
parent POM.






[jira] [Commented] (SPARK-4836) Web UI should display separate information for all stage attempts

2017-05-02 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15994213#comment-15994213
 ] 

Josh Rosen commented on SPARK-4836:
---

[~ckadner], I'm pretty sure that this is still a problem.

Regarding reproductions, you should be able to reproduce this by triggering a 
fetch failure: run a shuffle stage, then delete some random portion of the 
shuffle outputs while the reduce stage is running. The reduce stage should fail 
and re-run the previous map stage, leading to a second stage attempt.
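A rough spark-shell sketch of that reproduction (a hedged example, assuming local 
or standalone mode where the shuffle files under {{spark.local.dir}} / the 
blockmgr-* directories can be deleted by hand; the directory layout is an 
implementation detail):

{code}
// Run the shuffle once so its map output is registered, then delete a few of
// the shuffle files on disk and run the action again: the missing blocks cause
// a FetchFailed, the map stage is resubmitted, and the stage now has a second
// attempt for the Web UI to (mis)display.
val counts = sc.parallelize(1 to 10000000, 200)
  .map(i => (i % 1000, 1L))
  .reduceByKey(_ + _)

counts.count()   // first run writes the shuffle output
// ... manually delete some of the shuffle files here ...
counts.count()   // second run hits FetchFailed and retries the map stage
{code}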

> Web UI should display separate information for all stage attempts
> -
>
> Key: SPARK-4836
> URL: https://issues.apache.org/jira/browse/SPARK-4836
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.1.1, 1.2.0
>Reporter: Josh Rosen
>
> I've run into some cases where the web UI job page will say that a job took 
> 12 minutes but the sum of that job's stage times is something like 10 
> seconds.  In this case, it turns out that my job ran a stage to completion 
> (which took, say, 5 minutes) then lost some partitions of that stage and had 
> to run a new stage attempt to recompute one or two tasks from that stage.  As 
> a result, the latest attempt for that stage reports only one or two tasks.  
> In the web UI, it seems that we only show the latest stage attempt, not all 
> attempts, which can lead to confusing / misleading displays for jobs with 
> failed / partially-recomputed stages.






[jira] [Created] (SPARK-20572) Spark Streaming fails to read file on HDFS

2017-05-02 Thread LvDongrong (JIRA)
LvDongrong created SPARK-20572:
--

 Summary: Spark Streaming fails to read file on HDFS
 Key: SPARK-20572
 URL: https://issues.apache.org/jira/browse/SPARK-20572
 Project: Spark
  Issue Type: Bug
  Components: DStreams
Affects Versions: 2.1.0
Reporter: LvDongrong


When I move a file on HDFS into the target directory that Spark Streaming reads 
from, Spark Streaming fails to read the file. I found that the Spark Streaming 
HDFS file interface only reads files whose modification time falls within the 
interval of the current batch, but when we move a file into the target 
directory, its modification time does not change. Maybe the method used to find 
new files is not appropriate.
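A hedged workaround sketch (not a change to the DStream file-selection logic, and 
the path below is only illustrative): since the file stream picks up files by 
modification time, one can bump the moved file's modification time so that it 
falls inside the current batch window. This assumes the HDFS configuration is on 
the classpath:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// After moving/renaming a file into the monitored directory, refresh its
// modification time. setTimes(path, mtime, atime); atime = -1 leaves the
// access time unchanged.
val fs = FileSystem.get(new Configuration())
fs.setTimes(new Path("/streaming/input/part-00000"), System.currentTimeMillis(), -1L)
{code}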






[jira] [Commented] (SPARK-12009) Avoid re-allocate yarn container while driver want to stop all Executors

2017-05-02 Thread Ethan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15994199#comment-15994199
 ] 

Ethan Xu commented on SPARK-12009:
--

I'm getting a similar error message with Spark 2.1.0, and I can't reproduce it 
reliably. The exact same code worked fine on a small sample RDD, but sometimes 
gave this error on a large RDD after hours of running. It's very frustrating.

> Avoid re-allocate yarn container while driver want to stop all Executors
> 
>
> Key: SPARK-12009
> URL: https://issues.apache.org/jira/browse/SPARK-12009
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.2
>Reporter: SuYan
>Assignee: SuYan
>Priority: Minor
> Fix For: 2.0.0
>
>
> Log based 1.4.0
> 2015-11-26,03:05:16,176 WARN 
> org.spark-project.jetty.util.thread.QueuedThreadPool: 8 threads could not be 
> stopped
> 2015-11-26,03:05:16,177 INFO org.apache.spark.ui.SparkUI: Stopped Spark web 
> UI at http://
> 2015-11-26,03:05:16,401 INFO org.apache.spark.scheduler.DAGScheduler: 
> Stopping DAGScheduler
> 2015-11-26,03:05:16,450 INFO 
> org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend: Shutting down 
> all executors
> 2015-11-26,03:05:16,525 INFO 
> org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend: Asking each 
> executor to shut down
> 2015-11-26,03:05:16,791 INFO 
> org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint: Driver terminated 
> or disconnected! Shutting down. XX.XX.XX.XX:38734
> 2015-11-26,03:05:16,847 ERROR org.apache.spark.scheduler.LiveListenerBus: 
> SparkListenerBus has already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(164,WrappedArray())
> 2015-11-26,03:05:27,242 INFO org.apache.spark.deploy.yarn.YarnAllocator: Will 
> request 13 executor containers, each with 1 cores and 4608 MB memory 
> including 1024 MB overhead






[jira] [Comment Edited] (SPARK-12216) Spark failed to delete temp directory

2017-05-02 Thread Supriya Pasham (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15994192#comment-15994192
 ] 

Supriya Pasham edited comment on SPARK-12216 at 5/3/17 2:46 AM:


Hi Team,

I am executing 'spark-submit' with a jar and a properties file as follows:

  -> spark-submit --class package.classname --master local[*] \Spark.jar data.properties

When I run the above command, 2-3 exceptions are immediately displayed in the 
command prompt with the exception details below.

I have seen that this issue is marked as resolved, but I did not find the actual 
resolution. Please let me know if there is a solution to this issue.

ERROR ShutdownHookManager: Exception while deleting Spark temp dir: 
C:\Users\user1\AppData\Local\Temp\spark-5e37d680-2e9f-4aed-ac59-2f24d8387855
java.io.IOException: Failed to delete: 
C:\Users\user1\AppData\Local\Temp\spark-5e37d680-2e9f-4aed-ac59-2f24d8387855

Environment details: I am running the commands on a Windows 7 machine.
Request you to provide a solution asap.


was (Author: supriya):
Hi Team,

I am executing 'spark-submit' with a jar and a properties file as follows:

  -> spark-submit --class package.classname --master local[*] \Spark.jar data.properties

When I run the above command, 2-3 exceptions are immediately displayed in the 
command prompt with the exception details below.

I have seen that this issue is marked as resolved, but I did not find the actual 
resolution. Please let me know if there is a solution to this issue.

ERROR ShutdownHookManager: Exception while deleting Spark temp dir: 
C:\Users\user1\AppData\Local\Temp\spark-5e37d680-2e9f-4aed-ac59-2f24d8387855
java.io.IOException: Failed to delete: 
C:\Users\user1\AppData\Local\Temp\spark-5e37d680-2e9f-4aed-ac59-2f24d8387855

Request you to provide a solution asap.

> Spark failed to delete temp directory 
> --
>
> Key: SPARK-12216
> URL: https://issues.apache.org/jira/browse/SPARK-12216
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
> Environment: windows 7 64 bit
> Spark 1.5.2
> Java 1.8.0.65
> PATH includes:
> C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin
> C:\ProgramData\Oracle\Java\javapath
> C:\Users\Stefan\scala\bin
> SYSTEM variables set are:
> JAVA_HOME=C:\Program Files\Java\jre1.8.0_65
> HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin
> (where the bin\winutils resides)
> both \tmp and \tmp\hive have permissions
> drwxrwxrwx as detected by winutils ls
>Reporter: stefan
>Priority: Minor
>
> The mailing list archives have no obvious solution to this:
> scala> :q
> Stopping spark context.
> 15/12/08 16:24:22 ERROR ShutdownHookManager: Exception while deleting Spark 
> temp dir: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> java.io.IOException: Failed to delete: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60)
> at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:60)
> at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
> at scala.util.Try$.apply(Try.scala:161)
> at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
> at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)





[jira] [Commented] (SPARK-12216) Spark failed to delete temp directory

2017-05-02 Thread Supriya Pasham (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15994192#comment-15994192
 ] 

Supriya Pasham commented on SPARK-12216:


Hi Team,

I am executing 'spark-submit' with a jar and a properties file as follows:

  -> spark-submit --class package.classname --master local[*] \Spark.jar data.properties

When I run the above command, 2-3 exceptions are immediately displayed in the 
command prompt with the exception details below.

I have seen that this issue is marked as resolved, but I did not find the actual 
resolution. Please let me know if there is a solution to this issue.

ERROR ShutdownHookManager: Exception while deleting Spark temp dir: 
C:\Users\user1\AppData\Local\Temp\spark-5e37d680-2e9f-4aed-ac59-2f24d8387855
java.io.IOException: Failed to delete: 
C:\Users\user1\AppData\Local\Temp\spark-5e37d680-2e9f-4aed-ac59-2f24d8387855

Request you to provide a solution asap.

> Spark failed to delete temp directory 
> --
>
> Key: SPARK-12216
> URL: https://issues.apache.org/jira/browse/SPARK-12216
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
> Environment: windows 7 64 bit
> Spark 1.5.2
> Java 1.8.0.65
> PATH includes:
> C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin
> C:\ProgramData\Oracle\Java\javapath
> C:\Users\Stefan\scala\bin
> SYSTEM variables set are:
> JAVA_HOME=C:\Program Files\Java\jre1.8.0_65
> HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin
> (where the bin\winutils resides)
> both \tmp and \tmp\hive have permissions
> drwxrwxrwx as detected by winutils ls
>Reporter: stefan
>Priority: Minor
>
> The mailing list archives have no obvious solution to this:
> scala> :q
> Stopping spark context.
> 15/12/08 16:24:22 ERROR ShutdownHookManager: Exception while deleting Spark 
> temp dir: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> java.io.IOException: Failed to delete: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60)
> at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:60)
> at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
> at scala.util.Try$.apply(Try.scala:161)
> at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
> at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)






[jira] [Commented] (SPARK-20571) Flaky SparkR StructuredStreaming tests

2017-05-02 Thread Burak Yavuz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15994185#comment-15994185
 ] 

Burak Yavuz commented on SPARK-20571:
-

cc [~felixcheung]

> Flaky SparkR StructuredStreaming tests
> --
>
> Key: SPARK-20571
> URL: https://issues.apache.org/jira/browse/SPARK-20571
> Project: Spark
>  Issue Type: Test
>  Components: SparkR, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Burak Yavuz
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76399






[jira] [Created] (SPARK-20571) Flaky SparkR StructuredStreaming tests

2017-05-02 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-20571:
---

 Summary: Flaky SparkR StructuredStreaming tests
 Key: SPARK-20571
 URL: https://issues.apache.org/jira/browse/SPARK-20571
 Project: Spark
  Issue Type: Test
  Components: SparkR, Structured Streaming
Affects Versions: 2.2.0
Reporter: Burak Yavuz


https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76399







[jira] [Resolved] (SPARK-20558) clear InheritableThreadLocal variables in SparkContext when stopping it

2017-05-02 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20558.
-
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.1
   2.1.2
   2.0.3

> clear InheritableThreadLocal variables in SparkContext when stopping it
> ---
>
> Key: SPARK-20558
> URL: https://issues.apache.org/jira/browse/SPARK-20558
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2, 2.1.0, 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.3, 2.1.2, 2.2.1, 2.3.0
>
>







[jira] [Created] (SPARK-20570) The main version number on docs/latest/index.html

2017-05-02 Thread liucht-inspur (JIRA)
liucht-inspur created SPARK-20570:
-

 Summary: The main version number on docs/latest/index.html
 Key: SPARK-20570
 URL: https://issues.apache.org/jira/browse/SPARK-20570
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 2.1.1
Reporter: liucht-inspur


On the spark.apache.org home page, when I click the "Latest Release (Spark 
2.1.1)" link under the Documentation menu, the resulting "latest" docs page 
displays a 2.1.0 label in the upper left corner of the page.






[jira] [Created] (SPARK-20569) In spark-sql, some functions can execute successfully when the number of input parameters is wrong

2017-05-02 Thread liuxian (JIRA)
liuxian created SPARK-20569:
---

 Summary: In spark-sql, some functions can execute successfully 
when the number of input parameters is wrong
 Key: SPARK-20569
 URL: https://issues.apache.org/jira/browse/SPARK-20569
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: liuxian
Priority: Trivial


>select  Nvl(null,'1',3);
>3

The "Nvl" function takes only two input parameters, so when three parameters 
are passed I think it should report: "Error in query: Invalid number of 
arguments for function nvl".

Functions such as "nvl2", "nullIf" and "IfNull" have a similar problem.






[jira] [Updated] (SPARK-20568) Delete files after processing

2017-05-02 Thread Saul Shanabrook (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saul Shanabrook updated SPARK-20568:

Description: 
It would be great to be able to delete files after processing them with 
structured streaming.

For example, I am reading in a bunch of JSON files and converting them into 
Parquet. If the JSON files are not deleted after they are processed, it quickly 
fills up my hard drive. I originally [posted this on Stack 
Overflow|http://stackoverflow.com/q/43671757/907060] and was recommended to 
make a feature request for it. 

  was:
It would be great to be able to delete files after processing them with 
structured streaming.

For example, I am reading in a bunch of JSON files and converting them into 
Parquet. If the JSON files are not deleted after they are processed, it quickly 
fills up my hard drive. I originally [posted this on Stack 
Overflow](http://stackoverflow.com/q/43671757/907060) and was recommended to 
make a feature request for it. 


> Delete files after processing
> -
>
> Key: SPARK-20568
> URL: https://issues.apache.org/jira/browse/SPARK-20568
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Saul Shanabrook
>
> It would be great to be able to delete files after processing them with 
> structured streaming.
> For example, I am reading in a bunch of JSON files and converting them into 
> Parquet. If the JSON files are not deleted after they are processed, it 
> quickly fills up my hard drive. I originally [posted this on Stack 
> Overflow|http://stackoverflow.com/q/43671757/907060] and was recommended to 
> make a feature request for it. 






[jira] [Created] (SPARK-20568) Delete files after processing

2017-05-02 Thread Saul Shanabrook (JIRA)
Saul Shanabrook created SPARK-20568:
---

 Summary: Delete files after processing
 Key: SPARK-20568
 URL: https://issues.apache.org/jira/browse/SPARK-20568
 Project: Spark
  Issue Type: New Feature
  Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Saul Shanabrook


It would be great to be able to delete files after processing them with 
structured streaming.

For example, I am reading in a bunch of JSON files and converting them into 
Parquet. If the JSON files are not deleted after they are processed, it quickly 
fills up my hard drive. I originally [posted this on Stack 
Overflow](http://stackoverflow.com/q/43671757/907060) and was recommended to 
make a feature request for it. 






[jira] [Assigned] (SPARK-20567) Failure to bind when using explode and collect_set in streaming

2017-05-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20567:


Assignee: Michael Armbrust  (was: Apache Spark)

> Failure to bind when using explode and collect_set in streaming
> ---
>
> Key: SPARK-20567
> URL: https://issues.apache.org/jira/browse/SPARK-20567
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Critical
>
> Here is a small test case:
> {code}
>   test("count distinct") {
> val inputData = MemoryStream[(Int, Seq[Int])]
> val aggregated =
>   inputData.toDF()
> .select($"*", explode($"_2") as 'value)
> .groupBy($"_1")
> .agg(size(collect_set($"value")))
> .as[(Int, Int)]
> testStream(aggregated, Update)(
>   AddData(inputData, (1, Seq(1, 2))),
>   CheckLastBatch((1, 2))
> )
>   }
> {code}






[jira] [Assigned] (SPARK-20567) Failure to bind when using explode and collect_set in streaming

2017-05-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20567:


Assignee: Apache Spark  (was: Michael Armbrust)

> Failure to bind when using explode and collect_set in streaming
> ---
>
> Key: SPARK-20567
> URL: https://issues.apache.org/jira/browse/SPARK-20567
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Michael Armbrust
>Assignee: Apache Spark
>Priority: Critical
>
> Here is a small test case:
> {code}
>   test("count distinct") {
> val inputData = MemoryStream[(Int, Seq[Int])]
> val aggregated =
>   inputData.toDF()
> .select($"*", explode($"_2") as 'value)
> .groupBy($"_1")
> .agg(size(collect_set($"value")))
> .as[(Int, Int)]
> testStream(aggregated, Update)(
>   AddData(inputData, (1, Seq(1, 2))),
>   CheckLastBatch((1, 2))
> )
>   }
> {code}






[jira] [Commented] (SPARK-20567) Failure to bind when using explode and collect_set in streaming

2017-05-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15994085#comment-15994085
 ] 

Apache Spark commented on SPARK-20567:
--

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/17838

> Failure to bind when using explode and collect_set in streaming
> ---
>
> Key: SPARK-20567
> URL: https://issues.apache.org/jira/browse/SPARK-20567
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Critical
>
> Here is a small test case:
> {code}
>   test("count distinct") {
> val inputData = MemoryStream[(Int, Seq[Int])]
> val aggregated =
>   inputData.toDF()
> .select($"*", explode($"_2") as 'value)
> .groupBy($"_1")
> .agg(size(collect_set($"value")))
> .as[(Int, Int)]
> testStream(aggregated, Update)(
>   AddData(inputData, (1, Seq(1, 2))),
>   CheckLastBatch((1, 2))
> )
>   }
> {code}






[jira] [Created] (SPARK-20567) Failure to bind when using explode and collect_set in streaming

2017-05-02 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-20567:


 Summary: Failure to bind when using explode and collect_set in 
streaming
 Key: SPARK-20567
 URL: https://issues.apache.org/jira/browse/SPARK-20567
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Critical


Here is a small test case:
{code}
  test("count distinct") {
val inputData = MemoryStream[(Int, Seq[Int])]

val aggregated =
  inputData.toDF()
.select($"*", explode($"_2") as 'value)
.groupBy($"_1")
.agg(size(collect_set($"value")))
.as[(Int, Int)]

testStream(aggregated, Update)(
  AddData(inputData, (1, Seq(1, 2))),
  CheckLastBatch((1, 2))
)
  }
{code}






[jira] [Commented] (SPARK-10878) Race condition when resolving Maven coordinates via Ivy

2017-05-02 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15993888#comment-15993888
 ] 

Josh Rosen commented on SPARK-10878:


[~jeeyoungk], my understanding is that there are two possible sources of races 
here:

1. Racing on the module descriptor / POM that we write out. Your suggestion of 
using a dummy string would address that.
2. Racing on writes to other files in the Ivy resolution cache. This is more 
likely to occur when the cache is initially clean and would manifest itself as 
corruption in other files.

Given this, I think the most robust solution is to use separate resolution 
caches since that would solve both problems in one fell swoop. If that's going 
to be hard then the partial fix of randomizing the module descriptor version 
(say, to use a timestamp) isn't a bad idea because it's pretty cheap to 
implement and probably covers the majority of real-world races here.
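As a user-level mitigation sketch in the meantime (assuming the documented 
{{spark.jars.ivy}} setting, which points the embedded Ivy at a different user 
directory), each submission can be given its own resolution cache, e.g. by 
adapting the command from the issue description:

{code}
cat paths | parallel "$SPARK_HOME/bin/spark-submit --conf spark.jars.ivy=/tmp/ivy-{#} foo.jar {}"
{code}

Here {#} is GNU parallel's job sequence number, so no two concurrent submissions 
share an Ivy directory.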

> Race condition when resolving Maven coordinates via Ivy
> ---
>
> Key: SPARK-10878
> URL: https://issues.apache.org/jira/browse/SPARK-10878
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Ryan Williams
>Priority: Minor
>
> I've recently been shell-scripting the creation of many concurrent 
> Spark-on-YARN apps and observing a fraction of them to fail with what I'm 
> guessing is a race condition in their Maven-coordinate resolution.
> For example, I might spawn an app for each path in file {{paths}} with the 
> following shell script:
> {code}
> cat paths | parallel "$SPARK_HOME/bin/spark-submit foo.jar {}"
> {code}
> When doing this, I observe some fraction of the spawned jobs to fail with 
> errors like:
> {code}
> :: retrieving :: org.apache.spark#spark-submit-parent
> confs: [default]
> Exception in thread "main" java.lang.RuntimeException: problem during 
> retrieve of org.apache.spark#spark-submit-parent: java.text.ParseException: 
> failed to parse report: 
> /hpc/users/willir31/.ivy2/cache/org.apache.spark-spark-submit-parent-default.xml:
>  Premature end of file.
> at 
> org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:249)
> at 
> org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:83)
> at org.apache.ivy.Ivy.retrieve(Ivy.java:551)
> at 
> org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1006)
> at 
> org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.text.ParseException: failed to parse report: 
> /hpc/users/willir31/.ivy2/cache/org.apache.spark-spark-submit-parent-default.xml:
>  Premature end of file.
> at 
> org.apache.ivy.plugins.report.XmlReportParser.parse(XmlReportParser.java:293)
> at 
> org.apache.ivy.core.retrieve.RetrieveEngine.determineArtifactsToCopy(RetrieveEngine.java:329)
> at 
> org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:118)
> ... 7 more
> Caused by: org.xml.sax.SAXParseException; Premature end of file.
> at 
> org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown 
> Source)
> at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown 
> Source)
> at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
> {code}
> The more apps I try to launch simultaneously, the greater fraction of them 
> seem to fail with this or similar errors; a batch of ~10 will usually work 
> fine, a batch of 15 will see a few failures, and a batch of ~60 will have 
> dozens of failures.
> [This gist shows 11 recent failures I 
> observed|https://gist.github.com/ryan-williams/648bff70e518de0c7c84].






[jira] [Assigned] (SPARK-20566) ColumnVector should support `appendFloats` for array

2017-05-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20566:


Assignee: (was: Apache Spark)

> ColumnVector should support `appendFloats` for array
> 
>
> Key: SPARK-20566
> URL: https://issues.apache.org/jira/browse/SPARK-20566
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Dongjoon Hyun
>
> This issue aims to add a missing `appendFloats` API for array into 
> ColumnVector class. For double type, there is `appendDoubles` for array 
> [here|https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java#L818-L824].






[jira] [Assigned] (SPARK-20566) ColumnVector should support `appendFloats` for array

2017-05-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20566:


Assignee: Apache Spark

> ColumnVector should support `appendFloats` for array
> 
>
> Key: SPARK-20566
> URL: https://issues.apache.org/jira/browse/SPARK-20566
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>
> This issue aims to add a missing `appendFloats` API for array into 
> ColumnVector class. For double type, there is `appendDoubles` for array 
> [here|https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java#L818-L824].






[jira] [Commented] (SPARK-20566) ColumnVector should support `appendFloats` for array

2017-05-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15993753#comment-15993753
 ] 

Apache Spark commented on SPARK-20566:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/17836

> ColumnVector should support `appendFloats` for array
> 
>
> Key: SPARK-20566
> URL: https://issues.apache.org/jira/browse/SPARK-20566
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Dongjoon Hyun
>
> This issue aims to add a missing `appendFloats` API for array into 
> ColumnVector class. For double type, there is `appendDoubles` for array 
> [here|https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java#L818-L824].






[jira] [Updated] (SPARK-20566) ColumnVector should support `appendFloats` for array

2017-05-02 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-20566:
--
Description: This issue aims to add a missing `appendFloats` API for array 
into ColumnVector class. For double type, there is `appendDoubles` for array 
[here|https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java#L818-L824].
  (was: This PR aims to add a missing `appendFloats` API for array into 
ColumnVector class. For double type, there is `appendDoubles` for array 
[here|https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java#L818-L824].)

> ColumnVector should support `appendFloats` for array
> 
>
> Key: SPARK-20566
> URL: https://issues.apache.org/jira/browse/SPARK-20566
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Dongjoon Hyun
>
> This issue aims to add a missing `appendFloats` API for array into 
> ColumnVector class. For double type, there is `appendDoubles` for array 
> [here|https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java#L818-L824].






[jira] [Commented] (SPARK-20557) JdbcUtils doesn't support java.sql.Types.TIMESTAMP_WITH_TIMEZONE

2017-05-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15993745#comment-15993745
 ] 

Apache Spark commented on SPARK-20557:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/17835

> JdbcUtils doesn't support java.sql.Types.TIMESTAMP_WITH_TIMEZONE
> 
>
> Key: SPARK-20557
> URL: https://issues.apache.org/jira/browse/SPARK-20557
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.3.0
>Reporter: Jannik Arndt
>  Labels: easyfix, jdbc, oracle, sql, timestamp
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Reading from an Oracle DB table with a column of type TIMESTAMP WITH TIME 
> ZONE via jdbc ({{spark.sqlContext.read.format("jdbc").option(...).load()}}) 
> results in an error:
> {{Unsupported type -101}}
> {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$getCatalystType(JdbcUtils.scala:209)}}
> {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$5.apply(JdbcUtils.scala:246)}}
> {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$5.apply(JdbcUtils.scala:246)}}
> That is because the type 
> {{[java.sql.Types.TIMESTAMP_WITH_TIMEZONE|https://docs.oracle.com/javase/8/docs/api/java/sql/Types.html#TIMESTAMP_WITH_TIMEZONE]}}
>  (in Java since 1.8) is missing in 
> {{[JdbcUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L225]}}
>  
> This is similar to SPARK-7039.
> I created a pull request with a fix.






[jira] [Created] (SPARK-20566) ColumnVector should support `appendFloats` for array

2017-05-02 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-20566:
-

 Summary: ColumnVector should support `appendFloats` for array
 Key: SPARK-20566
 URL: https://issues.apache.org/jira/browse/SPARK-20566
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0, 2.0.0
Reporter: Dongjoon Hyun


This PR aims to add a missing `appendFloats` API for array into ColumnVector 
class. For double type, there is `appendDoubles` for array 
[here|https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java#L818-L824].






[jira] [Updated] (SPARK-20565) Improve the error message for unsupported JDBC types

2017-05-02 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20565:

Description: 
For unsupported data types, we simply output the type number instead of the 
type name. 

{noformat}
java.sql.SQLException: Unsupported type 2014
{noformat}

We should improve it by outputting its name.


  was:
For unsupported data types, we simply output the type number instead of the 
type name. 

{noformat}
java.sql.SQLException: Unsupported type 2014
{noformat}



> Improve the error message for unsupported JDBC types
> 
>
> Key: SPARK-20565
> URL: https://issues.apache.org/jira/browse/SPARK-20565
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> For unsupported data types, we simply output the type number instead of the 
> type name. 
> {noformat}
> java.sql.SQLException: Unsupported type 2014
> {noformat}
> We should improve it by outputting its name.
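For reference, code 2014 in the example above is 
{{java.sql.Types.TIMESTAMP_WITH_TIMEZONE}} (the same type discussed in 
SPARK-20557). A minimal sketch, not the actual patch, of the kind of lookup an 
improved message could use to turn a JDBC type code into its {{java.sql.Types}} 
constant name:

{code}
import java.sql.Types

// Map a JDBC type code back to the name of its java.sql.Types constant,
// falling back to the raw number if it is unknown.
def jdbcTypeName(code: Int): String =
  classOf[Types].getFields
    .find(f => f.getType == classOf[Int] && f.getInt(null) == code)
    .map(_.getName)
    .getOrElse(code.toString)

println(jdbcTypeName(2014))  // TIMESTAMP_WITH_TIMEZONE
{code}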






[jira] [Created] (SPARK-20565) Improve the error message for unsupported JDBC types

2017-05-02 Thread Xiao Li (JIRA)
Xiao Li created SPARK-20565:
---

 Summary: Improve the error message for unsupported JDBC types
 Key: SPARK-20565
 URL: https://issues.apache.org/jira/browse/SPARK-20565
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Xiao Li
Assignee: Xiao Li


For unsupported data types, we simply output the type number instead of the 
type name. 

{noformat}
java.sql.SQLException: Unsupported type 2014
{noformat}







[jira] [Assigned] (SPARK-20529) Worker should not use the received Master address

2017-05-02 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-20529:


Assignee: Shixiong Zhu

> Worker should not use the received Master address
> -
>
> Key: SPARK-20529
> URL: https://issues.apache.org/jira/browse/SPARK-20529
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Right now, when a worker connects to the master, the master sends its address 
> to the worker. The worker then saves this address and uses it to reconnect in 
> case of failure.
> However, sometimes this address is not correct. If there is a proxy between 
> the master and the worker, the address the master sends is not the address of 
> the proxy.






[jira] [Updated] (SPARK-20529) Worker should not use the received Master address

2017-05-02 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-20529:
-
Affects Version/s: 2.2.0
   1.6.3
   2.0.2

> Worker should not use the received Master address
> -
>
> Key: SPARK-20529
> URL: https://issues.apache.org/jira/browse/SPARK-20529
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Right now, when a worker connects to the master, the master sends its address 
> to the worker. The worker then saves this address and uses it to reconnect in 
> case of failure.
> However, sometimes this address is not correct. If there is a proxy between 
> the master and the worker, the address the master sends is not the address of 
> the proxy.






[jira] [Resolved] (SPARK-20531) Spark master shouldn't send its address back to the workers over proxied connections

2017-05-02 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-20531.
--
Resolution: Duplicate

> Spark master shouldn't send its address back to the workers over proxied 
> connections
> 
>
> Key: SPARK-20531
> URL: https://issues.apache.org/jira/browse/SPARK-20531
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.0
>Reporter: Sameer Agarwal
>
> Currently, when a Spark worker connects to the Spark master, the master sends 
> its address back to the worker (as part of the {{RegisteredWorker}} message). 
> The worker then saves this address and uses it to talk to the master. The 
> reason behind this handshake protocol is that if the master goes down, a new 
> master can always send back {{RegisteredWorker}} messages to all the workers 
> with its (new) IP address.
> However, if there is a proxy between the master and the worker, this 
> unfortunately ends up bypassing the proxy. A simple fix here is that we should 
> encode the "destination address" in the {{RegisterWorker}} message, which can 
> then be sent back to the worker as part of the {{RegisteredWorker}} message.
> cc [~zsxwing]






[jira] [Updated] (SPARK-20436) NullPointerException when restart from checkpoint file

2017-05-02 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-20436:
-
Issue Type: Question  (was: Bug)

> NullPointerException when restart from checkpoint file
> --
>
> Key: SPARK-20436
> URL: https://issues.apache.org/jira/browse/SPARK-20436
> Project: Spark
>  Issue Type: Question
>  Components: DStreams
>Affects Versions: 1.5.0
>Reporter: fangfengbin
>
> I have written a Spark Streaming application which have two DStreams.
> Code is :
> {code}
> object KafkaTwoInkfk {
>   def main(args: Array[String]) {
> val Array(checkPointDir, brokers, topic1, topic2, batchSize) = args
> val ssc = StreamingContext.getOrCreate(checkPointDir, () => 
> createContext(args))
> ssc.start()
> ssc.awaitTermination()
>   }
>   def createContext(args : Array[String]) : StreamingContext = {
> val Array(checkPointDir, brokers, topic1, topic2, batchSize) = args
> val sparkConf = new SparkConf().setAppName("KafkaWordCount")
> val ssc = new StreamingContext(sparkConf, Seconds(batchSize.toLong))
> ssc.checkpoint(checkPointDir)
> val topicArr1 = topic1.split(",")
> val topicSet1 = topicArr1.toSet
> val topicArr2 = topic2.split(",")
> val topicSet2 = topicArr2.toSet
> val kafkaParams = Map[String, String](
>   "metadata.broker.list" -> brokers
> )
> val lines1 = KafkaUtils.createDirectStream[String, String, StringDecoder, 
> StringDecoder](ssc, kafkaParams, topicSet1)
> val words1 = lines1.map(_._2).flatMap(_.split(" "))
> val wordCounts1 = words1.map(x => {
>   (x, 1L)}).reduceByKey(_ + _)
> wordCounts1.print()
> val lines2 = KafkaUtils.createDirectStream[String, String, StringDecoder, 
> StringDecoder](ssc, kafkaParams, topicSet2)
> val words2 = lines1.map(_._2).flatMap(_.split(" "))
> val wordCounts2 = words2.map(x => {
>   (x, 1L)}).reduceByKey(_ + _)
> wordCounts2.print()
> return ssc
>   }
> }
> {code}
> when  restart from checkpoint file, it throw NullPointerException:
> java.lang.NullPointerException
>   at 
> org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply$mcV$sp(DStreamCheckpointData.scala:126)
>   at 
> org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply(DStreamCheckpointData.scala:124)
>   at 
> org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply(DStreamCheckpointData.scala:124)
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1291)
>   at 
> org.apache.spark.streaming.dstream.DStreamCheckpointData.writeObject(DStreamCheckpointData.scala:124)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at 
> java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply$mcV$sp(DStream.scala:528)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply(DStream.scala:523)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply(DStream.scala:523)
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1291)
>   at 
> org.apache.spark.streaming.dstream.DStream.writeObject(DStream.scala:523)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)

[jira] [Commented] (SPARK-20436) NullPointerException when restart from checkpoint file

2017-05-02 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15993663#comment-15993663
 ] 

Shixiong Zhu commented on SPARK-20436:
--

[~ffbin] The issue is this line: {{val words2 = 
lines1.map(_._2).flatMap(_.split(" "))}}. You should replace {{lines1}} with 
{{lines2}}; otherwise {{lines2}} is never used, which causes this error.
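In other words, the second stream should be consumed as:

{code}
val words2 = lines2.map(_._2).flatMap(_.split(" "))
{code}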

> NullPointerException when restart from checkpoint file
> --
>
> Key: SPARK-20436
> URL: https://issues.apache.org/jira/browse/SPARK-20436
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.5.0
>Reporter: fangfengbin
>
> I have written a Spark Streaming application which have two DStreams.
> Code is :
> {code}
> object KafkaTwoInkfk {
>   def main(args: Array[String]) {
> val Array(checkPointDir, brokers, topic1, topic2, batchSize) = args
> val ssc = StreamingContext.getOrCreate(checkPointDir, () => 
> createContext(args))
> ssc.start()
> ssc.awaitTermination()
>   }
>   def createContext(args : Array[String]) : StreamingContext = {
> val Array(checkPointDir, brokers, topic1, topic2, batchSize) = args
> val sparkConf = new SparkConf().setAppName("KafkaWordCount")
> val ssc = new StreamingContext(sparkConf, Seconds(batchSize.toLong))
> ssc.checkpoint(checkPointDir)
> val topicArr1 = topic1.split(",")
> val topicSet1 = topicArr1.toSet
> val topicArr2 = topic2.split(",")
> val topicSet2 = topicArr2.toSet
> val kafkaParams = Map[String, String](
>   "metadata.broker.list" -> brokers
> )
> val lines1 = KafkaUtils.createDirectStream[String, String, StringDecoder, 
> StringDecoder](ssc, kafkaParams, topicSet1)
> val words1 = lines1.map(_._2).flatMap(_.split(" "))
> val wordCounts1 = words1.map(x => {
>   (x, 1L)}).reduceByKey(_ + _)
> wordCounts1.print()
> val lines2 = KafkaUtils.createDirectStream[String, String, StringDecoder, 
> StringDecoder](ssc, kafkaParams, topicSet2)
> val words2 = lines1.map(_._2).flatMap(_.split(" "))
> val wordCounts2 = words2.map(x => {
>   (x, 1L)}).reduceByKey(_ + _)
> wordCounts2.print()
> return ssc
>   }
> }
> {code}
> when  restart from checkpoint file, it throw NullPointerException:
> java.lang.NullPointerException
>   at 
> org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply$mcV$sp(DStreamCheckpointData.scala:126)
>   at 
> org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply(DStreamCheckpointData.scala:124)
>   at 
> org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply(DStreamCheckpointData.scala:124)
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1291)
>   at 
> org.apache.spark.streaming.dstream.DStreamCheckpointData.writeObject(DStreamCheckpointData.scala:124)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at 
> java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply$mcV$sp(DStream.scala:528)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply(DStream.scala:523)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply(DStream.scala:523)
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1291)
>   at 
> org.apache.spark.streaming.dstream.DStream.writeObject(DStream.scala:523)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   

[jira] [Resolved] (SPARK-20436) NullPointerException when restart from checkpoint file

2017-05-02 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-20436.
--
Resolution: Not A Problem

> NullPointerException when restart from checkpoint file
> --
>
> Key: SPARK-20436
> URL: https://issues.apache.org/jira/browse/SPARK-20436
> Project: Spark
>  Issue Type: Question
>  Components: DStreams
>Affects Versions: 1.5.0
>Reporter: fangfengbin
>
> I have written a Spark Streaming application which has two DStreams.
> Code is :
> {code}
> object KafkaTwoInkfk {
>   def main(args: Array[String]) {
> val Array(checkPointDir, brokers, topic1, topic2, batchSize) = args
> val ssc = StreamingContext.getOrCreate(checkPointDir, () => 
> createContext(args))
> ssc.start()
> ssc.awaitTermination()
>   }
>   def createContext(args : Array[String]) : StreamingContext = {
> val Array(checkPointDir, brokers, topic1, topic2, batchSize) = args
> val sparkConf = new SparkConf().setAppName("KafkaWordCount")
> val ssc = new StreamingContext(sparkConf, Seconds(batchSize.toLong))
> ssc.checkpoint(checkPointDir)
> val topicArr1 = topic1.split(",")
> val topicSet1 = topicArr1.toSet
> val topicArr2 = topic2.split(",")
> val topicSet2 = topicArr2.toSet
> val kafkaParams = Map[String, String](
>   "metadata.broker.list" -> brokers
> )
> val lines1 = KafkaUtils.createDirectStream[String, String, StringDecoder, 
> StringDecoder](ssc, kafkaParams, topicSet1)
> val words1 = lines1.map(_._2).flatMap(_.split(" "))
> val wordCounts1 = words1.map(x => {
>   (x, 1L)}).reduceByKey(_ + _)
> wordCounts1.print()
> val lines2 = KafkaUtils.createDirectStream[String, String, StringDecoder, 
> StringDecoder](ssc, kafkaParams, topicSet2)
> val words2 = lines1.map(_._2).flatMap(_.split(" "))
> val wordCounts2 = words2.map(x => {
>   (x, 1L)}).reduceByKey(_ + _)
> wordCounts2.print()
> return ssc
>   }
> }
> {code}
> when restarting from the checkpoint file, it throws a NullPointerException:
> java.lang.NullPointerException
>   at 
> org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply$mcV$sp(DStreamCheckpointData.scala:126)
>   at 
> org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply(DStreamCheckpointData.scala:124)
>   at 
> org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply(DStreamCheckpointData.scala:124)
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1291)
>   at 
> org.apache.spark.streaming.dstream.DStreamCheckpointData.writeObject(DStreamCheckpointData.scala:124)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at 
> java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply$mcV$sp(DStream.scala:528)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply(DStream.scala:523)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply(DStream.scala:523)
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1291)
>   at 
> org.apache.spark.streaming.dstream.DStream.writeObject(DStream.scala:523)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
>   at 

[jira] [Updated] (SPARK-20547) ExecutorClassLoader's findClass may not work correctly when a task is cancelled.

2017-05-02 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-20547:
-
Target Version/s:   (was: 2.2.0)

> ExecutorClassLoader's findClass may not work correctly when a task is 
> cancelled.
> 
>
> Key: SPARK-20547
> URL: https://issues.apache.org/jira/browse/SPARK-20547
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Shixiong Zhu
>Priority: Minor
>
> ExecutorClassLoader's findClass may throw a transient exception. For 
> example, when a task is cancelled while ExecutorClassLoader is running, you may 
> see an InterruptedException or IOException even though the class can be loaded. 
> The result of findClass is then cached by the JVM, and later, when the same 
> class is loaded again (note: in this case the class may still be loadable), 
> it will just throw NoClassDefFoundError.
> We should make ExecutorClassLoader retry on transient exceptions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20547) ExecutorClassLoader's findClass may not work correctly when a task is cancelled.

2017-05-02 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-20547:
-
Priority: Minor  (was: Blocker)

> ExecutorClassLoader's findClass may not work correctly when a task is 
> cancelled.
> 
>
> Key: SPARK-20547
> URL: https://issues.apache.org/jira/browse/SPARK-20547
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Shixiong Zhu
>Priority: Minor
>
> ExecutorClassLoader's findClass may throw a transient exception. For 
> example, when a task is cancelled while ExecutorClassLoader is running, you may 
> see an InterruptedException or IOException even though the class can be loaded. 
> The result of findClass is then cached by the JVM, and later, when the same 
> class is loaded again (note: in this case the class may still be loadable), 
> it will just throw NoClassDefFoundError.
> We should make ExecutorClassLoader retry on transient exceptions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20564) a lot of executor failures when the executor number is more than 2000

2017-05-02 Thread Hua Liu (JIRA)
Hua Liu created SPARK-20564:
---

 Summary: a lot of executor failures when the executor number is 
more than 2000
 Key: SPARK-20564
 URL: https://issues.apache.org/jira/browse/SPARK-20564
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 2.1.0, 1.6.2
Reporter: Hua Liu


When we used more than 2000 executors in a Spark application, we noticed that a 
large number of executors could not connect to the driver and, as a result, were 
marked as failed. In some cases the number of failed executors reached twice the 
requested executor count, so applications retried and could eventually fail.

This is because YarnAllocator requests all missing containers every 
spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, 
YarnAllocator can ask for and get 2000 containers in one request and then launch 
them. These thousands of executors all try to retrieve the Spark properties and 
register with the driver. However, the driver handles executor registration, 
stop, removal and Spark property retrieval in a single thread, and it cannot 
handle such a large number of RPCs within a short period of time. As a result, 
some executors cannot retrieve the Spark properties and/or register. They are 
then marked as failed, which triggers executor removal, further overloads the 
driver, and causes even more executor failures.

This patch adds an extra configuration, 
spark.yarn.launchContainer.count.simultaneously, which caps the maximum number of 
containers the driver can ask for and launch in every 
spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of executors 
grows steadily, the number of executor failures is reduced, and applications 
reach the desired number of executors faster.
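
A hypothetical usage sketch, assuming the proposed option lands as described above 
(the option name is taken from this description; the value 50 is purely 
illustrative):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.yarn.scheduler.heartbeat.interval-ms", "3000")      // default heartbeat interval (3s)
  .set("spark.yarn.launchContainer.count.simultaneously", "50")   // cap on containers launched per heartbeat
{code}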



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20556) codehaus fails to generate code because of unescaped strings

2017-05-02 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15993464#comment-15993464
 ] 

Herman van Hovell edited comment on SPARK-20556 at 5/2/17 6:34 PM:
---

[~vlyubin] This was fixed in SPARK-18952. This should be part of the coming 
2.1.1 release. 


was (Author: hvanhovell):
[~vlyubin] This was fixed in SPARK-18952; let's just backport that change to 
2.1.

> codehaus fails to generate code because of unescaped strings
> 
>
> Key: SPARK-20556
> URL: https://issues.apache.org/jira/browse/SPARK-20556
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Volodymyr Lyubinets
> Fix For: 2.2.0
>
>
> I guess somewhere along the way Spark uses codehaus to generate optimized 
> code, but if it fails to do so, it falls back to an alternative way. Here's a 
> log string that I see when executing one command on dataframes:
> 17/05/02 12:00:14 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 93, Column 13: ')' expected instead of 'type'
> ...
> /* 088 */ private double loadFactor = 0.5;
> /* 089 */ private int numBuckets = (int) (capacity / loadFactor);
> /* 090 */ private int maxSteps = 2;
> /* 091 */ private int numRows = 0;
> /* 092 */ private org.apache.spark.sql.types.StructType keySchema = new 
> org.apache.spark.sql.types.StructType().add("taxonomyPayload", 
> org.apache.spark.sql.types.DataTypes.StringType)
> /* 093 */ .add("{"type":"nontemporal"}", 
> org.apache.spark.sql.types.DataTypes.StringType)
> /* 094 */ .add("spatialPayload", 
> org.apache.spark.sql.types.DataTypes.StringType);
> /* 095 */ private org.apache.spark.sql.types.StructType valueSchema = new 
> org.apache.spark.sql.types.StructType().add("sum", 
> org.apache.spark.sql.types.DataTypes.DoubleType);
> /* 096 */ private Object emptyVBase;
> /* 097 */ private long emptyVOff;
> /* 098 */ private int emptyVLen;
> /* 099 */ private boolean isBatchFull = false;
> /* 100 */
> It looks like on line 93 it failed to escape that string (that happened to be 
> in my code). I'm not sure how critical this is, but seems like there's 
> escaping missing somewhere.
> Stack trace that happens afterwards: https://pastebin.com/NmgTfwN0



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7481) Add spark-hadoop-cloud module to pull in object store support

2017-05-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15993465#comment-15993465
 ] 

Apache Spark commented on SPARK-7481:
-

User 'steveloughran' has created a pull request for this issue:
https://github.com/apache/spark/pull/17834

> Add spark-hadoop-cloud module to pull in object store support
> -
>
> Key: SPARK-7481
> URL: https://issues.apache.org/jira/browse/SPARK-7481
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Steve Loughran
>
> To keep the s3n classpath right and to add s3a, swift & azure support, the 
> Spark dependencies in a Hadoop 2.6+ profile need to include the relevant object 
> store packages (hadoop-aws, hadoop-openstack, hadoop-azure).
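
Until such a module exists, a common workaround (a sketch, not part of this 
proposal; the 2.7.3 version is illustrative and must match the cluster's Hadoop 
version) is to pull the connectors in at submit time:

{code}
spark-submit \
  --packages org.apache.hadoop:hadoop-aws:2.7.3,org.apache.hadoop:hadoop-openstack:2.7.3,org.apache.hadoop:hadoop-azure:2.7.3 \
  ...
{code}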



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20556) codehaus fails to generate code because of unescaped strings

2017-05-02 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15993464#comment-15993464
 ] 

Herman van Hovell commented on SPARK-20556:
---

[~vlyubin] This was fixed in SPARK-18952; let's just backport that change to 
2.1.

> codehaus fails to generate code because of unescaped strings
> 
>
> Key: SPARK-20556
> URL: https://issues.apache.org/jira/browse/SPARK-20556
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Volodymyr Lyubinets
> Fix For: 2.2.0
>
>
> I guess somewhere along the way Spark uses codehaus to generate optimized 
> code, but if it fails to do so, it falls back to an alternative way. Here's a 
> log string that I see when executing one command on dataframes:
> 17/05/02 12:00:14 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 93, Column 13: ')' expected instead of 'type'
> ...
> /* 088 */ private double loadFactor = 0.5;
> /* 089 */ private int numBuckets = (int) (capacity / loadFactor);
> /* 090 */ private int maxSteps = 2;
> /* 091 */ private int numRows = 0;
> /* 092 */ private org.apache.spark.sql.types.StructType keySchema = new 
> org.apache.spark.sql.types.StructType().add("taxonomyPayload", 
> org.apache.spark.sql.types.DataTypes.StringType)
> /* 093 */ .add("{"type":"nontemporal"}", 
> org.apache.spark.sql.types.DataTypes.StringType)
> /* 094 */ .add("spatialPayload", 
> org.apache.spark.sql.types.DataTypes.StringType);
> /* 095 */ private org.apache.spark.sql.types.StructType valueSchema = new 
> org.apache.spark.sql.types.StructType().add("sum", 
> org.apache.spark.sql.types.DataTypes.DoubleType);
> /* 096 */ private Object emptyVBase;
> /* 097 */ private long emptyVOff;
> /* 098 */ private int emptyVLen;
> /* 099 */ private boolean isBatchFull = false;
> /* 100 */
> It looks like on line 93 it failed to escape that string (that happened to be 
> in my code). I'm not sure how critical this is, but seems like there's 
> escaping missing somewhere.
> Stack trace that happens afterwards: https://pastebin.com/NmgTfwN0



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-20556) codehaus fails to generate code because of unescaped strings

2017-05-02 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell closed SPARK-20556.
-
   Resolution: Duplicate
Fix Version/s: 2.2.0

> codehaus fails to generate code because of unescaped strings
> 
>
> Key: SPARK-20556
> URL: https://issues.apache.org/jira/browse/SPARK-20556
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Volodymyr Lyubinets
> Fix For: 2.2.0
>
>
> I guess somewhere along the way Spark uses codehaus to generate optimized 
> code, but if it fails to do so, it falls back to an alternative way. Here's a 
> log string that I see when executing one command on dataframes:
> 17/05/02 12:00:14 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 93, Column 13: ')' expected instead of 'type'
> ...
> /* 088 */ private double loadFactor = 0.5;
> /* 089 */ private int numBuckets = (int) (capacity / loadFactor);
> /* 090 */ private int maxSteps = 2;
> /* 091 */ private int numRows = 0;
> /* 092 */ private org.apache.spark.sql.types.StructType keySchema = new 
> org.apache.spark.sql.types.StructType().add("taxonomyPayload", 
> org.apache.spark.sql.types.DataTypes.StringType)
> /* 093 */ .add("{"type":"nontemporal"}", 
> org.apache.spark.sql.types.DataTypes.StringType)
> /* 094 */ .add("spatialPayload", 
> org.apache.spark.sql.types.DataTypes.StringType);
> /* 095 */ private org.apache.spark.sql.types.StructType valueSchema = new 
> org.apache.spark.sql.types.StructType().add("sum", 
> org.apache.spark.sql.types.DataTypes.DoubleType);
> /* 096 */ private Object emptyVBase;
> /* 097 */ private long emptyVOff;
> /* 098 */ private int emptyVLen;
> /* 099 */ private boolean isBatchFull = false;
> /* 100 */
> It looks like on line 93 it failed to escape that string (that happened to be 
> in my code). I'm not sure how critical this is, but seems like there's 
> escaping missing somewhere.
> Stack trace that happens afterwards: https://pastebin.com/NmgTfwN0



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20556) codehaus fails to generate code because of unescaped strings

2017-05-02 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-20556:
--
Component/s: (was: Optimizer)
 SQL

> codehaus fails to generate code because of unescaped strings
> 
>
> Key: SPARK-20556
> URL: https://issues.apache.org/jira/browse/SPARK-20556
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Volodymyr Lyubinets
>
> I guess somewhere along the way Spark uses codehaus to generate optimized 
> code, but if it fails to do so, it falls back to an alternative way. Here's a 
> log string that I see when executing one command on dataframes:
> 17/05/02 12:00:14 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 93, Column 13: ')' expected instead of 'type'
> ...
> /* 088 */ private double loadFactor = 0.5;
> /* 089 */ private int numBuckets = (int) (capacity / loadFactor);
> /* 090 */ private int maxSteps = 2;
> /* 091 */ private int numRows = 0;
> /* 092 */ private org.apache.spark.sql.types.StructType keySchema = new 
> org.apache.spark.sql.types.StructType().add("taxonomyPayload", 
> org.apache.spark.sql.types.DataTypes.StringType)
> /* 093 */ .add("{"type":"nontemporal"}", 
> org.apache.spark.sql.types.DataTypes.StringType)
> /* 094 */ .add("spatialPayload", 
> org.apache.spark.sql.types.DataTypes.StringType);
> /* 095 */ private org.apache.spark.sql.types.StructType valueSchema = new 
> org.apache.spark.sql.types.StructType().add("sum", 
> org.apache.spark.sql.types.DataTypes.DoubleType);
> /* 096 */ private Object emptyVBase;
> /* 097 */ private long emptyVOff;
> /* 098 */ private int emptyVLen;
> /* 099 */ private boolean isBatchFull = false;
> /* 100 */
> It looks like on line 93 it failed to escape that string (that happened to be 
> in my code). I'm not sure how critical this is, but seems like there's 
> escaping missing somewhere.
> Stack trace that happens afterwards: https://pastebin.com/NmgTfwN0



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20547) ExecutorClassLoader's findClass may not work correctly when a task is cancelled.

2017-05-02 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15993426#comment-15993426
 ] 

Shixiong Zhu commented on SPARK-20547:
--

Did some investigation using the reproducer. It looks like it’s not a class loader 
issue; it’s because the class initialization result will be cached. My current 
proposal is to recreate the class loader if a task fails.

> ExecutorClassLoader's findClass may not work correctly when a task is 
> cancelled.
> 
>
> Key: SPARK-20547
> URL: https://issues.apache.org/jira/browse/SPARK-20547
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Shixiong Zhu
>Priority: Blocker
>
> ExecutorClassLoader's findClass may throw a transient exception. For 
> example, when a task is cancelled while ExecutorClassLoader is running, you may 
> see an InterruptedException or IOException even though the class can be loaded. 
> The result of findClass is then cached by the JVM, and later, when the same 
> class is loaded again (note: in this case the class may still be loadable), 
> it will just throw NoClassDefFoundError.
> We should make ExecutorClassLoader retry on transient exceptions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20370) create external table on read only location fails

2017-05-02 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15993405#comment-15993405
 ] 

Steve Loughran commented on SPARK-20370:


Is this the bit under the PR tagged "!! HACK ALERT !!" by any chance? If so, it 
seems to have gone in for a Hive metastore workaround. I wonder if there is/can 
be a solution in Hive-land.

> create external table on read only location fails
> -
>
> Key: SPARK-20370
> URL: https://issues.apache.org/jira/browse/SPARK-20370
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
> Environment: EMR 5.4.0, hadoop 2.7.3, spark 2.1.0
>Reporter: Gaurav Shah
>Priority: Minor
>
> Creating an external table via the following fails:
>  sqlContext.createExternalTable(
>  "table_name",
>  "org.apache.spark.sql.parquet",
>  inputSchema,
>  Map(
>"path" -> "s3a://bucket-name/folder",
>"mergeSchema" -> "false"
>  )
>)
> In the following commit, Spark tries to check whether it has write access to the 
> given location; this check fails and so the table metadata creation fails. 
> https://github.com/apache/spark/pull/13270/files
> The table creation script works even if the cluster has read-only access in 
> Spark 1.6, but it fails in Spark 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20560) Review Spark's handling of filesystems returning "localhost" in getFileBlockLocations

2017-05-02 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15993323#comment-15993323
 ] 

Steve Loughran commented on SPARK-20560:


To follow this up, I've now got a test which verifies that (a) s3a returns 
"localhost" and (b) spark discards it. This'll catch any regressions in the s3a 
client.
{code}
val source = CSV_TESTFILE.get
val fs = getFilesystem(source)
val blockLocations = fs.getFileBlockLocations(source, 0, 1)
assert(1 === blockLocations.length,
  s"block location array size wrong: ${blockLocations}")
val hosts = blockLocations(0).getHosts
assert(1 === hosts.length, s"wrong host size ${hosts}")
assert("localhost" === hosts(0), "hostname")

val path = source.toString
val rdd = sc.hadoopFile[LongWritable, Text, TextInputFormat](path, 1)
val input = rdd.asInstanceOf[HadoopRDD[_, _]]
val partitions = input.getPartitions
val locations = input.getPreferredLocations(partitions.head)
assert(locations.isEmpty, s"Location list not empty ${locations}")
{code}

> Review Spark's handling of filesystems returning "localhost" in 
> getFileBlockLocations
> -
>
> Key: SPARK-20560
> URL: https://issues.apache.org/jira/browse/SPARK-20560
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Steve Loughran
>Priority: Minor
>
> Some filesystems (S3a, Azure WASB) return "localhost" as the response to 
> {{FileSystem.getFileBlockLocations(path)}}. If this is then used as the 
> preferred host when scheduling work, there's a risk that work will be queued 
> on one host, rather than spread across the cluster.
> HIVE-14060 and TEZ-3291 have both seen it in their schedulers.
> I don't know if Spark does it; someone needs to look at the code and maybe write 
> some tests.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20563) going from DataFrame to RDD and back changes the schema, if the schema is not explicitly provided

2017-05-02 Thread Danil Kirsanov (JIRA)
Danil Kirsanov created SPARK-20563:
--

 Summary: going from DataFrame to RDD and back changes the schema, if 
the schema is not explicitly provided
 Key: SPARK-20563
 URL: https://issues.apache.org/jira/browse/SPARK-20563
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.1.0
Reporter: Danil Kirsanov
Priority: Minor


df.rdd.toDF() converts IntegerType columns of a DataFrame to LongType if the 
schema is not explicitly provided to toDF().

Below is a full reproduction:
-

from pyspark.sql.types import IntegerType, StructType, StructField
schema = StructType([StructField("a",IntegerType(),True), 
StructField("b",IntegerType(),True)])

df_test = spark.createDataFrame([(1,2)], schema)
df_test.printSchema()
df_test.rdd.toDF().printSchema()
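
A minimal workaround sketch, assuming the same session and schema as above: pass 
the schema explicitly so Python-side type inference is skipped and IntegerType is 
preserved instead of being widened to LongType.

{code}
# preserves IntegerType because no schema inference happens
df_test.rdd.toDF(schema).printSchema()

# equivalent form going through createDataFrame
spark.createDataFrame(df_test.rdd, schema).printSchema()
{code}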




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20562) Support Maintenance by having a threshold for unavailability

2017-05-02 Thread Kamal Gurala (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kamal Gurala updated SPARK-20562:
-
Description: 
Make Spark be aware of offers that have an unavailability period set because of 
a scheduled Maintenance on the node.

Have a configurable option that's a threshold which ensures that tasks are not 
scheduled on offers that are within a threshold for maintenance

  was:
Make Spark be aware of offers that have an unavailability period set because of 
a scheduled Maintenance on the node.
Have a configurable option that's a threshold which ensures that tasks are not 
scheduled on offers that are within a threshold for maintenance


> Support Maintenance by having a threshold for unavailability
> 
>
> Key: SPARK-20562
> URL: https://issues.apache.org/jira/browse/SPARK-20562
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Kamal Gurala
>
> Make Spark be aware of offers that have an unavailability period set because 
> of a scheduled Maintenance on the node.
> Have a configurable option that's a threshold which ensures that tasks are 
> not scheduled on offers that are within a threshold for maintenance



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20562) Support Maintenance by having a threshold for unavailability

2017-05-02 Thread Kamal Gurala (JIRA)
Kamal Gurala created SPARK-20562:


 Summary: Support Maintenance by having a threshold for 
unavailability
 Key: SPARK-20562
 URL: https://issues.apache.org/jira/browse/SPARK-20562
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 2.1.0
Reporter: Kamal Gurala


Make Spark be aware of offers that have an unavailability period set because of 
a scheduled Maintenance on the node.
Have a configurable option that's a threshold which ensures that tasks are not 
scheduled on offers that are within a threshold for maintenance



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20560) Review Spark's handling of filesystems returning "localhost" in getFileBlockLocations

2017-05-02 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran resolved SPARK-20560.

Resolution: Invalid

"localhost" is filtered, been done in {{HadoopRDD.getPreferredLocations()}} 
since commit #06aac8a.



> Review Spark's handling of filesystems returning "localhost" in 
> getFileBlockLocations
> -
>
> Key: SPARK-20560
> URL: https://issues.apache.org/jira/browse/SPARK-20560
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Steve Loughran
>Priority: Minor
>
> Some filesystems (S3a, Azure WASB) return "localhost" as the response to 
> {{FileSystem.getFileBlockLocations(path)}}. If this is then used as the 
> preferred host when scheduling work, there's a risk that work will be queued 
> on one host, rather than spread across the cluster.
> HIVE-14060 and TEZ-3291 have both seen it in their schedulers.
> I don't know if Spark does it; someone needs to look at the code and maybe write 
> some tests.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20556) codehaus fails to generate code because of unescaped strings

2017-05-02 Thread Volodymyr Lyubinets (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15993023#comment-15993023
 ] 

Volodymyr Lyubinets commented on SPARK-20556:
-

Here's edited code that reproduces this:

{code}
import org.apache.spark.sql.functions.sum
import org.apache.spark.sql.functions.{col, lit}

val NON_TEMPORAL: String = "{\"type\":\"nontemporal\"}"
val joined = spark.sqlContext.read.parquet(BLAH)
val scores = joined.withColumn("temp", lit(NON_TEMPORAL)).select(col("x1"), 
col("temp"), col("x2"), col("x3"))
val results = scores.groupBy(col("x1"), col("temp"), 
col("x3")).agg(BLAH).take(1000)
{code}

> codehaus fails to generate code because of unescaped strings
> 
>
> Key: SPARK-20556
> URL: https://issues.apache.org/jira/browse/SPARK-20556
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.1.0
>Reporter: Volodymyr Lyubinets
>
> I guess somewhere along the way Spark uses codehaus to generate optimized 
> code, but if it fails to do so, it falls back to an alternative way. Here's a 
> log string that I see when executing one command on dataframes:
> 17/05/02 12:00:14 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 93, Column 13: ')' expected instead of 'type'
> ...
> /* 088 */ private double loadFactor = 0.5;
> /* 089 */ private int numBuckets = (int) (capacity / loadFactor);
> /* 090 */ private int maxSteps = 2;
> /* 091 */ private int numRows = 0;
> /* 092 */ private org.apache.spark.sql.types.StructType keySchema = new 
> org.apache.spark.sql.types.StructType().add("taxonomyPayload", 
> org.apache.spark.sql.types.DataTypes.StringType)
> /* 093 */ .add("{"type":"nontemporal"}", 
> org.apache.spark.sql.types.DataTypes.StringType)
> /* 094 */ .add("spatialPayload", 
> org.apache.spark.sql.types.DataTypes.StringType);
> /* 095 */ private org.apache.spark.sql.types.StructType valueSchema = new 
> org.apache.spark.sql.types.StructType().add("sum", 
> org.apache.spark.sql.types.DataTypes.DoubleType);
> /* 096 */ private Object emptyVBase;
> /* 097 */ private long emptyVOff;
> /* 098 */ private int emptyVLen;
> /* 099 */ private boolean isBatchFull = false;
> /* 100 */
> It looks like on line 93 it failed to escape that string (that happened to be 
> in my code). I'm not sure how critical this is, but seems like there's 
> escaping missing somewhere.
> Stack trace that happens afterwards: https://pastebin.com/NmgTfwN0



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20561) Running SparkR with no RHive installed in secured environment

2017-05-02 Thread Natalie (JIRA)
Natalie created SPARK-20561:
---

 Summary: Running SparkR with no RHive installed in secured 
environment 
 Key: SPARK-20561
 URL: https://issues.apache.org/jira/browse/SPARK-20561
 Project: Spark
  Issue Type: Question
  Components: Examples, Input/Output
Affects Versions: 2.1.0
 Environment: Hadoop, Spark
Reporter: Natalie


I need to start running data mining analysis in a secured environment (IP, port, 
and database name are given), where Spark runs on Hive tables. So I have 
installed R, SparkR, dplyr, and some other R libraries. 
Now I understand that I need to point SparkR to that database (with 
IP/port/name).
What should my R code be?
I start by invoking R,
then loading the SparkR library.
Next I write sc <- sparkR.init()
and it immediately tells me: spark-submit command not found.
Do I need to have RHive installed first? 
Or should I actually point somehow to the Spark library and to that database?
I couldn't find any documentation on that.
Thank you



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20560) Review Spark's handling of filesystems returning "localhost" in getFileBlockLocations

2017-05-02 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15993008#comment-15993008
 ] 

Steve Loughran commented on SPARK-20560:


{{FileSystem.getFileBlockLocations(path)}} is only invoked from 
{{HdfsUtils.getFileSegmentLocations}}, and used as a source of data for 
{{RDD.preferredLocations}}.

I don't see anything explicit in the code that detects and reacts to the FS 
call returning localhost; I'll do some tests downstream to see what surfaces 
against S3. Unless the scheduler has some explicit "localhost -> anywhere" map, 
it might make sense for HdfsUtils.getFileSegmentLocation to downgrade 
"localhost" to None, on the basis that in a cluster FS the data clearly 
doesn't know where it is.
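
A minimal sketch of that downgrade idea (an assumption for illustration, not 
Spark's actual implementation):

{code}
// treat "localhost" returned by getFileBlockLocations as "no locality preference";
// an empty result means the scheduler is free to run the task anywhere
def preferredHosts(blockHosts: Array[String]): Seq[String] =
  blockHosts.toSeq.filter(_ != "localhost")
{code}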



> Review Spark's handling of filesystems returning "localhost" in 
> getFileBlockLocations
> -
>
> Key: SPARK-20560
> URL: https://issues.apache.org/jira/browse/SPARK-20560
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Steve Loughran
>Priority: Minor
>
> Some filesystems (S3a, Azure WASB) return "localhost" as the response to 
> {{FileSystem.getFileBlockLocations(path)}}. If this is then used as the 
> preferred host when scheduling work, there's a risk that work will be queued 
> on one host, rather than spread across the cluster.
> HIVE-14060 and TEZ-3291 have both seen it in their schedulers.
> I don't know if Spark does it; someone needs to look at the code and maybe write 
> some tests.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20560) Review Spark's handling of filesystems returning "localhost" in getFileBlockLocations

2017-05-02 Thread Steve Loughran (JIRA)
Steve Loughran created SPARK-20560:
--

 Summary: Review Spark's handling of filesystems returning 
"localhost" in getFileBlockLocations
 Key: SPARK-20560
 URL: https://issues.apache.org/jira/browse/SPARK-20560
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 2.1.0
Reporter: Steve Loughran
Priority: Minor


Some filesystems (S3a, Azure WASB) return "localhost" as the response to 
{{FileSystem.getFileBlockLocations(path)}}. If this is then used as the 
preferred host when scheduling work, there's a risk that work will be queued on 
one host, rather than spread across the cluster.

HIVE-14060 and TEZ-3291 have both seen it in their schedulers.

I don't know if Spark does it; someone needs to look at the code and maybe write 
some tests.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20559) Refreshing a cached RDD without restarting the Spark application

2017-05-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20559.
---
Resolution: Invalid

This should go to u...@spark.apache.org

> Refreshing a cached RDD without restarting the Spark application
> 
>
> Key: SPARK-20559
> URL: https://issues.apache.org/jira/browse/SPARK-20559
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Jayesh lalwani
>
> We have a Structured Streaming application that gets accounts from Kafka into 
> a streaming data frame. We have a blacklist of accounts stored in S3 and we 
> want to filter out all the accounts that are blacklisted. So, we are loading 
> the blacklisted accounts into a batch data frame and joining it with the 
> streaming data frame to filter out the bad accounts.
> Now, the blacklist doesn't change very often... once a week at most. So we 
> wanted to cache the blacklist data frame to prevent going out to S3 
> every time. Since the blacklist might change, we want to be able to refresh 
> the cache at a cadence, without restarting the whole app.
> So, to begin with we wrote a simple app that caches and refreshes a simple 
> data frame. The steps we followed are
> * Create a CSV file
> * load CSV into a DF: df = spark.read.csv(filename)
> * Persist the data frame: df.persist
> * Now when we do df.show, we see the contents of the csv.
> * We change the CSV, and call df.show, we can see that the old contents are 
> being displayed, proving that the df is cached
> * df.unpersist
> * df.persist
> * df.show
> * What we see is that the rows that were modified in the CSV are reloaded, 
> but new rows aren't.
> Is this expected behavior? Is there a better way to refresh cached data 
> without restarting the Spark application?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20559) Refreshing a cached RDD without restarting the Spark application

2017-05-02 Thread Jayesh lalwani (JIRA)
Jayesh lalwani created SPARK-20559:
--

 Summary: Refreshing a cached RDD without restarting the Spark 
application
 Key: SPARK-20559
 URL: https://issues.apache.org/jira/browse/SPARK-20559
 Project: Spark
  Issue Type: Question
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Jayesh lalwani


We have a Structured Streaming application that gets accounts from Kafka into a 
streaming data frame. We have a blacklist of accounts stored in S3 and we want 
to filter out all the accounts that are blacklisted. So, we are loading the 
blacklisted accounts into a batch data frame and joining it with the streaming 
data frame to filter out the bad accounts.

Now, the blacklist doesn't change very often... once a week at most. So we 
wanted to cache the blacklist data frame to prevent going out to S3 every time. 
Since the blacklist might change, we want to be able to refresh the cache at a 
cadence, without restarting the whole app.

So, to begin with we wrote a simple app that caches and refreshes a simple data 
frame. The steps we followed are

* Create a CSV file
* load CSV into a DF: df = spark.read.csv(filename)
* Persist the data frame: df.persist
* Now when we do df.show, we see the contents of the csv.
* We change the CSV, and call df.show, we can see that the old contents are 
being displayed, proving that the df is cached
* df.unpersist
* df.persist
* df.show
* What we see is that the rows that were modified in the CSV are reloaded, but 
new rows aren't.

Is this expected behavior? Is there a better way to refresh cached data without 
restarting the Spark application?
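
One pattern that matches the steps above (a sketch under the assumption that 
rebuilding the DataFrame is acceptable, not a confirmed recommendation): re-read 
the source on each refresh instead of re-persisting the same cached instance, so 
the cache is rebuilt from a freshly planned read of the file.

{code}
var blacklist = spark.read.csv(filename).persist()
// ... later, on the refresh cadence:
blacklist.unpersist()
blacklist = spark.read.csv(filename).persist()   // fresh DataFrame picks up new rows/files
blacklist.show()
{code}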



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19582) DataFrameReader conceptually inadequate

2017-05-02 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992985#comment-15992985
 ] 

Steve Loughran commented on SPARK-19582:


All Spark is doing is taking a URL to data, mapping that to an FS 
implementation classname and expecting that to implement the methods in 
`org.apache.hadoop.FileSystem` so as to provide FS-like behaviour.

Given that minio is nominally an S3 clone, it sounds like there's a problem here 
setting up the Hadoop S3A client to bind to it. I'd isolate that to the Hadoop 
code before going near Spark, test on Hadoop 2.8, and file bugs against Hadoop 
and/or minio if there are problems. AFAIK, nobody has run the Hadoop S3A 
[tests|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/testing.md]
 against minio; doing that and documenting how to configure the client would be 
a welcome contribution. If minio is 100% S3 compatible (v2/v4 auth + multipart 
PUT; encryption optional), then the S3A client should work with it... it could 
work as another integration test for minio.
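
For anyone trying that binding, a hypothetical configuration sketch (not from this 
issue; the endpoint and credentials are placeholders, while the fs.s3a.* keys are 
standard Hadoop S3A options):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-minio-example")
  .config("spark.hadoop.fs.s3a.endpoint", "http://minio-host:9000")
  .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
  .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
  .config("spark.hadoop.fs.s3a.path.style.access", "true")
  .getOrCreate()

// reads then go through the s3a:// scheme as usual
val df = spark.read.text("s3a://bucket/path/to/data.txt")
{code}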

> DataFrameReader conceptually inadequate
> ---
>
> Key: SPARK-19582
> URL: https://issues.apache.org/jira/browse/SPARK-19582
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.1.0
>Reporter: James Q. Arnold
>
> DataFrameReader assumes it "understands" all data sources (local file system, 
> object stores, jdbc, ...).  This seems limiting in the long term, imposing 
> both development costs to accept new sources and dependency issues for 
> existing sources (how to coordinate the XX jar for internal use vs. the XX 
> jar used by the application).  Unless I have missed how this can be done 
> currently, an application with an unsupported data source cannot create the 
> required RDD for distribution.
> I recommend at least providing a text API for supplying data.  Let the 
> application provide data as a String (or char[] or ...)---not a path, but the 
> actual data.  Alternatively, provide interfaces or abstract classes the 
> application could provide to let the application handle external data 
> sources, without forcing all that complication into the Spark implementation.
> I don't have any code to submit, but JIRA seemed like to most appropriate 
> place to raise the issue.
> Finally, if I have overlooked how this can be done with the current API, a 
> new example would be appreciated.
> Additional detail...
> We use the minio object store, which provides an API compatible with AWS-S3.  
> A few configuration/parameter values differ for minio, but one can use the 
> AWS library in the application to connect to the minio server.
> When trying to use minio objects through spark, the s3://xxx paths are 
> intercepted by spark and handed to hadoop.  So far, I have been unable to 
> find the right combination of configuration values and parameters to 
> "convince" hadoop to forward the right information to work with minio.  If I 
> could read the minio object in the application, and then hand the object 
> contents directly to spark, I could bypass hadoop and solve the problem.  
> Unfortunately, the underlying spark design prevents that.  So, I see two 
> problems.
> -  Spark seems to have taken on the responsibility of "knowing" the API 
> details of all data sources.  This seems iffy in the long run (and is the 
> root of my current problem).  In the long run, it seems unwise to assume that 
> spark should understand all possible path names, protocols, etc.  Moreover, 
> passing S3 paths to hadoop seems a little odd (why not go directly to AWS, 
> for example).  This particular confusion about S3 shows the difficulties that 
> are bound to occur.
> -  Second, spark appears not to have a way to bypass the path name 
> interpretation.  At the least, spark could provide a text/blob interface, 
> letting the application supply the data object and avoid path interpretation 
> inside spark.  Alternatively, spark could accept a reader/stream/... to build 
> the object, again letting the application provide the implementation of the 
> object input.
> As I mentioned above, I might be missing something in the API that lets us 
> work around the problem.  I'll keep looking, but the API as apparently 
> structured seems too limiting.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20548) Flaky Test: ReplSuite.newProductSeqEncoder with REPL defined class

2017-05-02 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992955#comment-15992955
 ] 

Wenchen Fan commented on SPARK-20548:
-

I'll re-enable this test after https://github.com/apache/spark/pull/17833 is 
merged

> Flaky Test:  ReplSuite.newProductSeqEncoder with REPL defined class
> ---
>
> Key: SPARK-20548
> URL: https://issues.apache.org/jira/browse/SPARK-20548
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
> Fix For: 2.2.0
>
>
> {{newProductSeqEncoder with REPL defined class}} in {{ReplSuite}} has been 
> failing in-deterministically : https://spark-tests.appspot.com/failed-tests 
> over the last few days.
> https://spark.test.databricks.com/job/spark-master-test-sbt-hadoop-2.7/176/testReport/junit/org.apache.spark.repl/ReplSuite/newProductSeqEncoder_with_REPL_defined_class/history/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20555) Incorrect handling of Oracle's decimal types via JDBC

2017-05-02 Thread Gabor Feher (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Feher updated SPARK-20555:

Description: 
When querying an Oracle database, Spark maps some Oracle numeric data types to 
incorrect Catalyst data types:
1. DECIMAL(1) becomes BooleanType
In Oracle, a DECIMAL(1) can have values from -9 to 9.
In Spark now, values larger than 1 become the boolean value true.
2. DECIMAL(3,2) becomes IntegerType
In Oracle, a DECIMAL(3,2) can have values like 1.23
In Spark now, digits after the decimal point are dropped.
3. DECIMAL(10) becomes IntegerType
In Oracle, a DECIMAL(10) can have the value 9999999999 (ten nines), which is 
more than 2^31
Spark throws an exception: "java.sql.SQLException: Numeric Overflow"

I think the best solution is to always keep Oracle's decimal types. (In theory 
we could introduce a FloatType in some case of #2, and fix #3 by only 
introducing IntegerType for DECIMAL(9). But in my opinion, that would end up 
complicated and error-prone.)

Note: I think the above problems were introduced as part of  
https://github.com/apache/spark/pull/14377
The main purpose of that PR seems to be converting Spark types to correct 
Oracle types, and that part seems good to me. But it also adds the inverse 
conversions. As it turns out in the above examples, that is not possible.

  was:
When querying an Oracle database, Spark maps some Oracle numeric data types to 
incorrect Catalyst data types:
1. DECIMAL(1) becomes BooleanType
In Oracle, a DECIMAL(1) can have values from -9 to 9.
In Spark now, values larger than 1 become the boolean value true.
2. DECIMAL(3,2) becomes IntegerType
In Oracle, a DECIMAL(3,2) can have values like 1.23
In Spark now, digits after the decimal point are dropped.
3. DECIMAL(10) becomes IntegerType
In Oracle, a DECIMAL(10) can have the value 9999999999 (ten nines), which is 
more than 2^31
Spark throws an exception: "java.sql.SQLException: Numeric Overflow"

I think the best solution is to always keep Oracle's decimal types. (In theory 
we could introduce a FloatType in some case of #2, and fix #3 by only 
introducing IntegerType for DECIMAL(9). But in my opinion, that would end up 
complicated and error-prone.)

Note: I think the above problems were introduced as part of  
https://github.com/apache/spark/pull/14377/files
The main purpose of that PR seems to be converting Spark types to correct 
Oracle types, and that part seems good to me. But it also adds the inverse 
conversions. As it turns out in the above examples, that is not possible.


> Incorrect handling of Oracle's decimal types via JDBC
> -
>
> Key: SPARK-20555
> URL: https://issues.apache.org/jira/browse/SPARK-20555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Gabor Feher
>
> When querying an Oracle database, Spark maps some Oracle numeric data types 
> to incorrect Catalyst data types:
> 1. DECIMAL(1) becomes BooleanType
> In Oracle, a DECIMAL(1) can have values from -9 to 9.
> In Spark now, values larger than 1 become the boolean value true.
> 2. DECIMAL(3,2) becomes IntegerType
> In Oracle, a DECIMAL(3,2) can have values like 1.23
> In Spark now, digits after the decimal point are dropped.
> 3. DECIMAL(10) becomes IntegerType
> In Oracle, a DECIMAL(10) can have the value 9999999999 (ten nines), which is 
> more than 2^31
> Spark throws an exception: "java.sql.SQLException: Numeric Overflow"
> I think the best solution is to always keep Oracle's decimal types. (In 
> theory we could introduce a FloatType in some case of #2, and fix #3 by only 
> introducing IntegerType for DECIMAL(9). But in my opinion, that would end up 
> complicated and error-prone.)
> Note: I think the above problems were introduced as part of  
> https://github.com/apache/spark/pull/14377
> The main purpose of that PR seems to be converting Spark types to correct 
> Oracle types, and that part seems good to me. But it also adds the inverse 
> conversions. As it turns out in the above examples, that is not possible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20555) Incorrect handling of Oracle's decimal types via JDBC

2017-05-02 Thread Gabor Feher (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Feher updated SPARK-20555:

Description: 
When querying an Oracle database, Spark maps some Oracle numeric data types to 
incorrect Catalyst data types:
1. DECIMAL(1) becomes BooleanType
In Oracle, a DECIMAL(1) can have values from -9 to 9.
In Spark now, values larger than 1 become the boolean value true.
2. DECIMAL(3,2) becomes IntegerType
In Oracle, a DECIMAL(3,2) can have values like 1.23
In Spark now, digits after the decimal point are dropped.
3. DECIMAL(10) becomes IntegerType
In Oracle, a DECIMAL(10) can have the value 9999999999 (ten nines), which is 
more than 2^31
Spark throws an exception: "java.sql.SQLException: Numeric Overflow"

I think the best solution is to always keep Oracle's decimal types. (In theory 
we could introduce a FloatType in some case of #2, and fix #3 by only 
introducing IntegerType for DECIMAL(9). But in my opinion, that would end up 
complicated and error-prone.)

Note: I think the above problems were introduced as part of  
https://github.com/apache/spark/pull/14377/files
The main purpose of that PR seems to be converting Spark types to correct 
Oracle types, and that part seems good to me. But it also adds the inverse 
conversions. As it turns out in the above examples, that is not possible.

  was:
When querying an Oracle database, Spark maps some Oracle numeric data types to 
incorrect Catalyst data types:
1. DECIMAL(1) becomes BooleanType
In Oracle, a DECIMAL(1) can have values from -9 to 9.
In Spark now, values larger than 1 become the boolean value true.
2. DECIMAL(3,2) becomes IntegerType
In Oracle, a DECIMAL(3,2) can have values like 1.23
In Spark now, digits after the decimal point are dropped.
3. DECIMAL(10) becomes IntegerType
In Oracle, a DECIMAL(10) can have the value 9999999999 (ten nines), which is 
more than 2^31
Spark throws an exception: "java.sql.SQLException: Numeric Overflow"

I think the best solution is to always keep Oracle's decimal types. (In theory 
we could introduce a FloatType in some case of #2, and fix #3 by only 
introducing IntegerType for DECIMAL(9). But in my opinion, that would end up 
complicated and error-prone.)

Note: I think the above problems were introduced as part of  
The main purpose of that PR seems to be converting Spark types to correct 
Oracle types, and that part seems good to me. But it also adds the inverse 
conversions. As it turns out in the above examples, that is not possible.


> Incorrect handling of Oracle's decimal types via JDBC
> -
>
> Key: SPARK-20555
> URL: https://issues.apache.org/jira/browse/SPARK-20555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Gabor Feher
>
> When querying an Oracle database, Spark maps some Oracle numeric data types 
> to incorrect Catalyst data types:
> 1. DECIMAL(1) becomes BooleanType
> In Oracle, a DECIMAL(1) can have values from -9 to 9.
> In Spark now, values larger than 1 become the boolean value true.
> 2. DECIMAL(3,2) becomes IntegerType
> In Oracle, a DECIMAL(3,2) can have values like 1.23
> In Spark now, digits after the decimal point are dropped.
> 3. DECIMAL(10) becomes IntegerType
> In Oracle, a DECIMAL(10) can have the value 9999999999 (ten nines), which is 
> more than 2^31
> Spark throws an exception: "java.sql.SQLException: Numeric Overflow"
> I think the best solution is to always keep Oracle's decimal types. (In 
> theory we could introduce a FloatType in some case of #2, and fix #3 by only 
> introducing IntegerType for DECIMAL(9). But in my opinion, that would end up 
> complicated and error-prone.)
> Note: I think the above problems were introduced as part of  
> https://github.com/apache/spark/pull/14377/files
> The main purpose of that PR seems to be converting Spark types to correct 
> Oracle types, and that part seems good to me. But it also adds the inverse 
> conversions. As it turns out in the above examples, that is not possible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20558) clear InheritableThreadLocal variables in SparkContext when stopping it

2017-05-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20558:


Assignee: Apache Spark  (was: Wenchen Fan)

> clear InheritableThreadLocal variables in SparkContext when stopping it
> ---
>
> Key: SPARK-20558
> URL: https://issues.apache.org/jira/browse/SPARK-20558
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2, 2.1.0, 2.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20558) clear InheritableThreadLocal variables in SparkContext when stopping it

2017-05-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992941#comment-15992941
 ] 

Apache Spark commented on SPARK-20558:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/17833

> clear InheritableThreadLocal variables in SparkContext when stopping it
> ---
>
> Key: SPARK-20558
> URL: https://issues.apache.org/jira/browse/SPARK-20558
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2, 2.1.0, 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20558) clear InheritableThreadLocal variables in SparkContext when stopping it

2017-05-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20558:


Assignee: Wenchen Fan  (was: Apache Spark)

> clear InheritableThreadLocal variables in SparkContext when stopping it
> ---
>
> Key: SPARK-20558
> URL: https://issues.apache.org/jira/browse/SPARK-20558
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2, 2.1.0, 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20557) JdbcUtils doesn't support java.sql.Types.TIMESTAMP_WITH_TIMEZONE

2017-05-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992922#comment-15992922
 ] 

Apache Spark commented on SPARK-20557:
--

User 'JannikArndt' has created a pull request for this issue:
https://github.com/apache/spark/pull/17832

> JdbcUtils doesn't support java.sql.Types.TIMESTAMP_WITH_TIMEZONE
> 
>
> Key: SPARK-20557
> URL: https://issues.apache.org/jira/browse/SPARK-20557
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.3.0
>Reporter: Jannik Arndt
>  Labels: easyfix, jdbc, oracle, sql, timestamp
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Reading from an Oracle DB table with a column of type TIMESTAMP WITH TIME 
> ZONE via jdbc ({{spark.sqlContext.read.format("jdbc").option(...).load()}}) 
> results in an error:
> {{Unsupported type -101}}
> {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$getCatalystType(JdbcUtils.scala:209)}}
> {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$5.apply(JdbcUtils.scala:246)}}
> {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$5.apply(JdbcUtils.scala:246)}}
> That is because the type 
> {{[java.sql.Types.TIMESTAMP_WITH_TIMEZONE|https://docs.oracle.com/javase/8/docs/api/java/sql/Types.html#TIMESTAMP_WITH_TIMEZONE]}}
>  (in Java since 1.8) is missing in 
> {{[JdbcUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L225]}}
>  
> This is similar to SPARK-7039.
> I created a pull request with a fix.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20557) JdbcUtils doesn't support java.sql.Types.TIMESTAMP_WITH_TIMEZONE

2017-05-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20557:


Assignee: (was: Apache Spark)

> JdbcUtils doesn't support java.sql.Types.TIMESTAMP_WITH_TIMEZONE
> 
>
> Key: SPARK-20557
> URL: https://issues.apache.org/jira/browse/SPARK-20557
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.3.0
>Reporter: Jannik Arndt
>  Labels: easyfix, jdbc, oracle, sql, timestamp
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Reading from an Oracle DB table with a column of type TIMESTAMP WITH TIME 
> ZONE via jdbc ({{spark.sqlContext.read.format("jdbc").option(...).load()}}) 
> results in an error:
> {{Unsupported type -101}}
> {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$getCatalystType(JdbcUtils.scala:209)}}
> {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$5.apply(JdbcUtils.scala:246)}}
> {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$5.apply(JdbcUtils.scala:246)}}
> That is because the type 
> {{[java.sql.Types.TIMESTAMP_WITH_TIMEZONE|https://docs.oracle.com/javase/8/docs/api/java/sql/Types.html#TIMESTAMP_WITH_TIMEZONE]}}
>  (in Java since 1.8) is missing in 
> {{[JdbcUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L225]}}
>  
> This is similar to SPARK-7039.
> I created a pull request with a fix.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20557) JdbcUtils doesn't support java.sql.Types.TIMESTAMP_WITH_TIMEZONE

2017-05-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20557:


Assignee: Apache Spark

> JdbcUtils doesn't support java.sql.Types.TIMESTAMP_WITH_TIMEZONE
> 
>
> Key: SPARK-20557
> URL: https://issues.apache.org/jira/browse/SPARK-20557
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.3.0
>Reporter: Jannik Arndt
>Assignee: Apache Spark
>  Labels: easyfix, jdbc, oracle, sql, timestamp
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Reading from an Oracle DB table with a column of type TIMESTAMP WITH TIME 
> ZONE via jdbc ({{spark.sqlContext.read.format("jdbc").option(...).load()}}) 
> results in an error:
> {{Unsupported type -101}}
> {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$getCatalystType(JdbcUtils.scala:209)}}
> {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$5.apply(JdbcUtils.scala:246)}}
> {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$5.apply(JdbcUtils.scala:246)}}
> That is because the type 
> {{[java.sql.Types.TIMESTAMP_WITH_TIMEZONE|https://docs.oracle.com/javase/8/docs/api/java/sql/Types.html#TIMESTAMP_WITH_TIMEZONE]}}
>  (in Java since 1.8) is missing in 
> {{[JdbcUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L225]}}
>  
> This is similar to SPARK-7039.
> I created a pull request with a fix.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20558) clear InheritableThreadLocal variables in SparkContext when stopping it

2017-05-02 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-20558:
---

 Summary: clear InheritableThreadLocal variables in SparkContext 
when stopping it
 Key: SPARK-20558
 URL: https://issues.apache.org/jira/browse/SPARK-20558
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0, 2.0.2, 2.2.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20546) spark-class gets syntax error in posix mode

2017-05-02 Thread Jessie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992886#comment-15992886
 ] 

Jessie Yu commented on SPARK-20546:
---

Given the current code relies on posix mode being off, turning it off 
explicitly shouldn't affect other behavior, especially since the change is 
confined to just the spark-class subshell. 

> spark-class gets syntax error in posix mode
> ---
>
> Key: SPARK-20546
> URL: https://issues.apache.org/jira/browse/SPARK-20546
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.0.2
>Reporter: Jessie Yu
>Priority: Minor
>
> spark-class gets the following error when running in posix mode:
> {code}
> spark-class: line 78: syntax error near unexpected token `<'
> spark-class: line 78: `done < <(build_command "$@")'
> {code}
> \\
> It appears to be complaining about the process substitution: 
> {code}
> CMD=()
> while IFS= read -d '' -r ARG; do
>   CMD+=("$ARG")
> done < <(build_command "$@")
> {code}
> \\
> This can be reproduced by first turning on allexport then posix mode:
> {code}set -a -o posix {code}
> and then running something like spark-shell, which calls spark-class.
> \\
> The simplest fix is probably to always turn off posix mode in spark-class 
> before the while loop.
> \\
> This was previously reported in 
> [SPARK-8417|https://issues.apache.org/jira/browse/SPARK-8417], which was 
> closed as "cannot reproduce". 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20557) JdbcUtils doesn't support java.sql.Types.TIMESTAMP_WITH_TIMEZONE

2017-05-02 Thread Jannik Arndt (JIRA)
Jannik Arndt created SPARK-20557:


 Summary: JdbcUtils doesn't support 
java.sql.Types.TIMESTAMP_WITH_TIMEZONE
 Key: SPARK-20557
 URL: https://issues.apache.org/jira/browse/SPARK-20557
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0, 2.3.0
Reporter: Jannik Arndt


Reading from an Oracle DB table with a column of type TIMESTAMP WITH TIME ZONE 
via jdbc ({{spark.sqlContext.read.format("jdbc").option(...).load()}}) results 
in an error:

{{Unsupported type -101}}
{{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$getCatalystType(JdbcUtils.scala:209)}}
{{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$5.apply(JdbcUtils.scala:246)}}
{{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$5.apply(JdbcUtils.scala:246)}}

That is because the type 
{{[java.sql.Types.TIMESTAMP_WITH_TIMEZONE|https://docs.oracle.com/javase/8/docs/api/java/sql/Types.html#TIMESTAMP_WITH_TIMEZONE]}}
 (in Java since 1.8) is missing in 
{{[JdbcUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L225]}}
 

This is similar to SPARK-7039.

I created a pull request with a fix.
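As a rough sketch (not the actual patch), the missing piece is a mapping from 
the JDBC type code to Catalyst's TimestampType, for example:

{code}
// Sketch only: map java.sql.Types.TIMESTAMP_WITH_TIMEZONE (added in Java 8) to
// TimestampType. The "-101" in the error above is Oracle's vendor-specific code
// for the same type, which a dialect would also need to handle.
import java.sql.Types
import org.apache.spark.sql.types._

def toCatalystType(sqlType: Int): Option[DataType] = sqlType match {
  case Types.TIMESTAMP | Types.TIMESTAMP_WITH_TIMEZONE => Some(TimestampType)
  case Types.VARCHAR => Some(StringType)
  case _ => None
}
{code}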



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20395) Upgrade to Scala 2.11.11

2017-05-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992814#comment-15992814
 ] 

Sean Owen commented on SPARK-20395:
---

[~jeremyrsmith] I tried updating to Scala 2.11.11 today and it worked fine, 
including generating docs, with unidoc + genjavadoc 0.10. What issue should I 
be looking for? If there is none, this could go into master for 2.3.

> Upgrade to Scala 2.11.11
> 
>
> Key: SPARK-20395
> URL: https://issues.apache.org/jira/browse/SPARK-20395
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Spark Core
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Jeremy Smith
>Priority: Minor
>  Labels: dependencies, scala
>
> Update Scala to 2.11.11, which was released yesterday:
> https://github.com/scala/scala/releases/tag/v2.11.11
> Since it's a patch version upgrade and binary compatibility is guaranteed, 
> impact should be minimal.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20556) codehaus fails to generate code because of unescaped strings

2017-05-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992813#comment-15992813
 ] 

Sean Owen commented on SPARK-20556:
---

Do you have a reproduction?

> codehaus fails to generate code because of unescaped strings
> 
>
> Key: SPARK-20556
> URL: https://issues.apache.org/jira/browse/SPARK-20556
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.1.0
>Reporter: Volodymyr Lyubinets
>
> I guess somewhere along the way Spark uses codehaus to generate optimized 
> code, but if it fails to do so, it falls back to an alternative way. Here's a 
> log string that I see when executing one command on dataframes:
> 17/05/02 12:00:14 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 93, Column 13: ')' expected instead of 'type'
> ...
> /* 088 */ private double loadFactor = 0.5;
> /* 089 */ private int numBuckets = (int) (capacity / loadFactor);
> /* 090 */ private int maxSteps = 2;
> /* 091 */ private int numRows = 0;
> /* 092 */ private org.apache.spark.sql.types.StructType keySchema = new 
> org.apache.spark.sql.types.StructType().add("taxonomyPayload", 
> org.apache.spark.sql.types.DataTypes.StringType)
> /* 093 */ .add("{"type":"nontemporal"}", 
> org.apache.spark.sql.types.DataTypes.StringType)
> /* 094 */ .add("spatialPayload", 
> org.apache.spark.sql.types.DataTypes.StringType);
> /* 095 */ private org.apache.spark.sql.types.StructType valueSchema = new 
> org.apache.spark.sql.types.StructType().add("sum", 
> org.apache.spark.sql.types.DataTypes.DoubleType);
> /* 096 */ private Object emptyVBase;
> /* 097 */ private long emptyVOff;
> /* 098 */ private int emptyVLen;
> /* 099 */ private boolean isBatchFull = false;
> /* 100 */
> It looks like on line 93 it failed to escape that string (which happened to be 
> in my code). I'm not sure how critical this is, but it seems like there is 
> escaping missing somewhere.
> Stack trace that happens afterwards: https://pastebin.com/NmgTfwN0



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18777) Return UDF objects when registering from Python

2017-05-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18777:


Assignee: (was: Apache Spark)

> Return UDF objects when registering from Python
> ---
>
> Key: SPARK-18777
> URL: https://issues.apache.org/jira/browse/SPARK-18777
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: holdenk
>
> In Scala, registering a UDF gives you back a UDF object that you can use in 
> the Dataset/DataFrame API as well as in SQL expressions. We can do 
> the same in Python, for both Python UDFs and Java UDFs registered from Python.
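For context, this is roughly what the existing Scala behaviour looks like (a 
sketch assuming a spark-shell session; the UDF name is illustrative):

{code}
// register returns a UserDefinedFunction usable both from SQL and from the
// DataFrame/Dataset API.
import spark.implicits._

val squared = spark.udf.register("squared", (x: Long) => x * x)

spark.range(4).select(squared($"id")).show()          // DataFrame API

spark.range(4).createOrReplaceTempView("nums")
spark.sql("SELECT squared(id) FROM nums").show()      // SQL
{code}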



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18777) Return UDF objects when registering from Python

2017-05-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18777:


Assignee: Apache Spark

> Return UDF objects when registering from Python
> ---
>
> Key: SPARK-18777
> URL: https://issues.apache.org/jira/browse/SPARK-18777
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: holdenk
>Assignee: Apache Spark
>
> In Scala, registering a UDF gives you back a UDF object that you can use in 
> the Dataset/DataFrame API as well as in SQL expressions. We can do 
> the same in Python, for both Python UDFs and Java UDFs registered from Python.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18777) Return UDF objects when registering from Python

2017-05-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992812#comment-15992812
 ] 

Apache Spark commented on SPARK-18777:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/17831

> Return UDF objects when registering from Python
> ---
>
> Key: SPARK-18777
> URL: https://issues.apache.org/jira/browse/SPARK-18777
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: holdenk
>
> In Scala, registering a UDF gives you back a UDF object that you can use in 
> the Dataset/DataFrame API as well as in SQL expressions. We can do 
> the same in Python, for both Python UDFs and Java UDFs registered from Python.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20556) codehaus fails to generate code because of unescaped strings

2017-05-02 Thread Volodymyr Lyubinets (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Volodymyr Lyubinets updated SPARK-20556:

Description: 
I guess somewhere along the way Spark uses codehaus to generate optimized code, 
but if it fails to do so, it falls back to an alternative way. Here's a log 
string that I see when executing one command on dataframes:

17/05/02 12:00:14 ERROR CodeGenerator: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 93, 
Column 13: ')' expected instead of 'type'

...

/* 088 */ private double loadFactor = 0.5;
/* 089 */ private int numBuckets = (int) (capacity / loadFactor);
/* 090 */ private int maxSteps = 2;
/* 091 */ private int numRows = 0;
/* 092 */ private org.apache.spark.sql.types.StructType keySchema = new 
org.apache.spark.sql.types.StructType().add("taxonomyPayload", 
org.apache.spark.sql.types.DataTypes.StringType)
/* 093 */ .add("{"type":"nontemporal"}", 
org.apache.spark.sql.types.DataTypes.StringType)
/* 094 */ .add("spatialPayload", 
org.apache.spark.sql.types.DataTypes.StringType);
/* 095 */ private org.apache.spark.sql.types.StructType valueSchema = new 
org.apache.spark.sql.types.StructType().add("sum", 
org.apache.spark.sql.types.DataTypes.DoubleType);
/* 096 */ private Object emptyVBase;
/* 097 */ private long emptyVOff;
/* 098 */ private int emptyVLen;
/* 099 */ private boolean isBatchFull = false;
/* 100 */

It looks like on line 93 it failed to escape that string (which happened to be 
in my code). I'm not sure how critical this is, but it seems like there is 
escaping missing somewhere.

Stack trace that happens afterwards: https://pastebin.com/NmgTfwN0

  was:
I guess somewhere along the way Spark uses codehaus to generate optimized code, 
but if it fails to do so, it falls back to an alternative way. Here's a log 
string that I see when executing one command on dataframes:

17/05/02 12:00:14 ERROR CodeGenerator: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 93, 
Column 13: ')' expected instead of 'type'

...

/* 088 */ private double loadFactor = 0.5;
/* 089 */ private int numBuckets = (int) (capacity / loadFactor);
/* 090 */ private int maxSteps = 2;
/* 091 */ private int numRows = 0;
/* 092 */ private org.apache.spark.sql.types.StructType keySchema = new 
org.apache.spark.sql.types.StructType().add("taxonomyPayload", 
org.apache.spark.sql.types.DataTypes.StringType)
/* 093 */ .add("{"type":"nontemporal"}", 
org.apache.spark.sql.types.DataTypes.StringType)
/* 094 */ .add("spatialPayload", 
org.apache.spark.sql.types.DataTypes.StringType);
/* 095 */ private org.apache.spark.sql.types.StructType valueSchema = new 
org.apache.spark.sql.types.StructType().add("sum", 
org.apache.spark.sql.types.DataTypes.DoubleType);
/* 096 */ private Object emptyVBase;
/* 097 */ private long emptyVOff;
/* 098 */ private int emptyVLen;
/* 099 */ private boolean isBatchFull = false;
/* 100 */

It looks like on line 93 it failed to escape that string (which happened to be 
in my code). I'm not sure how critical this is, but it seems like there is 
escaping missing somewhere.


> codehaus fails to generate code because of unescaped strings
> 
>
> Key: SPARK-20556
> URL: https://issues.apache.org/jira/browse/SPARK-20556
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.1.0
>Reporter: Volodymyr Lyubinets
>
> I guess somewhere along the way Spark uses codehaus to generate optimized 
> code, but if it fails to do so, it falls back to an alternative way. Here's a 
> log string that I see when executing one command on dataframes:
> 17/05/02 12:00:14 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 93, Column 13: ')' expected instead of 'type'
> ...
> /* 088 */ private double loadFactor = 0.5;
> /* 089 */ private int numBuckets = (int) (capacity / loadFactor);
> /* 090 */ private int maxSteps = 2;
> /* 091 */ private int numRows = 0;
> /* 092 */ private org.apache.spark.sql.types.StructType keySchema = new 
> org.apache.spark.sql.types.StructType().add("taxonomyPayload", 
> org.apache.spark.sql.types.DataTypes.StringType)
> /* 093 */ .add("{"type":"nontemporal"}", 
> org.apache.spark.sql.types.DataTypes.StringType)
> /* 094 */ .add("spatialPayload", 
> org.apache.spark.sql.types.DataTypes.StringType);
> /* 095 */ private org.apache.spark.sql.types.StructType valueSchema = new 
> org.apache.spark.sql.types.StructType().add("sum", 
> org.apache.spark.sql.types.DataTypes.DoubleType);
> /* 096 */ private Object emptyVBase;
> /* 097 */ private 

[jira] [Created] (SPARK-20556) codehaus fails to generate code because of unescaped strings

2017-05-02 Thread Volodymyr Lyubinets (JIRA)
Volodymyr Lyubinets created SPARK-20556:
---

 Summary: codehaus fails to generate code because of unescaped 
strings
 Key: SPARK-20556
 URL: https://issues.apache.org/jira/browse/SPARK-20556
 Project: Spark
  Issue Type: Bug
  Components: Optimizer
Affects Versions: 2.1.0
Reporter: Volodymyr Lyubinets


I guess somewhere along the way Spark uses codehaus to generate optimized code, 
but if it fails to do so, it falls back to an alternative way. Here's a log 
string that I see when executing one command on dataframes:

17/05/02 12:00:14 ERROR CodeGenerator: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 93, 
Column 13: ')' expected instead of 'type'

...

/* 088 */ private double loadFactor = 0.5;
/* 089 */ private int numBuckets = (int) (capacity / loadFactor);
/* 090 */ private int maxSteps = 2;
/* 091 */ private int numRows = 0;
/* 092 */ private org.apache.spark.sql.types.StructType keySchema = new 
org.apache.spark.sql.types.StructType().add("taxonomyPayload", 
org.apache.spark.sql.types.DataTypes.StringType)
/* 093 */ .add("{"type":"nontemporal"}", 
org.apache.spark.sql.types.DataTypes.StringType)
/* 094 */ .add("spatialPayload", 
org.apache.spark.sql.types.DataTypes.StringType);
/* 095 */ private org.apache.spark.sql.types.StructType valueSchema = new 
org.apache.spark.sql.types.StructType().add("sum", 
org.apache.spark.sql.types.DataTypes.DoubleType);
/* 096 */ private Object emptyVBase;
/* 097 */ private long emptyVOff;
/* 098 */ private int emptyVLen;
/* 099 */ private boolean isBatchFull = false;
/* 100 */

It looks like on line 93 it failed to escape that string (which happened to be 
in my code). I'm not sure how critical this is, but it seems like there is 
escaping missing somewhere.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3528) Reading data from file:/// should be called NODE_LOCAL not PROCESS_LOCAL

2017-05-02 Thread Ramgopal N (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992745#comment-15992745
 ] 

Ramgopal N commented on SPARK-3528:
---

I have Spark running on Mesos. 
Mesos agents are running on node1, node2, and node3, and datanodes on node4, 
node5, and node6.
I see 3 executors running, one on each of the 3 Mesos agents, so in my case 
PROCESS_LOCAL and NODE_LOCAL are the same, I believe.
Basically, I am trying to check Spark SQL performance when there is no data 
locality.

When I execute Spark SQL, all the tasks show up as PROCESS_LOCAL.
What is the importance of "spark.locality.wait.process" for Spark on Mesos? Is 
this configuration applicable to standalone Spark?
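For reference, the locality wait can be set per level when building the session 
(a generic sketch with illustrative values, not specific to the Mesos setup 
above):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("locality-waits")
  .config("spark.locality.wait", "3s")          // base wait before relaxing a level
  .config("spark.locality.wait.process", "3s")  // PROCESS_LOCAL -> NODE_LOCAL
  .config("spark.locality.wait.node", "3s")     // NODE_LOCAL -> RACK_LOCAL
  .config("spark.locality.wait.rack", "3s")     // RACK_LOCAL -> ANY
  .getOrCreate()
{code}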

> Reading data from file:/// should be called NODE_LOCAL not PROCESS_LOCAL
> 
>
> Key: SPARK-3528
> URL: https://issues.apache.org/jira/browse/SPARK-3528
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Ash
>Priority: Critical
>
> Note that reading from {{file:///.../pom.xml}} is called a PROCESS_LOCAL task
> {noformat}
> spark> sc.textFile("pom.xml").count
> ...
> 14/09/15 00:59:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
> localhost, PROCESS_LOCAL, 1191 bytes)
> 14/09/15 00:59:13 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 
> localhost, PROCESS_LOCAL, 1191 bytes)
> 14/09/15 00:59:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
> 14/09/15 00:59:13 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
> 14/09/15 00:59:13 INFO HadoopRDD: Input split: 
> file:/Users/aash/git/spark/pom.xml:20862+20863
> 14/09/15 00:59:13 INFO HadoopRDD: Input split: 
> file:/Users/aash/git/spark/pom.xml:0+20862
> {noformat}
> There is an outstanding TODO in {{HadoopRDD.scala}} that may be related:
> {noformat}
>   override def getPreferredLocations(split: Partition): Seq[String] = {
> // TODO: Filtering out "localhost" in case of file:// URLs
> val hadoopSplit = split.asInstanceOf[HadoopPartition]
> hadoopSplit.inputSplit.value.getLocations.filter(_ != "localhost")
>   }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19019) PySpark does not work with Python 3.6.0

2017-05-02 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992699#comment-15992699
 ] 

Hyukjin Kwon commented on SPARK-19019:
--

To solve this problem fully, I had to port the cloudpickle change too in the PR. 
Only fixing the hijacked one described above does not fully solve this issue. 
Please refer to the discussion in the PR and the change.

> PySpark does not work with Python 3.6.0
> ---
>
> Key: SPARK-19019
> URL: https://issues.apache.org/jira/browse/SPARK-19019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
> Fix For: 1.6.4, 2.0.3, 2.1.1, 2.2.0
>
>
> Currently, PySpark does not work with Python 3.6.0.
> Running {{./bin/pyspark}} simply throws the error as below:
> {code}
> Traceback (most recent call last):
>   File ".../spark/python/pyspark/shell.py", line 30, in 
> import pyspark
>   File ".../spark/python/pyspark/__init__.py", line 46, in 
> from pyspark.context import SparkContext
>   File ".../spark/python/pyspark/context.py", line 36, in 
> from pyspark.java_gateway import launch_gateway
>   File ".../spark/python/pyspark/java_gateway.py", line 31, in 
> from py4j.java_gateway import java_import, JavaGateway, GatewayClient
>   File "", line 961, in _find_and_load
>   File "", line 950, in _find_and_load_unlocked
>   File "", line 646, in _load_unlocked
>   File "", line 616, in _load_backward_compatible
>   File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 
> 18, in 
>   File 
> "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pydoc.py",
>  line 62, in 
> import pkgutil
>   File 
> "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pkgutil.py",
>  line 22, in 
> ModuleInfo = namedtuple('ModuleInfo', 'module_finder name ispkg')
>   File ".../spark/python/pyspark/serializers.py", line 394, in namedtuple
> cls = _old_namedtuple(*args, **kwargs)
> TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 
> 'rename', and 'module'
> {code}
> The problem is in 
> https://github.com/apache/spark/blob/3c68944b229aaaeeaee3efcbae3e3be9a2914855/python/pyspark/serializers.py#L386-L394
>  as the error says, and the cause seems to be that the arguments of 
> {{namedtuple}} are now completely keyword-only from Python 3.6.0 
> (See https://bugs.python.org/issue25628).
> We currently copy this function via {{types.FunctionType}} which does not set 
> the default values of keyword-only arguments (meaning 
> {{namedtuple.__kwdefaults__}}), and this seems to cause internally missing 
> values in the function (non-bound arguments).
> This ends up as below:
> {code}
> import types
> import collections
> def _copy_func(f):
> return types.FunctionType(f.__code__, f.__globals__, f.__name__,
> f.__defaults__, f.__closure__)
> _old_namedtuple = _copy_func(collections.namedtuple)
> _old_namedtuple("a", "b")
> {code}
> If we call as below:
> {code}
> >>> _old_namedtuple("a", "b")
> Traceback (most recent call last):
>   File "", line 1, in 
> TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 
> 'rename', and 'module'
> {code}
> It throws an exception as above because {{__kwdefaults__}} for required 
> keyword arguments seems unset in the copied function. So, if we give explicit 
> values for these,
> {code}
> >>> _old_namedtuple("a", "b", verbose=False, rename=False, module=None)
> 
> {code}
> It works fine.
> It seems now we should properly set these into the hijacked one.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19019) PySpark does not work with Python 3.6.0

2017-05-02 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-19019:
-
Fix Version/s: 2.0.3
   1.6.4

> PySpark does not work with Python 3.6.0
> ---
>
> Key: SPARK-19019
> URL: https://issues.apache.org/jira/browse/SPARK-19019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
> Fix For: 1.6.4, 2.0.3, 2.1.1, 2.2.0
>
>
> Currently, PySpark does not work with Python 3.6.0.
> Running {{./bin/pyspark}} simply throws the error as below:
> {code}
> Traceback (most recent call last):
>   File ".../spark/python/pyspark/shell.py", line 30, in 
> import pyspark
>   File ".../spark/python/pyspark/__init__.py", line 46, in 
> from pyspark.context import SparkContext
>   File ".../spark/python/pyspark/context.py", line 36, in 
> from pyspark.java_gateway import launch_gateway
>   File ".../spark/python/pyspark/java_gateway.py", line 31, in 
> from py4j.java_gateway import java_import, JavaGateway, GatewayClient
>   File "", line 961, in _find_and_load
>   File "", line 950, in _find_and_load_unlocked
>   File "", line 646, in _load_unlocked
>   File "", line 616, in _load_backward_compatible
>   File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 
> 18, in 
>   File 
> "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pydoc.py",
>  line 62, in 
> import pkgutil
>   File 
> "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pkgutil.py",
>  line 22, in 
> ModuleInfo = namedtuple('ModuleInfo', 'module_finder name ispkg')
>   File ".../spark/python/pyspark/serializers.py", line 394, in namedtuple
> cls = _old_namedtuple(*args, **kwargs)
> TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 
> 'rename', and 'module'
> {code}
> The problem is in 
> https://github.com/apache/spark/blob/3c68944b229aaaeeaee3efcbae3e3be9a2914855/python/pyspark/serializers.py#L386-L394
>  as the error says, and the cause seems to be that the arguments of 
> {{namedtuple}} are now completely keyword-only from Python 3.6.0 
> (See https://bugs.python.org/issue25628).
> We currently copy this function via {{types.FunctionType}} which does not set 
> the default values of keyword-only arguments (meaning 
> {{namedtuple.__kwdefaults__}}), and this seems to cause internally missing 
> values in the function (non-bound arguments).
> This ends up as below:
> {code}
> import types
> import collections
> def _copy_func(f):
> return types.FunctionType(f.__code__, f.__globals__, f.__name__,
> f.__defaults__, f.__closure__)
> _old_namedtuple = _copy_func(collections.namedtuple)
> _old_namedtuple("a", "b")
> {code}
> If we call as below:
> {code}
> >>> _old_namedtuple("a", "b")
> Traceback (most recent call last):
>   File "", line 1, in 
> TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 
> 'rename', and 'module'
> {code}
> It throws an exception as above because {{__kwdefaults__}} for required 
> keyword arguments seems unset in the copied function. So, if we give explicit 
> values for these,
> {code}
> >>> _old_namedtuple("a", "b", verbose=False, rename=False, module=None)
> 
> {code}
> It works fine.
> It seems now we should properly set these into the hijacked one.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20555) Incorrect handling of Oracle's decimal types via JDBC

2017-05-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20555:


Assignee: (was: Apache Spark)

> Incorrect handling of Oracle's decimal types via JDBC
> -
>
> Key: SPARK-20555
> URL: https://issues.apache.org/jira/browse/SPARK-20555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Gabor Feher
>
> When querying an Oracle database, Spark maps some Oracle numeric data types 
> to incorrect Catalyst data types:
> 1. DECIMAL(1) becomes BooleanType
> In Oracle, a DECIMAL(1) can have values from -9 to 9.
> In Spark now, values larger than 1 become the boolean value true.
> 2. DECIMAL(3,2) becomes IntegerType
> In Oracle, a DECIMAL(3,2) can have values like 1.23.
> In Spark now, digits after the decimal point are dropped.
> 3. DECIMAL(10) becomes IntegerType
> In Oracle, a DECIMAL(10) can have the value 9999999999 (ten nines), which is 
> more than 2^31.
> Spark throws an exception: "java.sql.SQLException: Numeric Overflow"
> I think the best solution is to always keep Oracle's decimal types. (In 
> theory we could introduce a FloatType in some case of #2, and fix #3 by only 
> introducing IntegerType for DECIMAL(9). But in my opinion, that would end up 
> complicated and error-prone.)
> Note: I think the above problems were introduced as part of  
> The main purpose of that PR seems to be converting Spark types to correct 
> Oracle types, and that part seems good to me. But it also adds the inverse 
> conversions. As it turns out in the above examples, that is not possible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20555) Incorrect handling of Oracle's decimal types via JDBC

2017-05-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20555:


Assignee: Apache Spark

> Incorrect handling of Oracle's decimal types via JDBC
> -
>
> Key: SPARK-20555
> URL: https://issues.apache.org/jira/browse/SPARK-20555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Gabor Feher
>Assignee: Apache Spark
>
> When querying an Oracle database, Spark maps some Oracle numeric data types 
> to incorrect Catalyst data types:
> 1. DECIMAL(1) becomes BooleanType
> In Oracle, a DECIMAL(1) can have values from -9 to 9.
> In Spark now, values larger than 1 become the boolean value true.
> 2. DECIMAL(3,2) becomes IntegerType
> In Oracle, a DECIMAL(3,2) can have values like 1.23.
> In Spark now, digits after the decimal point are dropped.
> 3. DECIMAL(10) becomes IntegerType
> In Oracle, a DECIMAL(10) can have the value 9999999999 (ten nines), which is 
> more than 2^31.
> Spark throws an exception: "java.sql.SQLException: Numeric Overflow"
> I think the best solution is to always keep Oracle's decimal types. (In 
> theory we could introduce a FloatType in some case of #2, and fix #3 by only 
> introducing IntegerType for DECIMAL(9). But in my opinion, that would end up 
> complicated and error-prone.)
> Note: I think the above problems were introduced as part of  
> The main purpose of that PR seems to be converting Spark types to correct 
> Oracle types, and that part seems good to me. But it also adds the inverse 
> conversions. As it turns out in the above examples, that is not possible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20555) Incorrect handling of Oracle's decimal types via JDBC

2017-05-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992685#comment-15992685
 ] 

Apache Spark commented on SPARK-20555:
--

User 'gaborfeher' has created a pull request for this issue:
https://github.com/apache/spark/pull/17830

> Incorrect handling of Oracle's decimal types via JDBC
> -
>
> Key: SPARK-20555
> URL: https://issues.apache.org/jira/browse/SPARK-20555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Gabor Feher
>
> When querying an Oracle database, Spark maps some Oracle numeric data types 
> to incorrect Catalyst data types:
> 1. DECIMAL(1) becomes BooleanType
> In Oracle, a DECIMAL(1) can have values from -9 to 9.
> In Spark now, values larger than 1 become the boolean value true.
> 2. DECIMAL(3,2) becomes IntegerType
> In Oracle, a DECIMAL(3,2) can have values like 1.23.
> In Spark now, digits after the decimal point are dropped.
> 3. DECIMAL(10) becomes IntegerType
> In Oracle, a DECIMAL(10) can have the value 9999999999 (ten nines), which is 
> more than 2^31.
> Spark throws an exception: "java.sql.SQLException: Numeric Overflow"
> I think the best solution is to always keep Oracle's decimal types. (In 
> theory we could introduce a FloatType in some case of #2, and fix #3 by only 
> introducing IntegerType for DECIMAL(9). But in my opinion, that would end up 
> complicated and error-prone.)
> Note: I think the above problems were introduced as part of  
> The main purpose of that PR seems to be converting Spark types to correct 
> Oracle types, and that part seems good to me. But it also adds the inverse 
> conversions. As it turns out in the above examples, that is not possible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20555) Incorrect handling of Oracle's decimal types via JDBC

2017-05-02 Thread Gabor Feher (JIRA)
Gabor Feher created SPARK-20555:
---

 Summary: Incorrect handling of Oracle's decimal types via JDBC
 Key: SPARK-20555
 URL: https://issues.apache.org/jira/browse/SPARK-20555
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Gabor Feher


When querying an Oracle database, Spark maps some Oracle numeric data types to 
incorrect Catalyst data types:
1. DECIMAL(1) becomes BooleanType
In Oracle, a DECIMAL(1) can have values from -9 to 9.
In Spark now, values larger than 1 become the boolean value true.
2. DECIMAL(3,2) becomes IntegerType
In Oracle, a DECIMAL(3,2) can have values like 1.23.
In Spark now, digits after the decimal point are dropped.
3. DECIMAL(10) becomes IntegerType
In Oracle, a DECIMAL(10) can have the value 9999999999 (ten nines), which is 
more than 2^31.
Spark throws an exception: "java.sql.SQLException: Numeric Overflow"

I think the best solution is to always keep Oracle's decimal types. (In theory 
we could introduce a FloatType in some case of #2, and fix #3 by only 
introducing IntegerType for DECIMAL(9). But in my opinion, that would end up 
complicated and error-prone.)

Note: I think the above problems were introduced as part of  
The main purpose of that PR seems to be converting Spark types to correct 
Oracle types, and that part seems good to me. But it also adds the inverse 
conversions. As it turns out in the above examples, that is not possible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18891) Support for specific collection types

2017-05-02 Thread Nils Grabbert (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992672#comment-15992672
 ] 

Nils Grabbert commented on SPARK-18891:
---

The example in SPARK-19104 is still not working.

> Support for specific collection types
> -
>
> Key: SPARK-18891
> URL: https://issues.apache.org/jira/browse/SPARK-18891
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Michael Armbrust
>Priority: Critical
> Fix For: 2.2.0
>
>
> Encoders treat all collections the same (i.e. {{Seq}} vs {{List}}), which 
> forces users to define classes with only the most generic type.
> An [example 
> error|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/2398463439880241/2840265927289860/latest.html]:
> {code}
> case class SpecificCollection(aList: List[Int])
> Seq(SpecificCollection(1 :: Nil)).toDS().collect()
> {code}
> {code}
> java.lang.RuntimeException: Error while decoding: 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 98, Column 120: No applicable constructor/method found 
> for actual parameters "scala.collection.Seq"; candidates are: 
> "line29e7e4b1e36445baa3505b2e102aa86b29.$read$$iw$$iw$$iw$$iw$SpecificCollection(scala.collection.immutable.List)"
> {code}
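A minimal sketch of the usual workaround the description alludes to (declare 
the field with the generic {{Seq}} type), assuming a spark-shell session:

{code}
// Using Seq instead of List lets the built-in Seq encoder apply.
import spark.implicits._

case class GenericCollection(aList: Seq[Int])

Seq(GenericCollection(List(1, 2, 3))).toDS().collect()
{code}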



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20554) Remove usage of scala.language.reflectiveCalls

2017-05-02 Thread Umesh Chaudhary (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992614#comment-15992614
 ] 

Umesh Chaudhary commented on SPARK-20554:
-

[~srowen] I can work on this. I currently see 15 occurrences of the 
scala.language.reflectiveCalls import, and will re-evaluate the warnings after 
removing them. 

> Remove usage of scala.language.reflectiveCalls
> --
>
> Key: SPARK-20554
> URL: https://issues.apache.org/jira/browse/SPARK-20554
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Spark Core, SQL, Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Sean Owen
>Priority: Minor
>
> In several parts of the code we have imported 
> {{scala.language.reflectiveCalls}} to suppress a warning about, well, 
> reflective calls. I know from cleaning up build warnings in 2.2 that almost 
> all cases of this are inadvertent and are masking a type problem.
> Example, in HiveDDLSuite:
> {code}
> val expectedTablePath =
>   if (dbPath.isEmpty) {
> hiveContext.sessionState.catalog.defaultTablePath(tableIdentifier)
>   } else {
> new Path(new Path(dbPath.get), tableIdentifier.table)
>   }
> val filesystemPath = new Path(expectedTablePath.toString)
> {code}
> This shouldn't really work because one branch returns a URI and the other a 
> Path. In this case the code only needs an object with a toString method, and 
> the compiler makes that work via structural types and reflection.
> Obviously, though, the intent was to add ".toURI" to the second branch to make 
> both a URI!
> I think we should probably clean this up by taking out all imports of 
> reflectiveCalls, and re-evaluating all of the warnings. There may be a few 
> legit usages.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20554) Remove usage of scala.language.reflectiveCalls

2017-05-02 Thread Sean Owen (JIRA)
Sean Owen created SPARK-20554:
-

 Summary: Remove usage of scala.language.reflectiveCalls
 Key: SPARK-20554
 URL: https://issues.apache.org/jira/browse/SPARK-20554
 Project: Spark
  Issue Type: Improvement
  Components: ML, Spark Core, SQL, Structured Streaming
Affects Versions: 2.1.0
Reporter: Sean Owen
Priority: Minor


In several parts of the code we have imported 
{{scala.language.reflectiveCalls}} to suppress a warning about, well, 
reflective calls. I know from cleaning up build warnings in 2.2 that almost all 
cases of this are inadvertent and are masking a type problem.

Example, in HiveDDLSuite:

{code}
val expectedTablePath =
  if (dbPath.isEmpty) {
hiveContext.sessionState.catalog.defaultTablePath(tableIdentifier)
  } else {
new Path(new Path(dbPath.get), tableIdentifier.table)
  }
val filesystemPath = new Path(expectedTablePath.toString)
{code}

This shouldn't really work because one branch returns a URI and the other a 
Path. In this case the code only needs an object with a toString method, and the 
compiler makes that work via structural types and reflection.

Obviously, though, the intent was to add ".toURI" to the second branch to make 
both a URI!

I think we should probably clean this up by taking out all imports of 
reflectiveCalls, and re-evaluating all of the warnings. There may be a few 
legit usages.
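For illustration, a small self-contained example of the kind of structural-type 
call this import silences (not code from the Spark tree):

{code}
import scala.language.reflectiveCalls

// A structural type: "anything with a close() method". Calling close() on it
// is compiled as a reflective call, which is what the import suppresses.
def closeQuietly(resource: { def close(): Unit }): Unit =
  try resource.close() catch { case _: Exception => () }

closeQuietly(new java.io.ByteArrayInputStream(Array[Byte](1, 2, 3)))
{code}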




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20553) Update ALS examples for ML to illustrate recommend all

2017-05-02 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-20553:
--

 Summary: Update ALS examples for ML to illustrate recommend all
 Key: SPARK-20553
 URL: https://issues.apache.org/jira/browse/SPARK-20553
 Project: Spark
  Issue Type: Documentation
  Components: ML, PySpark
Affects Versions: 2.2.0
Reporter: Nick Pentreath
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20300) Python API for ALSModel.recommendForAllUsers,Items

2017-05-02 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-20300.

   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17622
[https://github.com/apache/spark/pull/17622]

> Python API for ALSModel.recommendForAllUsers,Items
> --
>
> Key: SPARK-20300
> URL: https://issues.apache.org/jira/browse/SPARK-20300
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
> Fix For: 2.2.0
>
>
> Python API for ALSModel methods recommendForAllUsers, recommendForAllItems
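For context, a sketch of the Scala-side methods that gain a Python API here 
(assumes an already fitted {{ALSModel}}):

{code}
import org.apache.spark.ml.recommendation.ALSModel

def showTopK(model: ALSModel, k: Int): Unit = {
  model.recommendForAllUsers(k).show(5, truncate = false)  // top-k items per user
  model.recommendForAllItems(k).show(5, truncate = false)  // top-k users per item
}
{code}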



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20411) New features for expression.scalalang.typed

2017-05-02 Thread Jason Moore (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992517#comment-15992517
 ] 

Jason Moore commented on SPARK-20411:
-

And, ideally, anything else within org.apache.spark.sql.functions (e.g. 
countDistinct).  We're looking to replace our use of DataFrames with Datasets, 
which means finding a replacement for all the aggregation functions that we 
use.  If I end up putting together some functions myself, I'll pop back here to 
contribute them.

> New features for expression.scalalang.typed
> ---
>
> Key: SPARK-20411
> URL: https://issues.apache.org/jira/browse/SPARK-20411
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.1.0
>Reporter: Loic Descotte
>Priority: Minor
>
> In Spark 2 it is possible to use typed expressions for aggregation methods: 
> {code}
> import org.apache.spark.sql.expressions.scalalang._ 
> dataset.groupByKey(_.productId).agg(typed.sum[Token](_.score)).toDF("productId",
>  "sum").orderBy('productId).show
> {code}
> It seems that only avg, count and sum are defined : 
> https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/expressions/scalalang/typed.html
> It is very nice to be able to use a typesafe DSL, but it would be good to 
> have more possibilities, like min and max functions.
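As a sketch of the kind of addition being asked for (and of the current 
workaround), a typed min can be built on the public {{Aggregator}} API; 
{{Token}} here is the class from the example above:

{code}
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// A typed "min" over a Double field, analogous to typed.sum / typed.avg.
class TypedMin[IN](f: IN => Double) extends Aggregator[IN, Double, Double] {
  def zero: Double = Double.PositiveInfinity
  def reduce(b: Double, a: IN): Double = math.min(b, f(a))
  def merge(b1: Double, b2: Double): Double = math.min(b1, b2)
  def finish(r: Double): Double = r
  def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Usage:
// dataset.groupByKey(_.productId).agg(new TypedMin[Token](_.score).toColumn).show()
{code}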



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20443) The blockSize of MLLIB ALS should be settable by the User

2017-05-02 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992497#comment-15992497
 ] 

Nick Pentreath commented on SPARK-20443:


Interesting - though it appears to me that {{2048}} is the best setting for 
both data sizes. At the least I think we should adjust the default.

> The blockSize of MLLIB ALS should be settable by the User
> -
>
> Key: SPARK-20443
> URL: https://issues.apache.org/jira/browse/SPARK-20443
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> The blockSize of MLLIB ALS is very important for ALS performance. 
> In our test, when the blockSize is 128, the performance is about 4X compared 
> with a blockSize of 4096 (the default value).
> The following are our test results: 
> BlockSize(recommendationForAll time)
> 128(124s), 256(160s), 512(184s), 1024(244s), 2048(332s), 4096(488s), 8192(OOM)
> The Test Environment:
> 3 workers: each worker has 10 cores, 30G memory, and 1 executor.
> The Data: User 480,000, and Item 17,000



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20443) The blockSize of MLLIB ALS should be settable by the User

2017-05-02 Thread Teng Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992492#comment-15992492
 ] 

Teng Jiang commented on SPARK-20443:


All the tests above were done with SPARK-11968 / [PR #17742 | 
https://github.com/apache/spark/pull/17742]. 
The blockSize still matters, considering the number of data fetches per 
iteration and the GC time.

> The blockSize of MLLIB ALS should be settable by the User
> -
>
> Key: SPARK-20443
> URL: https://issues.apache.org/jira/browse/SPARK-20443
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> The blockSize of MLLIB ALS is very important for ALS performance. 
> In our test, when the blockSize is 128, the performance is about 4X compared 
> with a blockSize of 4096 (the default value).
> The following are our test results: 
> BlockSize(recommendationForAll time)
> 128(124s), 256(160s), 512(184s), 1024(244s), 2048(332s), 4096(488s), 8192(OOM)
> The Test Environment:
> 3 workers: each worker has 10 cores, 30G memory, and 1 executor.
> The Data: User 480,000, and Item 17,000



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20551) ImportError adding custom class from jar in pyspark

2017-05-02 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992491#comment-15992491
 ] 

Nick Pentreath commented on SPARK-20551:


Yes, I agree - it appears you're trying to import Java or Scala classes in
Python, which won't work. I suggest you post a question to the Spark user list
(u...@spark.apache.org) asking for help. Please indicate what you are trying
to do and provide some example code (it appears that you're trying to read a
custom Hadoop {{InputFormat}} in PySpark?).
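
For context, a rough Scala sketch of why those imports are useful in
spark-shell: there the classes live on the JVM and are handed to an RDD API
such as {{sc.hadoopFile}}. The key/value types below
({{LongWritable}}/{{ObjectWritable}}) and the assumption that
{{PcapInputFormat}} implements the old mapred API are guesses - check the
hadoop-pcap documentation. From PySpark, the usual route is
{{SparkContext.hadoopFile}}/{{newAPIHadoopFile}} with the class names passed
as strings, not a Python {{import}}.

{code}
// spark-shell sketch, assuming hadoop-pcap is on the classpath via --jars,
// and that PcapInputFormat extends the old mapred FileInputFormat producing
// (LongWritable, ObjectWritable) records - both of which are assumptions.
import net.ripe.hadoop.pcap.io.PcapInputFormat
import org.apache.hadoop.io.{LongWritable, ObjectWritable}

val packets = sc.hadoopFile[LongWritable, ObjectWritable, PcapInputFormat]("hdfs:///captures/*.pcap")
packets.take(5).foreach(println)
{code}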

> ImportError adding custom class from jar in pyspark
> ---
>
> Key: SPARK-20551
> URL: https://issues.apache.org/jira/browse/SPARK-20551
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Shell
>Affects Versions: 2.1.0
>Reporter: Sergio Monteiro
>
> The following imports are failing in PySpark, even when I set --jars or
> --driver-class-path:
> import net.ripe.hadoop.pcap.io.PcapInputFormat
> import net.ripe.hadoop.pcap.io.CombinePcapInputFormat
> import net.ripe.hadoop.pcap.packet.Packet
> Using Python version 2.7.12 (default, Nov 19 2016 06:48:10)
> SparkSession available as 'spark'.
> >>> import net.ripe.hadoop.pcap.io.PcapInputFormat
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> ImportError: No module named net.ripe.hadoop.pcap.io.PcapInputFormat
> >>> import net.ripe.hadoop.pcap.io.CombinePcapInputFormat
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> ImportError: No module named net.ripe.hadoop.pcap.io.CombinePcapInputFormat
> >>> import net.ripe.hadoop.pcap.packet.Packet
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> ImportError: No module named net.ripe.hadoop.pcap.packet.Packet
> >>>
> The same works great in spark-shell.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-20551) ImportError adding custom class from jar in pyspark

2017-05-02 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath closed SPARK-20551.
--
Resolution: Not A Problem

> ImportError adding custom class from jar in pyspark
> ---
>
> Key: SPARK-20551
> URL: https://issues.apache.org/jira/browse/SPARK-20551
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Shell
>Affects Versions: 2.1.0
>Reporter: Sergio Monteiro
>
> The following imports are failing in PySpark, even when I set --jars or
> --driver-class-path:
> import net.ripe.hadoop.pcap.io.PcapInputFormat
> import net.ripe.hadoop.pcap.io.CombinePcapInputFormat
> import net.ripe.hadoop.pcap.packet.Packet
> Using Python version 2.7.12 (default, Nov 19 2016 06:48:10)
> SparkSession available as 'spark'.
> >>> import net.ripe.hadoop.pcap.io.PcapInputFormat
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> ImportError: No module named net.ripe.hadoop.pcap.io.PcapInputFormat
> >>> import net.ripe.hadoop.pcap.io.CombinePcapInputFormat
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> ImportError: No module named net.ripe.hadoop.pcap.io.CombinePcapInputFormat
> >>> import net.ripe.hadoop.pcap.packet.Packet
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> ImportError: No module named net.ripe.hadoop.pcap.packet.Packet
> >>>
> The same works great in spark-shell.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20552) Add isNotDistinctFrom/isDistinctFrom for column APIs in Scala/Java and Python

2017-05-02 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-20552.
--
Resolution: Won't Fix

I am resolving this. Please refer to the discussion in the PR.

>  Add isNotDistinctFrom/isDistinctFrom for column APIs in Scala/Java and Python
> --
>
> Key: SPARK-20552
> URL: https://issues.apache.org/jira/browse/SPARK-20552
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> After SPARK-20463, we are able to use {{IS [NOT] DISTINCT FROM}} in Spark SQL.
> It looks like we should add {{isNotDistinctFrom}} (as an alias for
> {{eqNullSafe}}) and {{isDistinctFrom}} (a negated {{eqNullSafe}}) to the
> Column APIs in both Scala/Java and Python.
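
A brief Scala sketch of the existing null-safe equality and what the proposed
aliases would have mapped to; the {{isNotDistinctFrom}}/{{isDistinctFrom}}
names below exist only in this proposal and are not part of the API.

{code}
// Assumes a spark-shell session (SparkSession available as `spark`).
import spark.implicits._
import org.apache.spark.sql.functions.not

val df = Seq[(java.lang.Integer, java.lang.Integer)]((1, 1), (null, 1), (null, null))
  .toDF("a", "b")

// Existing API: null-safe equality, i.e. SQL's `a IS NOT DISTINCT FROM b`.
df.select($"a" <=> $"b", $"a".eqNullSafe($"b")).show()

// Today's spelling of `a IS DISTINCT FROM b`:
df.select(not($"a" <=> $"b")).show()

// Proposed aliases (hypothetical, never merged):
//   $"a".isNotDistinctFrom($"b")   ==  $"a".eqNullSafe($"b")
//   $"a".isDistinctFrom($"b")      ==  not($"a".eqNullSafe($"b"))
{code}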



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20443) The blockSize of MLlib ALS should be settable by the user

2017-05-02 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992475#comment-15992475
 ] 

Nick Pentreath commented on SPARK-20443:


Were these tests run against the current master? SPARK-11968 / [PR
#17742|https://github.com/apache/spark/pull/17742] should make the block size
less relevant - we should of course re-test this once that PR is merged, to
see whether it's worth exposing the parameter.

> The blockSize of MLlib ALS should be settable by the user
> -
>
> Key: SPARK-20443
> URL: https://issues.apache.org/jira/browse/SPARK-20443
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> The blockSize of MLlib ALS is very important for ALS performance.
> In our test, with a blockSize of 128 the run is about 4x faster than with
> the default blockSize of 4096.
> The following are our test results:
> BlockSize (recommendationForAll time):
> 128 (124s), 256 (160s), 512 (184s), 1024 (244s), 2048 (332s), 4096 (488s), 8192 (OOM)
> The test environment:
> 3 workers: each worker has 10 cores, 30G memory, and 1 executor.
> The data: 480,000 users and 17,000 items.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20443) The blockSize of MLlib ALS should be settable by the user

2017-05-02 Thread Teng Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992471#comment-15992471
 ] 

Teng Jiang commented on SPARK-20443:


I did some tests on the blockSize.
The test environment is:
3 workers: each worker has 40 cores, 180G memory, and 1 executor.
The data: 3,290,000 users and 208,000 items.
The results are:
blockSize   rank=10     rank=100
128         67.32min    127.66min
256         46.68min    87.67min
512         35.66min    63.46min
1024        28.49min    41.61min
2048        22.83min    34.76min
4096        22.39min    54.43min
8192        23.35min    71.09min

Another dataset has 480,000 users and 17,000 items, with rank set to 10:
blockSize   128     256     512     1024    2048    4096    8192
time (s)    98.2    70.4    52.7    45.3    45.0    60.5    67.3

For both datasets, as the blockSize grows from 128 to 8192, the recommendation
time first decreases and then increases.
Therefore, the optimal blockSize differs from dataset to dataset.
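
For reference, a minimal Scala sketch of how the recommendation times above
could be measured, given an already trained model (training itself is sketched
earlier in this thread); the {{count()}} forces the full job to run. The
helper name {{timeRecommend}} is purely illustrative.

{code}
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

// Time one full recommendProductsForUsers pass for a trained model.
def timeRecommend(model: MatrixFactorizationModel, num: Int = 10): Double = {
  val start = System.nanoTime()
  model.recommendProductsForUsers(num).count()   // count() forces evaluation
  (System.nanoTime() - start) / 1e9
}

// println(s"recommendProductsForUsers took ${timeRecommend(model)} s")
{code}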


> The blockSize of MLlib ALS should be settable by the user
> -
>
> Key: SPARK-20443
> URL: https://issues.apache.org/jira/browse/SPARK-20443
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> The blockSize of MLlib ALS is very important for ALS performance.
> In our test, with a blockSize of 128 the run is about 4x faster than with
> the default blockSize of 4096.
> The following are our test results:
> BlockSize (recommendationForAll time):
> 128 (124s), 256 (160s), 512 (184s), 1024 (244s), 2048 (332s), 4096 (488s), 8192 (OOM)
> The test environment:
> 3 workers: each worker has 10 cores, 30G memory, and 1 executor.
> The data: 480,000 users and 17,000 items.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20549) java.io.CharConversionException: Invalid UTF-32 in JsonToStructs

2017-05-02 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20549.
-
   Resolution: Fixed
 Assignee: Burak Yavuz
Fix Version/s: 2.3.0
   2.2.1

> java.io.CharConversionException: Invalid UTF-32 in JsonToStructs
> 
>
> Key: SPARK-20549
> URL: https://issues.apache.org/jira/browse/SPARK-20549
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 2.2.1, 2.3.0
>
>
> The same fix as in SPARK-16548 needs to be applied to JsonToStructs.
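
For context, a minimal Scala sketch of the code path involved - {{from_json}}
is the user-facing function backed by {{JsonToStructs}}. The schema and input
here are illustrative only; the point of the fix is that undecodable bytes
(e.g. a stray UTF-32 sequence) should yield null rather than throw
{{java.io.CharConversionException}}.

{code}
// Assumes a spark-shell session (SparkSession available as `spark`).
import spark.implicits._
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StructType}

val schema = new StructType().add("a", IntegerType)

Seq("""{"a": 1}""", """not json""").toDF("json")
  .select(from_json($"json", schema).as("parsed"))
  .show()
// With the fix, malformed or undecodable input produces a null `parsed`
// value instead of failing the whole query.
{code}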



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org