[jira] [Resolved] (SPARK-20567) Failure to bind when using explode and collect_set in streaming
[ https://issues.apache.org/jira/browse/SPARK-20567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-20567. --- Resolution: Fixed Fix Version/s: 2.2.0 > Failure to bind when using explode and collect_set in streaming > --- > > Key: SPARK-20567 > URL: https://issues.apache.org/jira/browse/SPARK-20567 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Michael Armbrust >Assignee: Michael Armbrust >Priority: Critical > Fix For: 2.2.0 > > > Here is a small test case: > {code} > test("count distinct") { > val inputData = MemoryStream[(Int, Seq[Int])] > val aggregated = > inputData.toDF() > .select($"*", explode($"_2") as 'value) > .groupBy($"_1") > .agg(size(collect_set($"value"))) > .as[(Int, Int)] > testStream(aggregated, Update)( > AddData(inputData, (1, Seq(1, 2))), > CheckLastBatch((1, 2)) > ) > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20573) --packages fails when transitive dependency can only be resolved from repository specified in POM's tag
Josh Rosen created SPARK-20573: -- Summary: --packages fails when transitive dependency can only be resolved from repository specified in POM's tag Key: SPARK-20573 URL: https://issues.apache.org/jira/browse/SPARK-20573 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 2.1.0, 2.0.0 Reporter: Josh Rosen With a clean Ivy cache, run the following command: {code} ./bin/spark-shell --packages com.twitter.elephantbird:elephant-bird-core:4.4 {code} This will fail with {{unresolved dependency: com.hadoop.gplcompression#hadoop-lzo;0.4.16: not found}}. If you look at the elephant-bird-core POM (at http://central.maven.org/maven2/com/twitter/elephantbird/elephant-bird-core/4.4/elephant-bird-core-4.4.pom) you'll see a direct dependency on hadoop-lzo. This library is only present in Twitter's public Maven repository, hosted at http://maven.twttr.com. The elephant-bird-core POM does not directly declare Twitter's external repository. Instead, that external repository is inherited from elephant-bird-core's parent POM (at http://central.maven.org/maven2/com/twitter/elephantbird/elephant-bird/4.4/elephant-bird-4.4.pom). 
From the Ivy output it looks like it didn't even attempt to resolve from the Twitter repo: {code} :: problems summary :: WARNINGS module not found: com.hadoop.gplcompression#hadoop-lzo;0.4.16 local-m2-cache: tried file:/Users/joshrosen/.m2/repository/com/hadoop/gplcompression/hadoop-lzo/0.4.16/hadoop-lzo-0.4.16.pom -- artifact com.hadoop.gplcompression#hadoop-lzo;0.4.16!hadoop-lzo.jar: file:/Users/joshrosen/.m2/repository/com/hadoop/gplcompression/hadoop-lzo/0.4.16/hadoop-lzo-0.4.16.jar local-ivy-cache: tried /Users/joshrosen/.ivy2/local/com.hadoop.gplcompression/hadoop-lzo/0.4.16/ivys/ivy.xml -- artifact com.hadoop.gplcompression#hadoop-lzo;0.4.16!hadoop-lzo.jar: /Users/joshrosen/.ivy2/local/com.hadoop.gplcompression/hadoop-lzo/0.4.16/jars/hadoop-lzo.jar central: tried https://repo1.maven.org/maven2/com/hadoop/gplcompression/hadoop-lzo/0.4.16/hadoop-lzo-0.4.16.pom -- artifact com.hadoop.gplcompression#hadoop-lzo;0.4.16!hadoop-lzo.jar: https://repo1.maven.org/maven2/com/hadoop/gplcompression/hadoop-lzo/0.4.16/hadoop-lzo-0.4.16.jar spark-packages: tried http://dl.bintray.com/spark-packages/maven/com/hadoop/gplcompression/hadoop-lzo/0.4.16/hadoop-lzo-0.4.16.pom -- artifact com.hadoop.gplcompression#hadoop-lzo;0.4.16!hadoop-lzo.jar: http://dl.bintray.com/spark-packages/maven/com/hadoop/gplcompression/hadoop-lzo/0.4.16/hadoop-lzo-0.4.16.jar :: :: UNRESOLVED DEPENDENCIES :: :: :: com.hadoop.gplcompression#hadoop-lzo;0.4.16: not found :: {code} If you manually specify the Twitter repository as an additional external repository then everything works fine. This is a somewhat frustrating behavior from an end-user's point of view because unless they dig through the POMs themselves it is not obvious why things are broken or how to fix them. When Maven resolves this coordinate it properly fetches the transitive dependencies from the additional repositories specified in the referencing POMs. 
My hunch is that this behavior is caused by either a bug in Ivy itself or a bug in Spark's usage / configuration of the embedded Ivy resolver. It would be great to see if we can find other test-cases to narrow down the scope of the bug. I'm wondering whether POM-specified repositories will work if they're specified in the POM of the top-level dependency being resolved. It would also be useful to determine whether Ivy handles additional repositories in the top-level of transitive dependencies' POMs: maybe the problem is the specific combination of transitive dep + repository inherited from that dep's parent POM. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
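The hypothesis above, that the repository inherited from a parent POM is the part Ivy misses, can be sketched with a toy model. This is a hedged illustration under that assumption only (the POM data below is a simplified stand-in for the elephant-bird POMs; it is not Ivy's or Maven's real resolution code):

```python
# Toy model of the suspected bug: the resolver only sees repositories
# declared directly in a dependency's POM and misses repositories
# inherited from that POM's parent.
POMS = {
    "elephant-bird-core": {"repos": [], "parent": "elephant-bird"},
    "elephant-bird": {"repos": ["http://maven.twttr.com"], "parent": None},
}

def repos_seen_by_buggy_resolver(artifact):
    # Bug: the parent POM's repositories are never merged in.
    return list(POMS[artifact]["repos"])

def repos_seen_by_maven(artifact):
    # Maven walks the parent chain and merges repositories from each level.
    repos, cur = [], artifact
    while cur is not None:
        repos.extend(POMS[cur]["repos"])
        cur = POMS[cur]["parent"]
    return repos

print(repos_seen_by_buggy_resolver("elephant-bird-core"))  # [] -> hadoop-lzo not found
print(repos_seen_by_maven("elephant-bird-core"))           # ['http://maven.twttr.com']
```

Under this model the buggy resolver never learns about maven.twttr.com, which matches the observed "not found" failure, while Maven's parent-chain merge succeeds.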
[jira] [Commented] (SPARK-4836) Web UI should display separate information for all stage attempts
[ https://issues.apache.org/jira/browse/SPARK-4836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15994213#comment-15994213 ] Josh Rosen commented on SPARK-4836: --- [~ckadner], I'm pretty sure that this is still a problem. Regarding reproductions, you should be able to trigger this by triggering a fetch failure: run a shuffle stage, then delete some random portion of shuffle outputs while the reduce stage is running. The reduce stage should fail and re-run the previous map stage, leading to a second stage attempt. > Web UI should display separate information for all stage attempts > - > > Key: SPARK-4836 > URL: https://issues.apache.org/jira/browse/SPARK-4836 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.1.1, 1.2.0 >Reporter: Josh Rosen > > I've run into some cases where the web UI job page will say that a job took > 12 minutes but the sum of that job's stage times is something like 10 > seconds. In this case, it turns out that my job ran a stage to completion > (which took, say, 5 minutes) then lost some partitions of that stage and had > to run a new stage attempt to recompute one or two tasks from that stage. As > a result, the latest attempt for that stage reports only one or two tasks. > In the web UI, it seems that we only show the latest stage attempt, not all > attempts, which can lead to confusing / misleading displays for jobs with > failed / partially-recomputed stages. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20572) Spark Streaming fails to read file on HDFS
LvDongrong created SPARK-20572: -- Summary: Spark Streaming fails to read file on HDFS Key: SPARK-20572 URL: https://issues.apache.org/jira/browse/SPARK-20572 Project: Spark Issue Type: Bug Components: DStreams Affects Versions: 2.1.0 Reporter: LvDongrong When I move a file on HDFS into the target directory that Spark Streaming reads from, Spark Streaming fails to read the file. I then found that the Spark Streaming HDFS interface only reads files whose modification time falls within the interval of the current batch, but when we move a file into the target directory, its modification time never changes. Perhaps the method used to find new files is not appropriate. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
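The behavior the reporter describes can be sketched as follows. This is a minimal Python model of modification-time-based file discovery (an illustration only, not Spark's actual FileInputDStream code), showing why a moved file whose modification time predates the batch window is never picked up:

```python
# Select files whose modification time falls inside the current batch window.
# A file moved into the directory keeps its old mtime, so it is skipped.
def new_files(files, window_start, window_end):
    """files: dict of name -> modification time."""
    return [name for name, mtime in files.items()
            if window_start <= mtime < window_end]

files = {"moved.json": 100, "fresh.json": 205}  # name -> mtime
print(new_files(files, 200, 210))  # ['fresh.json'] -- the moved file is missed
```

This is why copying (or otherwise refreshing the modification time) makes the file visible while a rename/move does not.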
[jira] [Commented] (SPARK-12009) Avoid re-allocate yarn container while driver want to stop all Executors
[ https://issues.apache.org/jira/browse/SPARK-12009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15994199#comment-15994199 ] Ethan Xu commented on SPARK-12009: -- I'm getting a similar error message with Spark 2.1.0. I can't reproduce it: the exact same code worked fine on a small RDD (a sample), but sometimes gave this error on a large RDD after hours of running. It's very frustrating. > Avoid re-allocate yarn container while driver want to stop all Executors > > > Key: SPARK-12009 > URL: https://issues.apache.org/jira/browse/SPARK-12009 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.2 >Reporter: SuYan >Assignee: SuYan >Priority: Minor > Fix For: 2.0.0 > > > Log based 1.4.0 > 2015-11-26,03:05:16,176 WARN > org.spark-project.jetty.util.thread.QueuedThreadPool: 8 threads could not be > stopped > 2015-11-26,03:05:16,177 INFO org.apache.spark.ui.SparkUI: Stopped Spark web > UI at http:// > 2015-11-26,03:05:16,401 INFO org.apache.spark.scheduler.DAGScheduler: > Stopping DAGScheduler > 2015-11-26,03:05:16,450 INFO > org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend: Shutting down > all executors > 2015-11-26,03:05:16,525 INFO > org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend: Asking each > executor to shut down > 2015-11-26,03:05:16,791 INFO > org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint: Driver terminated > or disconnected! Shutting down. XX.XX.XX.XX:38734 > 2015-11-26,03:05:16,847 ERROR org.apache.spark.scheduler.LiveListenerBus: > SparkListenerBus has already stopped! 
Dropping event > SparkListenerExecutorMetricsUpdate(164,WrappedArray()) > 2015-11-26,03:05:27,242 INFO org.apache.spark.deploy.yarn.YarnAllocator: Will > request 13 executor containers, each with 1 cores and 4608 MB memory > including 1024 MB overhead -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12216) Spark failed to delete temp directory
[ https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15994192#comment-15994192 ] Supriya Pasham edited comment on SPARK-12216 at 5/3/17 2:46 AM: Hi Team, I am executing 'spark-submit' with a jar and a properties file in the below manner -> spark-submit --class package.classname --master local[*] \Spark.jar data.properties When I run the above command, 2-3 exceptions are immediately displayed in the command prompt with the exception details below. I have seen that this issue is marked as resolved, but I didn't find a correct resolution. Please let me know if there is a solution to this issue - ERROR ShutdownHookManager: Exception while deleting Spark temp dir: C:\Users\user1\AppData\Local\Temp\spark-5e37d680-2e9f-4aed-ac59-2f24d8387855 java.io.IOException: Failed to delete: C:\Users\user1\AppData\Local\Temp\spark-5e37d680-2e9f-4aed-ac59-2f24d8387855 Environment details: I am running the commands on a Windows 7 machine. Please provide a solution as soon as possible. was (Author: supriya): Hi Team, I am executing 'spark-submit' with a jar and a properties file in the below manner -> spark-submit --class package.classname --master local[*] \Spark.jar data.properties When I run the above command, 2-3 exceptions are immediately displayed in the command prompt with the exception details below. I have seen that this issue is marked as resolved, but I didn't find a correct resolution. Please let me know if there is a solution to this issue - ERROR ShutdownHookManager: Exception while deleting Spark temp dir: C:\Users\user1\AppData\Local\Temp\spark-5e37d680-2e9f-4aed-ac59-2f24d8387855 java.io.IOException: Failed to delete: C:\Users\user1\AppData\Local\Temp\spark-5e37d680-2e9f-4aed-ac59-2f24d8387855 Please provide a solution as soon as possible. 
> Spark failed to delete temp directory > -- > > Key: SPARK-12216 > URL: https://issues.apache.org/jira/browse/SPARK-12216 > Project: Spark > Issue Type: Bug > Components: Spark Shell > Environment: windows 7 64 bit > Spark 1.52 > Java 1.8.0.65 > PATH includes: > C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin > C:\ProgramData\Oracle\Java\javapath > C:\Users\Stefan\scala\bin > SYSTEM variables set are: > JAVA_HOME=C:\Program Files\Java\jre1.8.0_65 > HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin > (where the bin\winutils resides) > both \tmp and \tmp\hive have permissions > drwxrwxrwx as detected by winutils ls >Reporter: stefan >Priority: Minor > > The mailing list archives have no obvious solution to this: > scala> :q > Stopping spark context. > 15/12/08 16:24:22 ERROR ShutdownHookManager: Exception while deleting Spark > temp dir: > C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff > java.io.IOException: Failed to delete: > C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:60) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234) > at > 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234) > at > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234) > at scala.util.Try$.apply(Try.scala:161) > at > org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216) > at > org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (SPARK-12216) Spark failed to delete temp directory
[ https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15994192#comment-15994192 ] Supriya Pasham commented on SPARK-12216: Hi Team, I am executing 'spark-submit' with a jar and a properties file in the below manner -> spark-submit --class package.classname --master local[*] \Spark.jar data.properties When I run the above command, 2-3 exceptions are immediately displayed in the command prompt with the exception details below. I have seen that this issue is marked as resolved, but I didn't find a correct resolution. Please let me know if there is a solution to this issue - ERROR ShutdownHookManager: Exception while deleting Spark temp dir: C:\Users\user1\AppData\Local\Temp\spark-5e37d680-2e9f-4aed-ac59-2f24d8387855 java.io.IOException: Failed to delete: C:\Users\user1\AppData\Local\Temp\spark-5e37d680-2e9f-4aed-ac59-2f24d8387855 Please provide a solution as soon as possible. > Spark failed to delete temp directory > -- > > Key: SPARK-12216 > URL: https://issues.apache.org/jira/browse/SPARK-12216 > Project: Spark > Issue Type: Bug > Components: Spark Shell > Environment: windows 7 64 bit > Spark 1.52 > Java 1.8.0.65 > PATH includes: > C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin > C:\ProgramData\Oracle\Java\javapath > C:\Users\Stefan\scala\bin > SYSTEM variables set are: > JAVA_HOME=C:\Program Files\Java\jre1.8.0_65 > HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin > (where the bin\winutils resides) > both \tmp and \tmp\hive have permissions > drwxrwxrwx as detected by winutils ls >Reporter: stefan >Priority: Minor > > The mailing list archives have no obvious solution to this: > scala> :q > Stopping spark context. 
> 15/12/08 16:24:22 ERROR ShutdownHookManager: Exception while deleting Spark > temp dir: > C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff > java.io.IOException: Failed to delete: > C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:60) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234) > at > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234) > at scala.util.Try$.apply(Try.scala:161) > at > org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216) > at > org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) -- This message 
was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
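The failure mode above (a directory delete aborting because another process still holds a file handle, which is common on Windows) can be illustrated with a best-effort cleanup sketch. This is a hedged illustration of the ignore-errors mitigation pattern, not Spark's actual ShutdownHookManager code:

```python
import os
import shutil
import tempfile

# Best-effort cleanup: ignore per-file deletion errors (e.g. files still
# held open by another process on Windows) instead of failing the whole
# shutdown hook with an IOException-style error.
def best_effort_delete(path):
    shutil.rmtree(path, ignore_errors=True)
    return not os.path.exists(path)  # report whether cleanup fully succeeded

tmp = tempfile.mkdtemp(prefix="spark-demo-")
print(best_effort_delete(tmp))  # True when nothing holds files open
```

With this pattern, a locked file leaves some residue in the temp directory but does not surface as a hard error at shutdown.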
[jira] [Commented] (SPARK-20571) Flaky SparkR StructuredStreaming tests
[ https://issues.apache.org/jira/browse/SPARK-20571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15994185#comment-15994185 ] Burak Yavuz commented on SPARK-20571: - cc [~felixcheung] > Flaky SparkR StructuredStreaming tests > -- > > Key: SPARK-20571 > URL: https://issues.apache.org/jira/browse/SPARK-20571 > Project: Spark > Issue Type: Test > Components: SparkR, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Burak Yavuz > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76399 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20571) Flaky SparkR StructuredStreaming tests
Burak Yavuz created SPARK-20571: --- Summary: Flaky SparkR StructuredStreaming tests Key: SPARK-20571 URL: https://issues.apache.org/jira/browse/SPARK-20571 Project: Spark Issue Type: Test Components: SparkR, Structured Streaming Affects Versions: 2.2.0 Reporter: Burak Yavuz https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76399 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20558) clear InheritableThreadLocal variables in SparkContext when stopping it
[ https://issues.apache.org/jira/browse/SPARK-20558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20558. - Resolution: Fixed Fix Version/s: 2.3.0 2.2.1 2.1.2 2.0.3 > clear InheritableThreadLocal variables in SparkContext when stopping it > --- > > Key: SPARK-20558 > URL: https://issues.apache.org/jira/browse/SPARK-20558 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.2, 2.1.0, 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.3, 2.1.2, 2.2.1, 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20570) The main version number on docs/latest/index.html
liucht-inspur created SPARK-20570: - Summary: The main version number on docs/latest/index.html Key: SPARK-20570 URL: https://issues.apache.org/jira/browse/SPARK-20570 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 2.1.1 Reporter: liucht-inspur On the spark.apache.org home page, when I click the Latest Release (Spark 2.1.1) item under the Documentation menu, the resulting docs/latest page displays a 2.1.0 label in the upper left corner of the page. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20569) In spark-sql, some functions can execute successfully when the number of input parameters is wrong
liuxian created SPARK-20569: --- Summary: In spark-sql, some functions can execute successfully when the number of input parameters is wrong Key: SPARK-20569 URL: https://issues.apache.org/jira/browse/SPARK-20569 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: liuxian Priority: Trivial For example, {{select Nvl(null,'1',3);}} returns {{3}}. The "Nvl" function takes only two input parameters, so when three parameters are supplied, I think it should report: "Error in query: Invalid number of arguments for function nvl". Functions such as "nvl2", "nullIf", and "IfNull" have a similar problem. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
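What the reporter is asking for amounts to an arity check when the function call is analyzed. A minimal Python illustration of such a check (the arities and message text mirror the report; this is not Spark's analyzer code, and the registry below is a hypothetical stand-in):

```python
# Declared arity for each function; reject calls with the wrong argument count.
ARITY = {"nvl": 2, "nvl2": 3, "nullif": 2, "ifnull": 2}

def check_call(name, args):
    expected = ARITY[name.lower()]
    if len(args) != expected:
        raise ValueError(
            f"Invalid number of arguments for function {name.lower()}")
    return True

check_call("nvl", ["null", "'1'"])           # passes: two arguments
# check_call("nvl", ["null", "'1'", "3"])    # would raise ValueError
```

With a check like this, the three-argument call in the report would fail at analysis time instead of silently succeeding.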
[jira] [Updated] (SPARK-20568) Delete files after processing
[ https://issues.apache.org/jira/browse/SPARK-20568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saul Shanabrook updated SPARK-20568: Description: It would be great to be able to delete files after processing them with structured streaming. For example, I am reading in a bunch of JSON files and converting them into Parquet. If the JSON files are not deleted after they are processed, it quickly fills up my hard drive. I originally [posted this on Stack Overflow|http://stackoverflow.com/q/43671757/907060] and was recommended to make a feature request for it. was: It would be great to be able to delete files after processing them with structured streaming. For example, I am reading in a bunch of JSON files and converting them into Parquet. If the JSON files are not deleted after they are processed, it quickly fills up my hard drive. I originally [posted this on Stack Overflow](http://stackoverflow.com/q/43671757/907060) and was recommended to make a feature request for it. > Delete files after processing > - > > Key: SPARK-20568 > URL: https://issues.apache.org/jira/browse/SPARK-20568 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Saul Shanabrook > > It would be great to be able to delete files after processing them with > structured streaming. > For example, I am reading in a bunch of JSON files and converting them into > Parquet. If the JSON files are not deleted after they are processed, it > quickly fills up my hard drive. I originally [posted this on Stack > Overflow|http://stackoverflow.com/q/43671757/907060] and was recommended to > make a feature request for it. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20568) Delete files after processing
Saul Shanabrook created SPARK-20568: --- Summary: Delete files after processing Key: SPARK-20568 URL: https://issues.apache.org/jira/browse/SPARK-20568 Project: Spark Issue Type: New Feature Components: Structured Streaming Affects Versions: 2.1.0 Reporter: Saul Shanabrook It would be great to be able to delete files after processing them with structured streaming. For example, I am reading in a bunch of JSON files and converting them into Parquet. If the JSON files are not deleted after they are processed, it quickly fills up my hard drive. I originally [posted this on Stack Overflow](http://stackoverflow.com/q/43671757/907060) and was recommended to make a feature request for it. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
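The requested behavior is the process-then-delete pattern. A hedged sketch in plain Python (there is no such Structured Streaming option at the time of this report; this only illustrates the pattern the reporter wants, with hypothetical file names):

```python
import json
import os
import tempfile

# Process each input file, then delete it so the source directory
# does not keep filling the disk.
def process_and_delete(paths, handle):
    for p in paths:
        with open(p) as f:
            handle(json.load(f))
        os.remove(p)  # reclaim space once the record is safely processed

# Demo with one throwaway JSON file.
d = tempfile.mkdtemp()
p = os.path.join(d, "rec.json")
with open(p, "w") as f:
    json.dump({"x": 1}, f)

out = []
process_and_delete([p], out.append)
print(out, os.path.exists(p))  # [{'x': 1}] False
```

In a real streaming job the delete would have to happen only after the sink commits the batch, which is exactly why this belongs in the engine rather than in user code.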
[jira] [Assigned] (SPARK-20567) Failure to bind when using explode and collect_set in streaming
[ https://issues.apache.org/jira/browse/SPARK-20567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20567: Assignee: Michael Armbrust (was: Apache Spark) > Failure to bind when using explode and collect_set in streaming > --- > > Key: SPARK-20567 > URL: https://issues.apache.org/jira/browse/SPARK-20567 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Michael Armbrust >Assignee: Michael Armbrust >Priority: Critical > > Here is a small test case: > {code} > test("count distinct") { > val inputData = MemoryStream[(Int, Seq[Int])] > val aggregated = > inputData.toDF() > .select($"*", explode($"_2") as 'value) > .groupBy($"_1") > .agg(size(collect_set($"value"))) > .as[(Int, Int)] > testStream(aggregated, Update)( > AddData(inputData, (1, Seq(1, 2))), > CheckLastBatch((1, 2)) > ) > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20567) Failure to bind when using explode and collect_set in streaming
[ https://issues.apache.org/jira/browse/SPARK-20567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20567: Assignee: Apache Spark (was: Michael Armbrust) > Failure to bind when using explode and collect_set in streaming > --- > > Key: SPARK-20567 > URL: https://issues.apache.org/jira/browse/SPARK-20567 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Michael Armbrust >Assignee: Apache Spark >Priority: Critical > > Here is a small test case: > {code} > test("count distinct") { > val inputData = MemoryStream[(Int, Seq[Int])] > val aggregated = > inputData.toDF() > .select($"*", explode($"_2") as 'value) > .groupBy($"_1") > .agg(size(collect_set($"value"))) > .as[(Int, Int)] > testStream(aggregated, Update)( > AddData(inputData, (1, Seq(1, 2))), > CheckLastBatch((1, 2)) > ) > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20567) Failure to bind when using explode and collect_set in streaming
[ https://issues.apache.org/jira/browse/SPARK-20567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15994085#comment-15994085 ] Apache Spark commented on SPARK-20567: -- User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/17838 > Failure to bind when using explode and collect_set in streaming > --- > > Key: SPARK-20567 > URL: https://issues.apache.org/jira/browse/SPARK-20567 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Michael Armbrust >Assignee: Michael Armbrust >Priority: Critical > > Here is a small test case: > {code} > test("count distinct") { > val inputData = MemoryStream[(Int, Seq[Int])] > val aggregated = > inputData.toDF() > .select($"*", explode($"_2") as 'value) > .groupBy($"_1") > .agg(size(collect_set($"value"))) > .as[(Int, Int)] > testStream(aggregated, Update)( > AddData(inputData, (1, Seq(1, 2))), > CheckLastBatch((1, 2)) > ) > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20567) Failure to bind when using explode and collect_set in streaming
Michael Armbrust created SPARK-20567: Summary: Failure to bind when using explode and collect_set in streaming Key: SPARK-20567 URL: https://issues.apache.org/jira/browse/SPARK-20567 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Critical Here is a small test case: {code} test("count distinct") { val inputData = MemoryStream[(Int, Seq[Int])] val aggregated = inputData.toDF() .select($"*", explode($"_2") as 'value) .groupBy($"_1") .agg(size(collect_set($"value"))) .as[(Int, Int)] testStream(aggregated, Update)( AddData(inputData, (1, Seq(1, 2))), CheckLastBatch((1, 2)) ) } {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10878) Race condition when resolving Maven coordinates via Ivy
[ https://issues.apache.org/jira/browse/SPARK-10878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15993888#comment-15993888 ] Josh Rosen commented on SPARK-10878: [~jeeyoungk], my understanding is that there are two possible sources of races here: 1. Racing on the module descriptor / POM that we write out. Your suggestion of using a dummy string would address that. 2. Racing on writes to other files in the Ivy resolution cache. This is more likely to occur when the cache is initially clean and would manifest itself as corruption in other files. Given this, I think the most robust solution is to use separate resolution caches since that would solve both problems in one fell swoop. If that's going to be hard then the partial fix of randomizing the module descriptor version (say, to use a timestamp) isn't a bad idea because it's pretty cheap to implement and probably covers the majority of real-world races here. > Race condition when resolving Maven coordinates via Ivy > --- > > Key: SPARK-10878 > URL: https://issues.apache.org/jira/browse/SPARK-10878 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.0 >Reporter: Ryan Williams >Priority: Minor > > I've recently been shell-scripting the creation of many concurrent > Spark-on-YARN apps and observing a fraction of them to fail with what I'm > guessing is a race condition in their Maven-coordinate resolution. 
> For example, I might spawn an app for each path in file {{paths}} with the > following shell script: > {code} > cat paths | parallel "$SPARK_HOME/bin/spark-submit foo.jar {}" > {code} > When doing this, I observe some fraction of the spawned jobs to fail with > errors like: > {code} > :: retrieving :: org.apache.spark#spark-submit-parent > confs: [default] > Exception in thread "main" java.lang.RuntimeException: problem during > retrieve of org.apache.spark#spark-submit-parent: java.text.ParseException: > failed to parse report: > /hpc/users/willir31/.ivy2/cache/org.apache.spark-spark-submit-parent-default.xml: > Premature end of file. > at > org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:249) > at > org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:83) > at org.apache.ivy.Ivy.retrieve(Ivy.java:551) > at > org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1006) > at > org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.text.ParseException: failed to parse report: > /hpc/users/willir31/.ivy2/cache/org.apache.spark-spark-submit-parent-default.xml: > Premature end of file. > at > org.apache.ivy.plugins.report.XmlReportParser.parse(XmlReportParser.java:293) > at > org.apache.ivy.core.retrieve.RetrieveEngine.determineArtifactsToCopy(RetrieveEngine.java:329) > at > org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:118) > ... 7 more > Caused by: org.xml.sax.SAXParseException; Premature end of file. 
> at > org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown > Source) > at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown > Source) > at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source) > {code} > The more apps I try to launch simultaneously, the greater fraction of them > seem to fail with this or similar errors; a batch of ~10 will usually work > fine, a batch of 15 will see a few failures, and a batch of ~60 will have > dozens of failures. > [This gist shows 11 recent failures I > observed|https://gist.github.com/ryan-williams/648bff70e518de0c7c84]. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
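Josh's cheaper mitigation, randomizing the synthetic module descriptor's revision, can be sketched as follows. The class and names below are illustrative only, not Spark's actual internals; a timestamp plus a process-local counter makes two concurrent {{spark-submit}} invocations vanishingly unlikely to collide on the same cache entry.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch: give the synthetic "spark-submit-parent" module
// descriptor a unique revision per invocation so concurrent resolutions
// never write (and half-write) the same Ivy cache file.
public class UniqueRevision {
    private static final AtomicLong COUNTER = new AtomicLong();

    static String uniqueRevision() {
        // Timestamp disambiguates across processes; the counter disambiguates
        // within one process.
        return "1.0-" + System.currentTimeMillis() + "-" + COUNTER.incrementAndGet();
    }

    public static void main(String[] args) {
        System.out.println(uniqueRevision());
        System.out.println(uniqueRevision());
    }
}
```

The more robust alternative Josh prefers, fully separate resolution caches, amounts to giving each concurrent invocation its own Ivy directory (e.g. by pointing the {{spark.jars.ivy}} configuration at a per-invocation temp directory).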
[jira] [Assigned] (SPARK-20566) ColumnVector should support `appendFloats` for array
[ https://issues.apache.org/jira/browse/SPARK-20566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20566: Assignee: (was: Apache Spark) > ColumnVector should support `appendFloats` for array > > > Key: SPARK-20566 > URL: https://issues.apache.org/jira/browse/SPARK-20566 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0 >Reporter: Dongjoon Hyun > > This issue aims to add a missing `appendFloats` API for array into > ColumnVector class. For double type, there is `appendDoubles` for array > [here|https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java#L818-L824]. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20566) ColumnVector should support `appendFloats` for array
[ https://issues.apache.org/jira/browse/SPARK-20566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20566: Assignee: Apache Spark > ColumnVector should support `appendFloats` for array > > > Key: SPARK-20566 > URL: https://issues.apache.org/jira/browse/SPARK-20566 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark > > This issue aims to add a missing `appendFloats` API for array into > ColumnVector class. For double type, there is `appendDoubles` for array > [here|https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java#L818-L824]. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20566) ColumnVector should support `appendFloats` for array
[ https://issues.apache.org/jira/browse/SPARK-20566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15993753#comment-15993753 ] Apache Spark commented on SPARK-20566: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/17836 > ColumnVector should support `appendFloats` for array > > > Key: SPARK-20566 > URL: https://issues.apache.org/jira/browse/SPARK-20566 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0 >Reporter: Dongjoon Hyun > > This issue aims to add a missing `appendFloats` API for array into > ColumnVector class. For double type, there is `appendDoubles` for array > [here|https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java#L818-L824]. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20566) ColumnVector should support `appendFloats` for array
[ https://issues.apache.org/jira/browse/SPARK-20566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-20566: -- Description: This issue aims to add a missing `appendFloats` API for array into ColumnVector class. For double type, there is `appendDoubles` for array [here|https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java#L818-L824]. (was: This PR aims to add a missing `appendFloats` API for array into ColumnVector class. For double type, there is `appendDoubles` for array [here|https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java#L818-L824].) > ColumnVector should support `appendFloats` for array > > > Key: SPARK-20566 > URL: https://issues.apache.org/jira/browse/SPARK-20566 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0 >Reporter: Dongjoon Hyun > > This issue aims to add a missing `appendFloats` API for array into > ColumnVector class. For double type, there is `appendDoubles` for array > [here|https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java#L818-L824]. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20557) JdbcUtils doesn't support java.sql.Types.TIMESTAMP_WITH_TIMEZONE
[ https://issues.apache.org/jira/browse/SPARK-20557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15993745#comment-15993745 ] Apache Spark commented on SPARK-20557: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/17835 > JdbcUtils doesn't support java.sql.Types.TIMESTAMP_WITH_TIMEZONE > > > Key: SPARK-20557 > URL: https://issues.apache.org/jira/browse/SPARK-20557 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0, 2.3.0 >Reporter: Jannik Arndt > Labels: easyfix, jdbc, oracle, sql, timestamp > Original Estimate: 2h > Remaining Estimate: 2h > > Reading from an Oracle DB table with a column of type TIMESTAMP WITH TIME > ZONE via jdbc ({{spark.sqlContext.read.format("jdbc").option(...).load()}}) > results in an error: > {{Unsupported type -101}} > {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$getCatalystType(JdbcUtils.scala:209)}} > {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$5.apply(JdbcUtils.scala:246)}} > {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$5.apply(JdbcUtils.scala:246)}} > That is because the type > {{[java.sql.Types.TIMESTAMP_WITH_TIMEZONE|https://docs.oracle.com/javase/8/docs/api/java/sql/Types.html#TIMESTAMP_WITH_TIMEZONE]}} > (in Java since 1.8) is missing in > {{[JdbcUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L225]}} > > This is similar to SPARK-7039. > I created a pull request with a fix. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20566) ColumnVector should support `appendFloats` for array
Dongjoon Hyun created SPARK-20566: - Summary: ColumnVector should support `appendFloats` for array Key: SPARK-20566 URL: https://issues.apache.org/jira/browse/SPARK-20566 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0, 2.0.0 Reporter: Dongjoon Hyun This PR aims to add a missing `appendFloats` API for array into ColumnVector class. For double type, there is `appendDoubles` for array [here|https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java#L818-L824]. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
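For readers unfamiliar with the ColumnVector internals, the requested {{appendFloats}} mirrors the existing {{appendDoubles}}: copy a slice of a primitive array into the vector's storage after reserving capacity. The simplified stand-in below illustrates the shape of the API; it is not the real ColumnVector code.

```java
// Simplified stand-in for a columnar vector supporting array-oriented float
// appends, mirroring the existing appendDoubles signature
// (length, source array, source offset).
public class FloatVector {
    private float[] data = new float[4];
    private int elementsAppended = 0;

    // Appends src[offset .. offset+length) and returns the index where the
    // slice begins, as the real append* methods do.
    public int appendFloats(int length, float[] src, int offset) {
        reserve(elementsAppended + length);
        System.arraycopy(src, offset, data, elementsAppended, length);
        int result = elementsAppended;
        elementsAppended += length;
        return result;
    }

    private void reserve(int capacity) {
        // Double the backing array until the requested capacity fits.
        while (data.length < capacity) {
            data = java.util.Arrays.copyOf(data, data.length * 2);
        }
    }

    public float get(int i) { return data[i]; }
    public int size() { return elementsAppended; }
}
```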
[jira] [Updated] (SPARK-20565) Improve the error message for unsupported JDBC types
[ https://issues.apache.org/jira/browse/SPARK-20565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-20565: Description: For unsupported data types, we simply output the type number instead of the type name. {noformat} java.sql.SQLException: Unsupported type 2014 {noformat} We should improve it by outputting its name. was: For unsupported data types, we simply output the type number instead of the type name. {noformat} java.sql.SQLException: Unsupported type 2014 {noformat} > Improve the error message for unsupported JDBC types > > > Key: SPARK-20565 > URL: https://issues.apache.org/jira/browse/SPARK-20565 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > > For unsupported data types, we simply output the type number instead of the > type name. > {noformat} > java.sql.SQLException: Unsupported type 2014 > {noformat} > We should improve it by outputting its name. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
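The improvement SPARK-20565 asks for can be sketched by resolving the numeric code back to its {{java.sql.Types}} constant name via reflection; notably, 2014 is exactly {{java.sql.Types.TIMESTAMP_WITH_TIMEZONE}}, the type SPARK-20557 is about. This is an illustrative helper, not Spark's actual JdbcUtils code.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.sql.Types;

// Resolve a JDBC type code to the name of the matching java.sql.Types
// constant, so error messages can say TIMESTAMP_WITH_TIMEZONE instead of 2014.
public class JdbcTypeName {
    static String typeName(int code) {
        for (Field f : Types.class.getFields()) {
            try {
                if (Modifier.isStatic(f.getModifiers()) && f.getInt(null) == code) {
                    return f.getName();
                }
            } catch (IllegalAccessException ignored) {
                // Public static fields of java.sql.Types are always accessible.
            }
        }
        // Vendor-specific codes (e.g. Oracle's -101 from SPARK-20557) have no
        // standard name, so fall back to the raw number.
        return "unknown (" + code + ")";
    }

    public static void main(String[] args) {
        System.out.println(typeName(2014)); // TIMESTAMP_WITH_TIMEZONE
        System.out.println(typeName(-101));
    }
}
```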
[jira] [Created] (SPARK-20565) Improve the error message for unsupported JDBC types
Xiao Li created SPARK-20565: --- Summary: Improve the error message for unsupported JDBC types Key: SPARK-20565 URL: https://issues.apache.org/jira/browse/SPARK-20565 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Xiao Li Assignee: Xiao Li For unsupported data types, we simply output the type number instead of the type name. {noformat} java.sql.SQLException: Unsupported type 2014 {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20529) Worker should not use the received Master address
[ https://issues.apache.org/jira/browse/SPARK-20529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu reassigned SPARK-20529: Assignee: Shixiong Zhu > Worker should not use the received Master address > - > > Key: SPARK-20529 > URL: https://issues.apache.org/jira/browse/SPARK-20529 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Right now when worker connects to master, master will send its address to the > worker. Then worker will save this address and use it to reconnect in case of > failure. > However, sometimes, this address is not correct. If there is a proxy between > master and worker, the address master sent is not the address of proxy. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20529) Worker should not use the received Master address
[ https://issues.apache.org/jira/browse/SPARK-20529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-20529: - Affects Version/s: 2.2.0 1.6.3 2.0.2 > Worker should not use the received Master address > - > > Key: SPARK-20529 > URL: https://issues.apache.org/jira/browse/SPARK-20529 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Right now when worker connects to master, master will send its address to the > worker. Then worker will save this address and use it to reconnect in case of > failure. > However, sometimes, this address is not correct. If there is a proxy between > master and worker, the address master sent is not the address of proxy. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20531) Spark master shouldn't send its address back to the workers over proxied connections
[ https://issues.apache.org/jira/browse/SPARK-20531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-20531. -- Resolution: Duplicate > Spark master shouldn't send its address back to the workers over proxied > connections > > > Key: SPARK-20531 > URL: https://issues.apache.org/jira/browse/SPARK-20531 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.0 >Reporter: Sameer Agarwal > > Currently, when a Spark worker connects to Spark master, the master sends its > address back to the worker (as part of the {{RegisteredWorker}} message). The > worker then saves this address and uses it to talk to the master. The reason > behind this handshake protocol is that if the master goes down, a new master > can always send back {{RegisteredWorker}} messages to all the workers with > its (new) IP address. > However, if there is a proxy between the master and worker, this > unfortunately ends up bypassing the proxy. A simple fix here is that we > should encode the "destination address" in the {{RegisterWorker}} message so it can > then be sent back to the worker as part of the {{RegisteredWorker}} > message. > cc [~zsxwing] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
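The proposed fix amounts to the master echoing back the address the worker actually dialed (possibly a proxy) rather than its own bind address. A hypothetical helper, not Spark's RPC code:

```java
// Sketch of the SPARK-20531/SPARK-20529 idea: pick the reconnect address the
// worker should save. If the worker reached the master through a proxy, the
// address it dialed is the proxy, and reconnecting there keeps the proxy in
// the path. Names are illustrative.
public class ReconnectAddress {
    static String choose(String addressWorkerDialed, String masterBindAddress) {
        // Prefer the address the worker actually used; fall back to the
        // master's own address (e.g. after a master failover).
        return addressWorkerDialed != null ? addressWorkerDialed : masterBindAddress;
    }
}
```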
[jira] [Updated] (SPARK-20436) NullPointerException when restart from checkpoint file
[ https://issues.apache.org/jira/browse/SPARK-20436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-20436: - Issue Type: Question (was: Bug) > NullPointerException when restart from checkpoint file > -- > > Key: SPARK-20436 > URL: https://issues.apache.org/jira/browse/SPARK-20436 > Project: Spark > Issue Type: Question > Components: DStreams >Affects Versions: 1.5.0 >Reporter: fangfengbin > > I have written a Spark Streaming application which have two DStreams. > Code is : > {code} > object KafkaTwoInkfk { > def main(args: Array[String]) { > val Array(checkPointDir, brokers, topic1, topic2, batchSize) = args > val ssc = StreamingContext.getOrCreate(checkPointDir, () => > createContext(args)) > ssc.start() > ssc.awaitTermination() > } > def createContext(args : Array[String]) : StreamingContext = { > val Array(checkPointDir, brokers, topic1, topic2, batchSize) = args > val sparkConf = new SparkConf().setAppName("KafkaWordCount") > val ssc = new StreamingContext(sparkConf, Seconds(batchSize.toLong)) > ssc.checkpoint(checkPointDir) > val topicArr1 = topic1.split(",") > val topicSet1 = topicArr1.toSet > val topicArr2 = topic2.split(",") > val topicSet2 = topicArr2.toSet > val kafkaParams = Map[String, String]( > "metadata.broker.list" -> brokers > ) > val lines1 = KafkaUtils.createDirectStream[String, String, StringDecoder, > StringDecoder](ssc, kafkaParams, topicSet1) > val words1 = lines1.map(_._2).flatMap(_.split(" ")) > val wordCounts1 = words1.map(x => { > (x, 1L)}).reduceByKey(_ + _) > wordCounts1.print() > val lines2 = KafkaUtils.createDirectStream[String, String, StringDecoder, > StringDecoder](ssc, kafkaParams, topicSet2) > val words2 = lines1.map(_._2).flatMap(_.split(" ")) > val wordCounts2 = words2.map(x => { > (x, 1L)}).reduceByKey(_ + _) > wordCounts2.print() > return ssc > } > } > {code} > when restart from checkpoint file, it throw NullPointerException: > java.lang.NullPointerException > at > 
org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply$mcV$sp(DStreamCheckpointData.scala:126) > at > org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply(DStreamCheckpointData.scala:124) > at > org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply(DStreamCheckpointData.scala:124) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1291) > at > org.apache.spark.streaming.dstream.DStreamCheckpointData.writeObject(DStreamCheckpointData.scala:124) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply$mcV$sp(DStream.scala:528) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply(DStream.scala:523) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply(DStream.scala:523) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1291) > at > org.apache.spark.streaming.dstream.DStream.writeObject(DStream.scala:523) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at
[jira] [Commented] (SPARK-20436) NullPointerException when restart from checkpoint file
[ https://issues.apache.org/jira/browse/SPARK-20436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15993663#comment-15993663 ] Shixiong Zhu commented on SPARK-20436: -- [~ffbin] the issue is this line {{val words2 = lines1.map(_._2).flatMap(_.split(" "))}}. You should replace {{lines1}} with {{lines2}}. Otherwise, {{lines2}} won't be used and cause this error. > NullPointerException when restart from checkpoint file > -- > > Key: SPARK-20436 > URL: https://issues.apache.org/jira/browse/SPARK-20436 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.5.0 >Reporter: fangfengbin > > I have written a Spark Streaming application which have two DStreams. > Code is : > {code} > object KafkaTwoInkfk { > def main(args: Array[String]) { > val Array(checkPointDir, brokers, topic1, topic2, batchSize) = args > val ssc = StreamingContext.getOrCreate(checkPointDir, () => > createContext(args)) > ssc.start() > ssc.awaitTermination() > } > def createContext(args : Array[String]) : StreamingContext = { > val Array(checkPointDir, brokers, topic1, topic2, batchSize) = args > val sparkConf = new SparkConf().setAppName("KafkaWordCount") > val ssc = new StreamingContext(sparkConf, Seconds(batchSize.toLong)) > ssc.checkpoint(checkPointDir) > val topicArr1 = topic1.split(",") > val topicSet1 = topicArr1.toSet > val topicArr2 = topic2.split(",") > val topicSet2 = topicArr2.toSet > val kafkaParams = Map[String, String]( > "metadata.broker.list" -> brokers > ) > val lines1 = KafkaUtils.createDirectStream[String, String, StringDecoder, > StringDecoder](ssc, kafkaParams, topicSet1) > val words1 = lines1.map(_._2).flatMap(_.split(" ")) > val wordCounts1 = words1.map(x => { > (x, 1L)}).reduceByKey(_ + _) > wordCounts1.print() > val lines2 = KafkaUtils.createDirectStream[String, String, StringDecoder, > StringDecoder](ssc, kafkaParams, topicSet2) > val words2 = lines1.map(_._2).flatMap(_.split(" ")) > val wordCounts2 = words2.map(x => { > (x, 
1L)}).reduceByKey(_ + _) > wordCounts2.print() > return ssc > } > } > {code} > when restart from checkpoint file, it throw NullPointerException: > java.lang.NullPointerException > at > org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply$mcV$sp(DStreamCheckpointData.scala:126) > at > org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply(DStreamCheckpointData.scala:124) > at > org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply(DStreamCheckpointData.scala:124) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1291) > at > org.apache.spark.streaming.dstream.DStreamCheckpointData.writeObject(DStreamCheckpointData.scala:124) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply$mcV$sp(DStream.scala:528) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply(DStream.scala:523) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply(DStream.scala:523) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1291) > at > org.apache.spark.streaming.dstream.DStream.writeObject(DStream.scala:523) > at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) >
[jira] [Resolved] (SPARK-20436) NullPointerException when restart from checkpoint file
[ https://issues.apache.org/jira/browse/SPARK-20436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-20436. -- Resolution: Not A Problem > NullPointerException when restart from checkpoint file > -- > > Key: SPARK-20436 > URL: https://issues.apache.org/jira/browse/SPARK-20436 > Project: Spark > Issue Type: Question > Components: DStreams >Affects Versions: 1.5.0 >Reporter: fangfengbin > > I have written a Spark Streaming application which have two DStreams. > Code is : > {code} > object KafkaTwoInkfk { > def main(args: Array[String]) { > val Array(checkPointDir, brokers, topic1, topic2, batchSize) = args > val ssc = StreamingContext.getOrCreate(checkPointDir, () => > createContext(args)) > ssc.start() > ssc.awaitTermination() > } > def createContext(args : Array[String]) : StreamingContext = { > val Array(checkPointDir, brokers, topic1, topic2, batchSize) = args > val sparkConf = new SparkConf().setAppName("KafkaWordCount") > val ssc = new StreamingContext(sparkConf, Seconds(batchSize.toLong)) > ssc.checkpoint(checkPointDir) > val topicArr1 = topic1.split(",") > val topicSet1 = topicArr1.toSet > val topicArr2 = topic2.split(",") > val topicSet2 = topicArr2.toSet > val kafkaParams = Map[String, String]( > "metadata.broker.list" -> brokers > ) > val lines1 = KafkaUtils.createDirectStream[String, String, StringDecoder, > StringDecoder](ssc, kafkaParams, topicSet1) > val words1 = lines1.map(_._2).flatMap(_.split(" ")) > val wordCounts1 = words1.map(x => { > (x, 1L)}).reduceByKey(_ + _) > wordCounts1.print() > val lines2 = KafkaUtils.createDirectStream[String, String, StringDecoder, > StringDecoder](ssc, kafkaParams, topicSet2) > val words2 = lines1.map(_._2).flatMap(_.split(" ")) > val wordCounts2 = words2.map(x => { > (x, 1L)}).reduceByKey(_ + _) > wordCounts2.print() > return ssc > } > } > {code} > when restart from checkpoint file, it throw NullPointerException: > java.lang.NullPointerException > at > 
org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply$mcV$sp(DStreamCheckpointData.scala:126) > at > org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply(DStreamCheckpointData.scala:124) > at > org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply(DStreamCheckpointData.scala:124) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1291) > at > org.apache.spark.streaming.dstream.DStreamCheckpointData.writeObject(DStreamCheckpointData.scala:124) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply$mcV$sp(DStream.scala:528) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply(DStream.scala:523) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply(DStream.scala:523) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1291) > at > org.apache.spark.streaming.dstream.DStream.writeObject(DStream.scala:523) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at
[jira] [Updated] (SPARK-20547) ExecutorClassLoader's findClass may not work correctly when a task is cancelled.
[ https://issues.apache.org/jira/browse/SPARK-20547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-20547: - Target Version/s: (was: 2.2.0) > ExecutorClassLoader's findClass may not work correctly when a task is > cancelled. > > > Key: SPARK-20547 > URL: https://issues.apache.org/jira/browse/SPARK-20547 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.1.0, 2.2.0 >Reporter: Shixiong Zhu >Priority: Minor > > ExecutorClassLoader's findClass may throw some transient exception. For > example, when a task is cancelled, if ExecutorClassLoader is running, you may > see InterruptedException or IOException, even if this class can be loaded. > Then the result of findClass will be cached by JVM, and later when the same > class is being loaded (note: in this case, this class may be still loadable), > it will just throw NoClassDefFoundError. > We should make ExecutorClassLoader retry on transient exceptions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20547) ExecutorClassLoader's findClass may not work correctly when a task is cancelled.
[ https://issues.apache.org/jira/browse/SPARK-20547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-20547: - Priority: Minor (was: Blocker) > ExecutorClassLoader's findClass may not work correctly when a task is > cancelled. > > > Key: SPARK-20547 > URL: https://issues.apache.org/jira/browse/SPARK-20547 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.1.0, 2.2.0 >Reporter: Shixiong Zhu >Priority: Minor > > ExecutorClassLoader's findClass may throw some transient exception. For > example, when a task is cancelled, if ExecutorClassLoader is running, you may > see InterruptedException or IOException, even if this class can be loaded. > Then the result of findClass will be cached by JVM, and later when the same > class is being loaded (note: in this case, this class may be still loadable), > it will just throw NoClassDefFoundError. > We should make ExecutorClassLoader retry on transient exceptions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
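The retry behavior proposed for ExecutorClassLoader can be sketched generically: treat I/O errors and interruption as transient and retry before letting the JVM cache a spurious {{NoClassDefFoundError}}. The helper below is hypothetical, not the actual Spark code.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Retry a class-loading-style operation on transient exceptions
// (IOException, InterruptedException), rethrowing the last failure
// once the attempt budget is exhausted.
public class RetryLoader {
    static <T> T withRetries(Callable<T> body, int attempts) throws Exception {
        Exception last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return body.call();
            } catch (IOException | InterruptedException e) {
                last = e; // transient: try again
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulate a load that fails transiently twice, then succeeds.
        String result = withRetries(() -> {
            if (++calls[0] < 3) throw new IOException("transient");
            return "loaded";
        }, 5);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```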
[jira] [Created] (SPARK-20564) a lot of executor failures when the executor number is more than 2000
Hua Liu created SPARK-20564: --- Summary: a lot of executor failures when the executor number is more than 2000 Key: SPARK-20564 URL: https://issues.apache.org/jira/browse/SPARK-20564 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 2.1.0, 1.6.2 Reporter: Hua Liu When we ran a Spark application with more than 2000 executors, we noticed that a large number of executors could not connect to the driver and were consequently marked as failed. In some cases, the number of failed executors reached twice the requested executor count, causing applications to retry and sometimes fail outright. This happens because YarnAllocator requests all missing containers every spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, YarnAllocator can ask for and receive 2000 containers in one request and then launch them all. These thousands of executors then try to retrieve Spark props and register with the driver. However, the driver handles executor registration, stop, removal, and Spark props retrieval in a single thread, and it cannot handle such a large number of RPCs within a short period of time. As a result, some executors fail to retrieve Spark props and/or register; they are then marked as failed, which triggers executor removal, further overloads the driver, and causes still more executor failures. This patch adds an extra configuration, spark.yarn.launchContainer.count.simultaneously, which caps the number of containers the driver can request and launch in each spark.yarn.scheduler.heartbeat.interval-ms interval. As a result, the number of executors grows steadily, executor failures are reduced, and applications reach the desired number of executors faster. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
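The throttling idea in SPARK-20564 reduces, in essence, to capping each heartbeat's container request so executors register gradually instead of in one burst. A minimal sketch with hypothetical names (only the configuration name spark.yarn.launchContainer.count.simultaneously comes from the report itself):

```java
// Cap the number of containers requested per allocator heartbeat; the
// simulation in main shows the ramp-up this produces for a 2000-executor job.
public class ContainerThrottle {
    static int nextRequest(int missing, int maxPerHeartbeat) {
        return Math.min(missing, maxPerHeartbeat);
    }

    public static void main(String[] args) {
        int target = 2000, running = 0, heartbeats = 0;
        while (running < target) {
            // Each heartbeat requests at most 200 containers instead of all
            // 2000 at once, spreading driver-side registrations over time.
            running += nextRequest(target - running, 200);
            heartbeats++;
        }
        System.out.println(heartbeats); // 10 heartbeats to reach 2000 executors
    }
}
```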
[jira] [Comment Edited] (SPARK-20556) codehaus fails to generate code because of unescaped strings
[ https://issues.apache.org/jira/browse/SPARK-20556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15993464#comment-15993464 ] Herman van Hovell edited comment on SPARK-20556 at 5/2/17 6:34 PM: --- [~vlyubin] This was fixed in SPARK-18952. This should be part of the coming 2.1.1 release. was (Author: hvanhovell): [~vlyubin] This was fixed in SPARK-18952, lets just back port that change to 2.1. > codehaus fails to generate code because of unescaped strings > > > Key: SPARK-20556 > URL: https://issues.apache.org/jira/browse/SPARK-20556 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Volodymyr Lyubinets > Fix For: 2.2.0 > > > I guess somewhere along the way Spark uses codehaus to generate optimized > code, but if it fails to do so, it falls back to an alternative way. Here's a > log string that I see when executing one command on dataframes: > 17/05/02 12:00:14 ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 93, Column 13: ')' expected instead of 'type' > ... 
> /* 088 */ private double loadFactor = 0.5; > /* 089 */ private int numBuckets = (int) (capacity / loadFactor); > /* 090 */ private int maxSteps = 2; > /* 091 */ private int numRows = 0; > /* 092 */ private org.apache.spark.sql.types.StructType keySchema = new > org.apache.spark.sql.types.StructType().add("taxonomyPayload", > org.apache.spark.sql.types.DataTypes.StringType) > /* 093 */ .add("{"type":"nontemporal"}", > org.apache.spark.sql.types.DataTypes.StringType) > /* 094 */ .add("spatialPayload", > org.apache.spark.sql.types.DataTypes.StringType); > /* 095 */ private org.apache.spark.sql.types.StructType valueSchema = new > org.apache.spark.sql.types.StructType().add("sum", > org.apache.spark.sql.types.DataTypes.DoubleType); > /* 096 */ private Object emptyVBase; > /* 097 */ private long emptyVOff; > /* 098 */ private int emptyVLen; > /* 099 */ private boolean isBatchFull = false; > /* 100 */ > It looks like on line 93 it failed to escape that string (that happened to be > in my code). I'm not sure how critical this is, but seems like there's > escaping missing somewhere. > Stack trace that happens afterwards: https://pastebin.com/NmgTfwN0 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7481) Add spark-hadoop-cloud module to pull in object store support
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15993465#comment-15993465 ] Apache Spark commented on SPARK-7481: - User 'steveloughran' has created a pull request for this issue: https://github.com/apache/spark/pull/17834 > Add spark-hadoop-cloud module to pull in object store support > - > > Key: SPARK-7481 > URL: https://issues.apache.org/jira/browse/SPARK-7481 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.1.0 >Reporter: Steve Loughran > > To keep the s3n classpath right, to add s3a, swift & azure, the dependencies > of spark in a 2.6+ profile need to add the relevant object store packages > (hadoop-aws, hadoop-openstack, hadoop-azure) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20556) codehaus fails to generate code because of unescaped strings
[ https://issues.apache.org/jira/browse/SPARK-20556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15993464#comment-15993464 ] Herman van Hovell commented on SPARK-20556: --- [~vlyubin] This was fixed in SPARK-18952, lets just back port that change to 2.1. > codehaus fails to generate code because of unescaped strings > > > Key: SPARK-20556 > URL: https://issues.apache.org/jira/browse/SPARK-20556 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Volodymyr Lyubinets > Fix For: 2.2.0 > > > I guess somewhere along the way Spark uses codehaus to generate optimized > code, but if it fails to do so, it falls back to an alternative way. Here's a > log string that I see when executing one command on dataframes: > 17/05/02 12:00:14 ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 93, Column 13: ')' expected instead of 'type' > ... > /* 088 */ private double loadFactor = 0.5; > /* 089 */ private int numBuckets = (int) (capacity / loadFactor); > /* 090 */ private int maxSteps = 2; > /* 091 */ private int numRows = 0; > /* 092 */ private org.apache.spark.sql.types.StructType keySchema = new > org.apache.spark.sql.types.StructType().add("taxonomyPayload", > org.apache.spark.sql.types.DataTypes.StringType) > /* 093 */ .add("{"type":"nontemporal"}", > org.apache.spark.sql.types.DataTypes.StringType) > /* 094 */ .add("spatialPayload", > org.apache.spark.sql.types.DataTypes.StringType); > /* 095 */ private org.apache.spark.sql.types.StructType valueSchema = new > org.apache.spark.sql.types.StructType().add("sum", > org.apache.spark.sql.types.DataTypes.DoubleType); > /* 096 */ private Object emptyVBase; > /* 097 */ private long emptyVOff; > /* 098 */ private int emptyVLen; > /* 099 */ private boolean isBatchFull = false; > /* 100 */ > It looks like on line 93 it failed to escape that string (that happened to be > in my code). 
I'm not sure how critical this is, but seems like there's > escaping missing somewhere. > Stack trace that happens afterwards: https://pastebin.com/NmgTfwN0 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-20556) codehaus fails to generate code because of unescaped strings
[ https://issues.apache.org/jira/browse/SPARK-20556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell closed SPARK-20556. - Resolution: Duplicate Fix Version/s: 2.2.0 > codehaus fails to generate code because of unescaped strings > > > Key: SPARK-20556 > URL: https://issues.apache.org/jira/browse/SPARK-20556 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Volodymyr Lyubinets > Fix For: 2.2.0 > > > I guess somewhere along the way Spark uses codehaus to generate optimized > code, but if it fails to do so, it falls back to an alternative way. Here's a > log string that I see when executing one command on dataframes: > 17/05/02 12:00:14 ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 93, Column 13: ')' expected instead of 'type' > ... > /* 088 */ private double loadFactor = 0.5; > /* 089 */ private int numBuckets = (int) (capacity / loadFactor); > /* 090 */ private int maxSteps = 2; > /* 091 */ private int numRows = 0; > /* 092 */ private org.apache.spark.sql.types.StructType keySchema = new > org.apache.spark.sql.types.StructType().add("taxonomyPayload", > org.apache.spark.sql.types.DataTypes.StringType) > /* 093 */ .add("{"type":"nontemporal"}", > org.apache.spark.sql.types.DataTypes.StringType) > /* 094 */ .add("spatialPayload", > org.apache.spark.sql.types.DataTypes.StringType); > /* 095 */ private org.apache.spark.sql.types.StructType valueSchema = new > org.apache.spark.sql.types.StructType().add("sum", > org.apache.spark.sql.types.DataTypes.DoubleType); > /* 096 */ private Object emptyVBase; > /* 097 */ private long emptyVOff; > /* 098 */ private int emptyVLen; > /* 099 */ private boolean isBatchFull = false; > /* 100 */ > It looks like on line 93 it failed to escape that string (that happened to be > in my code). I'm not sure how critical this is, but seems like there's > escaping missing somewhere. 
> Stack trace that happens afterwards: https://pastebin.com/NmgTfwN0 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20556) codehaus fails to generate code because of unescaped strings
[ https://issues.apache.org/jira/browse/SPARK-20556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-20556: -- Component/s: (was: Optimizer) SQL > codehaus fails to generate code because of unescaped strings > > > Key: SPARK-20556 > URL: https://issues.apache.org/jira/browse/SPARK-20556 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Volodymyr Lyubinets > > I guess somewhere along the way Spark uses codehaus to generate optimized > code, but if it fails to do so, it falls back to an alternative way. Here's a > log string that I see when executing one command on dataframes: > 17/05/02 12:00:14 ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 93, Column 13: ')' expected instead of 'type' > ... > /* 088 */ private double loadFactor = 0.5; > /* 089 */ private int numBuckets = (int) (capacity / loadFactor); > /* 090 */ private int maxSteps = 2; > /* 091 */ private int numRows = 0; > /* 092 */ private org.apache.spark.sql.types.StructType keySchema = new > org.apache.spark.sql.types.StructType().add("taxonomyPayload", > org.apache.spark.sql.types.DataTypes.StringType) > /* 093 */ .add("{"type":"nontemporal"}", > org.apache.spark.sql.types.DataTypes.StringType) > /* 094 */ .add("spatialPayload", > org.apache.spark.sql.types.DataTypes.StringType); > /* 095 */ private org.apache.spark.sql.types.StructType valueSchema = new > org.apache.spark.sql.types.StructType().add("sum", > org.apache.spark.sql.types.DataTypes.DoubleType); > /* 096 */ private Object emptyVBase; > /* 097 */ private long emptyVOff; > /* 098 */ private int emptyVLen; > /* 099 */ private boolean isBatchFull = false; > /* 100 */ > It looks like on line 93 it failed to escape that string (that happened to be > in my code). I'm not sure how critical this is, but seems like there's > escaping missing somewhere. 
> Stack trace that happens afterwards: https://pastebin.com/NmgTfwN0 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20547) ExecutorClassLoader's findClass may not work correctly when a task is cancelled.
[ https://issues.apache.org/jira/browse/SPARK-20547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15993426#comment-15993426 ] Shixiong Zhu commented on SPARK-20547: -- Did some investigation using the reproducer. Looks like it’s not a class loader issue. It’s because the class initialization result will be cached. My current proposal is recreating a new class loader if a task fails. > ExecutorClassLoader's findClass may not work correctly when a task is > cancelled. > > > Key: SPARK-20547 > URL: https://issues.apache.org/jira/browse/SPARK-20547 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.1.0, 2.2.0 >Reporter: Shixiong Zhu >Priority: Blocker > > ExecutorClassLoader's findClass may throw some transient exception. For > example, when a task is cancelled, if ExecutorClassLoader is running, you may > see InterruptedException or IOException, even if this class can be loaded. > Then the result of findClass will be cached by JVM, and later when the same > class is being loaded (note: in this case, this class may be still loadable), > it will just throw NoClassDefFoundError. > We should make ExecutorClassLoader retry on transient exceptions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
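The caching pitfall behind this report can be illustrated with a small language-agnostic sketch (plain Python, purely illustrative; the class and function names are hypothetical and this is not the ExecutorClassLoader code): only definitive results should be cached, while transient failures propagate uncached so a later attempt can still succeed.

```python
class TransientLoadError(Exception):
    """Stand-in for the InterruptedException/IOException seen when a task
    is cancelled while a class is being fetched."""


class RetryFriendlyLoader:
    """Illustration of the fix direction discussed in SPARK-20547.

    The JVM effectively caches the outcome of a failed load, turning a
    transient failure into a permanent NoClassDefFoundError. Here, only
    successful loads are cached; a transient failure is re-raised without
    being recorded, so the next attempt can still load the class.
    """

    def __init__(self, fetch):
        self._fetch = fetch   # name -> bytes; may raise TransientLoadError
        self._cache = {}

    def load(self, name):
        if name in self._cache:
            return self._cache[name]
        # No caching of failures: a TransientLoadError simply propagates,
        # leaving the cache untouched for the retry.
        result = self._fetch(name)
        self._cache[name] = result
        return result
```

A first call that hits a cancelled-task exception fails, but a retry against the same loader can still load the class, because the failure was never memoized.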
[jira] [Commented] (SPARK-20370) create external table on read only location fails
[ https://issues.apache.org/jira/browse/SPARK-20370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15993405#comment-15993405 ] Steve Loughran commented on SPARK-20370: Is this the bit under the PR tagged "!! HACK ALERT !!" by any chance? If so, it seems to have gone in for a Hive metastore workaround. I wonder if there is/can be a solution in Hive-land. > create external table on read only location fails > - > > Key: SPARK-20370 > URL: https://issues.apache.org/jira/browse/SPARK-20370 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0 > Environment: EMR 5.4.0, hadoop 2.7.3, spark 2.1.0 >Reporter: Gaurav Shah >Priority: Minor > > Creating an external table via the following fails: > sqlContext.createExternalTable( > "table_name", > "org.apache.spark.sql.parquet", > inputSchema, > Map( > "path" -> "s3a://bucket-name/folder", > "mergeSchema" -> "false" > ) > ) > In the following commit, Spark tries to check whether it has write access to the given > location; the check fails, and so the table metadata creation fails: > https://github.com/apache/spark/pull/13270/files > The table creation script works even if the cluster has read-only access in Spark > 1.6, but fails in Spark 2.0
[jira] [Commented] (SPARK-20560) Review Spark's handling of filesystems returning "localhost" in getFileBlockLocations
[ https://issues.apache.org/jira/browse/SPARK-20560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15993323#comment-15993323 ] Steve Loughran commented on SPARK-20560: To follow this up, I've now got a test which verifies that (a) s3a returns "localhost" and (b) spark discards it. This'll catch any regressions in the s3a client.
{code}
val source = CSV_TESTFILE.get
val fs = getFilesystem(source)
val blockLocations = fs.getFileBlockLocations(source, 0, 1)
assert(1 === blockLocations.length, s"block location array size wrong: ${blockLocations}")
val hosts = blockLocations(0).getHosts
assert(1 === hosts.length, s"wrong host size ${hosts}")
assert("localhost" === hosts(0), "hostname")

val path = source.toString
val rdd = sc.hadoopFile[LongWritable, Text, TextInputFormat](path, 1)
val input = rdd.asInstanceOf[HadoopRDD[_, _]]
val partitions = input.getPartitions
val locations = input.getPreferredLocations(partitions.head)
assert(locations.isEmpty, s"Location list not empty ${locations}")
{code}
> Review Spark's handling of filesystems returning "localhost" in > getFileBlockLocations > - > > Key: SPARK-20560 > URL: https://issues.apache.org/jira/browse/SPARK-20560 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Steve Loughran >Priority: Minor > > Some filesystems (S3a, Azure WASB) return "localhost" as the response to > {{FileSystem.getFileBlockLocations(path)}}. If this is then used as the > preferred host when scheduling work, there's a risk that work will be queued > on one host, rather than spread across the cluster. > HIVE-14060 and TEZ-3291 have both seen it in their schedulers. > I don't know if Spark does it; someone needs to look at the code, maybe write > some tests
[jira] [Created] (SPARK-20563) going from DataFrame to RDD and back changes the schema, if the schema is not explicitly provided
Danil Kirsanov created SPARK-20563: -- Summary: going from DataFrame to RDD and back changes the schema, if the schema is not explicitly provided Key: SPARK-20563 URL: https://issues.apache.org/jira/browse/SPARK-20563 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.1.0 Reporter: Danil Kirsanov Priority: Minor df.rdd.toDF() converts a DataFrame's IntegerType columns to LongType if the schema is not explicitly provided to toDF(). Below is a full reproduction:
{code}
from pyspark.sql.types import IntegerType, StructType, StructField

schema = StructType([StructField("a", IntegerType(), True), StructField("b", IntegerType(), True)])
df_test = spark.createDataFrame([(1, 2)], schema)
df_test.printSchema()
df_test.rdd.toDF().printSchema()
{code}
[jira] [Updated] (SPARK-20562) Support Maintenance by having a threshold for unavailability
[ https://issues.apache.org/jira/browse/SPARK-20562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kamal Gurala updated SPARK-20562: - Description: Make Spark be aware of offers that have an unavailability period set because of a scheduled Maintenance on the node. Have a configurable option that's a threshold which ensures that tasks are not scheduled on offers that are within a threshold for maintenance was: Make Spark be aware of offers that have an unavailability period set because of a scheduled Maintenance on the node. Have a configurable option that's a threshold which ensures that tasks are not scheduled on offers that are within a threshold for maintenance > Support Maintenance by having a threshold for unavailability > > > Key: SPARK-20562 > URL: https://issues.apache.org/jira/browse/SPARK-20562 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.1.0 >Reporter: Kamal Gurala > > Make Spark be aware of offers that have an unavailability period set because > of a scheduled Maintenance on the node. > Have a configurable option that's a threshold which ensures that tasks are > not scheduled on offers that are within a threshold for maintenance -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20562) Support Maintenance by having a threshold for unavailability
Kamal Gurala created SPARK-20562: Summary: Support Maintenance by having a threshold for unavailability Key: SPARK-20562 URL: https://issues.apache.org/jira/browse/SPARK-20562 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 2.1.0 Reporter: Kamal Gurala Make Spark aware of offers that have an unavailability period set because of scheduled maintenance on the node. Add a configurable threshold that ensures tasks are not scheduled on offers that fall within the maintenance window.
[jira] [Resolved] (SPARK-20560) Review Spark's handling of filesystems returning "localhost" in getFileBlockLocations
[ https://issues.apache.org/jira/browse/SPARK-20560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran resolved SPARK-20560. Resolution: Invalid "localhost" is filtered, been done in {{HadoopRDD.getPreferredLocations()}} since commit #06aac8a. > Review Spark's handling of filesystems returning "localhost" in > getFileBlockLocations > - > > Key: SPARK-20560 > URL: https://issues.apache.org/jira/browse/SPARK-20560 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Steve Loughran >Priority: Minor > > Some filesystems (S3a, Azure WASB) return "localhost" as the response to > {{FileSystem.getFileBlockLocations(path)}}. If this is then used as the > preferred host when scheduling work, there's a risk that work will be queued > on one host, rather than spread across the cluster. > HIVE-14060 and TEZ-3291 have both seen it in their schedulers. > I don't know if Spark does it, someone needs to look at the code, maybe write > some tests -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20556) codehaus fails to generate code because of unescaped strings
[ https://issues.apache.org/jira/browse/SPARK-20556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15993023#comment-15993023 ] Volodymyr Lyubinets commented on SPARK-20556: - Here's an edited version of the code that reproduces this:
{code}
import org.apache.spark.sql.functions.sum
import org.apache.spark.sql.functions.{col, lit}

val NON_TEMPORAL: String = "{\"type\":\"nontemporal\"}"
val joined = spark.sqlContext.read.parquet(BLAH)
val scores = joined.withColumn("temp", lit(NON_TEMPORAL)).select(col("x1"), col("temp"), col("x2"), col("x3"))
val results = scores.groupBy(col("x1"), col("temp"), col("x3")).agg(BLAH).take(1000)
{code}
> codehaus fails to generate code because of unescaped strings > > > Key: SPARK-20556 > URL: https://issues.apache.org/jira/browse/SPARK-20556 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.1.0 >Reporter: Volodymyr Lyubinets > > I guess somewhere along the way Spark uses codehaus to generate optimized > code, but if it fails to do so, it falls back to an alternative way. Here's a > log string that I see when executing one command on dataframes: > 17/05/02 12:00:14 ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 93, Column 13: ')' expected instead of 'type' > ... 
> /* 088 */ private double loadFactor = 0.5; > /* 089 */ private int numBuckets = (int) (capacity / loadFactor); > /* 090 */ private int maxSteps = 2; > /* 091 */ private int numRows = 0; > /* 092 */ private org.apache.spark.sql.types.StructType keySchema = new > org.apache.spark.sql.types.StructType().add("taxonomyPayload", > org.apache.spark.sql.types.DataTypes.StringType) > /* 093 */ .add("{"type":"nontemporal"}", > org.apache.spark.sql.types.DataTypes.StringType) > /* 094 */ .add("spatialPayload", > org.apache.spark.sql.types.DataTypes.StringType); > /* 095 */ private org.apache.spark.sql.types.StructType valueSchema = new > org.apache.spark.sql.types.StructType().add("sum", > org.apache.spark.sql.types.DataTypes.DoubleType); > /* 096 */ private Object emptyVBase; > /* 097 */ private long emptyVOff; > /* 098 */ private int emptyVLen; > /* 099 */ private boolean isBatchFull = false; > /* 100 */ > It looks like on line 93 it failed to escape that string (that happened to be > in my code). I'm not sure how critical this is, but seems like there's > escaping missing somewhere. > Stack trace that happens afterwards: https://pastebin.com/NmgTfwN0 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20561) Running SparkR with no RHive installed in secured environment
Natalie created SPARK-20561: --- Summary: Running SparkR with no RHive installed in secured environment Key: SPARK-20561 URL: https://issues.apache.org/jira/browse/SPARK-20561 Project: Spark Issue Type: Question Components: Examples, Input/Output Affects Versions: 2.1.0 Environment: Hadoop, Spark Reporter: Natalie I need to start running data mining analysis in a secured environment (the IP, port, and database name are given), where Spark runs on Hive tables. I have installed R, SparkR, dplyr, and some other R libraries. I understand that I need to point SparkR to that database (with the IP/port/name). What should my R code be? I start by launching R, then loading the SparkR library. Next I write sc <- sparkR.init(), and it immediately reports: spark-submit: command not found. Do I need to have RHive installed first? Or should I point somehow to the Spark library and to that database? I couldn't find any documentation on that. Thank you
[jira] [Commented] (SPARK-20560) Review Spark's handling of filesystems returning "localhost" in getFileBlockLocations
[ https://issues.apache.org/jira/browse/SPARK-20560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15993008#comment-15993008 ] Steve Loughran commented on SPARK-20560: {{FileSystem.getFileBlockLocations(path)}} is only invoked from {{HdfsUtils.getFileSegmentLocations}}, and is used as a source of data for {{RDD.preferredLocations}}. I don't see anything explicit in the code that detects & reacts to the FS call returning localhost; I'll do some tests downstream to see what surfaces against S3. Unless the scheduler has some explicit "localhost -> anywhere" mapping, it might make sense for HdfsUtils.getFileSegmentLocations to downgrade "localhost" to None, on the basis that in a cluster FS, the data clearly doesn't know where it is. > Review Spark's handling of filesystems returning "localhost" in > getFileBlockLocations > - > > Key: SPARK-20560 > URL: https://issues.apache.org/jira/browse/SPARK-20560 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Steve Loughran >Priority: Minor > > Some filesystems (S3a, Azure WASB) return "localhost" as the response to > {{FileSystem.getFileBlockLocations(path)}}. If this is then used as the > preferred host when scheduling work, there's a risk that work will be queued > on one host, rather than spread across the cluster. > HIVE-14060 and TEZ-3291 have both seen it in their schedulers. > I don't know if Spark does it; someone needs to look at the code, maybe write > some tests
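The suggested downgrade can be sketched in a few lines (illustrative Python, not Spark's Scala; the function name is hypothetical):

```python
def preferred_hosts(block_hosts):
    """Filter placeholder hostnames out of a block-location host list.

    Object stores such as S3A and Azure WASB report "localhost" because the
    data has no real locality; taken literally, that would funnel work onto
    one node. Returning an empty list means "no locality preference", letting
    the scheduler place the task anywhere.
    """
    return [h for h in block_hosts if h != "localhost"]

# An S3A-style response yields no preference; HDFS-style hosts pass through.
print(preferred_hosts(["localhost"]))       # []
print(preferred_hosts(["node1", "node2"]))  # ['node1', 'node2']
```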
[jira] [Created] (SPARK-20560) Review Spark's handling of filesystems returning "localhost" in getFileBlockLocations
Steve Loughran created SPARK-20560: -- Summary: Review Spark's handling of filesystems returning "localhost" in getFileBlockLocations Key: SPARK-20560 URL: https://issues.apache.org/jira/browse/SPARK-20560 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 2.1.0 Reporter: Steve Loughran Priority: Minor Some filesystems (S3a, Azure WASB) return "localhost" as the response to {{FileSystem.getFileBlockLocations(path)}}. If this is then used as the preferred host when scheduling work, there's a risk that work will be queued on one host, rather than spread across the cluster. HIVE-14060 and TEZ-3291 have both seen it in their schedulers. I don't know if Spark does it, someone needs to look at the code, maybe write some tests -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20559) Refreshing a cached RDD without restarting the Spark application
[ https://issues.apache.org/jira/browse/SPARK-20559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20559. --- Resolution: Invalid This should go to u...@spark.apache.org > Refreshing a cached RDD without restarting the Spark application > > > Key: SPARK-20559 > URL: https://issues.apache.org/jira/browse/SPARK-20559 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Jayesh lalwani > > We have a Structured Streaming application that gets accounts from Kafka into > a streaming data frame. We have a blacklist of accounts stored in S3 and we > want to filter out all the accounts that are blacklisted. So, we are loading > the blacklisted accounts into a batch data frame and joining it with the > streaming data frame to filter out the bad accounts. > Now, the blacklist doesn't change very often.. once a week at max. SO, we > wanted to cache the blacklist data frame to prevent going out to S3 > everytime. Since, the blacklist might change, we want to be able to refresh > the cache at a cadence, without restarting the whole app. > So, to begin with we wrote a simple app that caches and refreshes a simple > data frame. The steps we followed are > * Create a CSV file > * load CSV into a DF: df = spark.read.csv(filename) > * Persist the data frame: df.persist > * Now when we do df.show, we see the contents of the csv. > * We change the CSV, and call df.show, we can see that the old contents are > being displayed, proving that the df is cached > * df.unpersist > * df.persist > * df.show > * What we see is that the rows that were modified in the CSV are reloaded.. > But new rows aren't > Is this expected behavior? Is there a better way to refresh cached data > without restarting the Spark application? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20559) Refreshing a cached RDD without restarting the Spark application
Jayesh lalwani created SPARK-20559: -- Summary: Refreshing a cached RDD without restarting the Spark application Key: SPARK-20559 URL: https://issues.apache.org/jira/browse/SPARK-20559 Project: Spark Issue Type: Question Components: Spark Core Affects Versions: 2.1.0 Reporter: Jayesh lalwani We have a Structured Streaming application that gets accounts from Kafka into a streaming data frame. We have a blacklist of accounts stored in S3 and we want to filter out all the accounts that are blacklisted. So we load the blacklisted accounts into a batch data frame and join it with the streaming data frame to filter out the bad accounts. Now, the blacklist doesn't change very often, once a week at most. So we wanted to cache the blacklist data frame to avoid going out to S3 every time. Since the blacklist might change, we want to be able to refresh the cache on a cadence, without restarting the whole app. To begin with, we wrote a simple app that caches and refreshes a simple data frame. The steps we followed are:
* Create a CSV file
* Load the CSV into a DF: df = spark.read.csv(filename)
* Persist the data frame: df.persist
* Now when we do df.show, we see the contents of the CSV.
* We change the CSV and call df.show; the old contents are still displayed, proving that the df is cached.
* df.unpersist
* df.persist
* df.show
* What we see is that rows that were modified in the CSV are reloaded, but new rows aren't.
Is this expected behavior? Is there a better way to refresh cached data without restarting the Spark application?
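The refresh pattern being reached for in the steps above can be sketched without Spark at all (plain Python with hypothetical names; `load` stands in for spark.read.csv): rather than unpersist()/persist() on the same frame, re-read the source into a new object on each refresh and swap it in.

```python
class RefreshableView:
    """Sketch of a refresh pattern for a cached lookup dataset.

    Instead of unpersist()/persist() on the same DataFrame (which, as the
    report above shows, can resurrect partially stale data), each refresh
    builds a *new* snapshot from the source and replaces the cached one.
    Purely illustrative stand-in for the Spark workflow.
    """

    def __init__(self, load):
        self._load = load
        self._cached = load()        # initial materialization ("persist")

    def current(self):
        return self._cached          # served from cache, no source access

    def refresh(self):
        self._cached = self._load()  # re-read the source, swap the snapshot
```

Between refreshes the cached snapshot is served unchanged even if the source mutates; a refresh picks up both modified and newly added rows, because nothing from the old snapshot is reused.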
[jira] [Commented] (SPARK-19582) DataFrameReader conceptually inadequate
[ https://issues.apache.org/jira/browse/SPARK-19582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992985#comment-15992985 ] Steve Loughran commented on SPARK-19582: All spark is doing is taking a URL to data, mapping that to an FS implementation classname, and expecting that to implement the methods in `org.apache.hadoop.FileSystem` so as to provide FS-like behaviour. Given that minio is nominally an S3 clone, it sounds like there's a problem here setting up the Hadoop S3A client to bind to it. I'd isolate that to the Hadoop code before going near Spark, test on Hadoop 2.8, and file bugs against Hadoop and/or minio if there are problems. AFAIK, nobody has run the Hadoop S3A [tests|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/testing.md] against minio; doing that and documenting how to configure the client would be a welcome contribution. If minio is 100% S3 compatible (v2/v4 auth + multipart PUT; encryption optional), then the S3A client should work with it...it could work as another integration test for minio. > DataFrameReader conceptually inadequate > --- > > Key: SPARK-19582 > URL: https://issues.apache.org/jira/browse/SPARK-19582 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.1.0 >Reporter: James Q. Arnold > > DataFrameReader assumes it "understands" all data sources (local file system, > object stores, jdbc, ...). This seems limiting in the long term, imposing > both development costs to accept new sources and dependency issues for > existing sources (how to coordinate the XX jar for internal use vs. the XX > jar used by the application). Unless I have missed how this can be done > currently, an application with an unsupported data source cannot create the > required RDD for distribution. > I recommend at least providing a text API for supplying data. Let the > application provide data as a String (or char[] or ...)---not a path, but the > actual data. 
Alternatively, provide interfaces or abstract classes the > application could provide to let the application handle external data > sources, without forcing all that complication into the Spark implementation. > I don't have any code to submit, but JIRA seemed like the most appropriate > place to raise the issue. > Finally, if I have overlooked how this can be done with the current API, a > new example would be appreciated. > Additional detail... > We use the minio object store, which provides an API compatible with AWS-S3. > A few configuration/parameter values differ for minio, but one can use the > AWS library in the application to connect to the minio server. > When trying to use minio objects through spark, the s3://xxx paths are > intercepted by spark and handed to hadoop. So far, I have been unable to > find the right combination of configuration values and parameters to > "convince" hadoop to forward the right information to work with minio. If I > could read the minio object in the application, and then hand the object > contents directly to spark, I could bypass hadoop and solve the problem. > Unfortunately, the underlying spark design prevents that. So, I see two > problems. > - Spark seems to have taken on the responsibility of "knowing" the API > details of all data sources. This seems iffy in the long run (and is the > root of my current problem). In the long run, it seems unwise to assume that > spark should understand all possible path names, protocols, etc. Moreover, > passing S3 paths to hadoop seems a little odd (why not go directly to AWS, > for example). This particular confusion about S3 shows the difficulties that > are bound to occur. > - Second, spark appears not to have a way to bypass the path name > interpretation. At the least, spark could provide a text/blob interface, > letting the application supply the data object and avoid path interpretation > inside spark. Alternatively, spark could accept a reader/stream/... 
to build > the object, again letting the application provide the implementation of the > object input. > As I mentioned above, I might be missing something in the API that lets us > work around the problem. I'll keep looking, but the API as apparently > structured seems too limiting. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
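Following Steve's suggestion to debug the Hadoop S3A client directly, one way to experiment is to point S3A at an S3-compatible endpoint such as minio via configuration. A sketch only: the endpoint URL, bucket, and credentials below are placeholders, and the `fs.s3a.*` keys are settings of the hadoop-aws module (`fs.s3a.path.style.access` requires Hadoop 2.8+).

```shell
# Hypothetical minio endpoint and credentials -- substitute your own.
./bin/spark-shell \
  --conf spark.hadoop.fs.s3a.endpoint=http://minio.example.com:9000 \
  --conf spark.hadoop.fs.s3a.access.key=MINIO_ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=MINIO_SECRET_KEY \
  --conf spark.hadoop.fs.s3a.path.style.access=true

# Then read with an s3a:// URL (not s3://), so the request goes through
# the S3A connector rather than the legacy s3 filesystem:
#   spark.read.text("s3a://some-bucket/some-object")
```

If this binding works, it would also address the reporter's immediate problem without any Spark API change.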
[jira] [Commented] (SPARK-20548) Flaky Test: ReplSuite.newProductSeqEncoder with REPL defined class
[ https://issues.apache.org/jira/browse/SPARK-20548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992955#comment-15992955 ] Wenchen Fan commented on SPARK-20548: - I'll re-enable this test after https://github.com/apache/spark/pull/17833 is merged > Flaky Test: ReplSuite.newProductSeqEncoder with REPL defined class > --- > > Key: SPARK-20548 > URL: https://issues.apache.org/jira/browse/SPARK-20548 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Sameer Agarwal >Assignee: Sameer Agarwal > Fix For: 2.2.0 > > > {{newProductSeqEncoder with REPL defined class}} in {{ReplSuite}} has been > failing non-deterministically: https://spark-tests.appspot.com/failed-tests > over the last few days. > https://spark.test.databricks.com/job/spark-master-test-sbt-hadoop-2.7/176/testReport/junit/org.apache.spark.repl/ReplSuite/newProductSeqEncoder_with_REPL_defined_class/history/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20555) Incorrect handling of Oracle's decimal types via JDBC
[ https://issues.apache.org/jira/browse/SPARK-20555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Feher updated SPARK-20555: Description: When querying an Oracle database, Spark maps some Oracle numeric data types to incorrect Catalyst data types: 1. DECIMAL(1) becomes BooleanType In Oracle, a DECIMAL(1) can have values from -9 to 9. In Spark now, values larger than 1 become the boolean value true. 2. DECIMAL(3,2) becomes IntegerType In Oracle, a DECIMAL(3,2) can have values like 1.23 In Spark now, digits after the decimal point are dropped. 3. DECIMAL(10) becomes IntegerType In Oracle, a DECIMAL(10) can have the value 9999999999 (ten nines), which is more than 2^31 Spark throws an exception: "java.sql.SQLException: Numeric Overflow" I think the best solution is to always keep Oracle's decimal types. (In theory we could introduce a FloatType in some case of #2, and fix #3 by only introducing IntegerType for DECIMAL(9). But in my opinion, that would end up complicated and error-prone.) Note: I think the above problems were introduced as part of https://github.com/apache/spark/pull/14377 The main purpose of that PR seems to be converting Spark types to correct Oracle types, and that part seems good to me. But it also adds the inverse conversions. As it turns out in the above examples, that is not possible. was: When querying an Oracle database, Spark maps some Oracle numeric data types to incorrect Catalyst data types: 1. DECIMAL(1) becomes BooleanType In Oracle, a DECIMAL(1) can have values from -9 to 9. In Spark now, values larger than 1 become the boolean value true. 2. DECIMAL(3,2) becomes IntegerType In Oracle, a DECIMAL(3,2) can have values like 1.23 In Spark now, digits after the decimal point are dropped. 3. 
DECIMAL(10) becomes IntegerType In Oracle, a DECIMAL(10) can have the value 9999999999 (ten nines), which is more than 2^31 Spark throws an exception: "java.sql.SQLException: Numeric Overflow" I think the best solution is to always keep Oracle's decimal types. (In theory we could introduce a FloatType in some case of #2, and fix #3 by only introducing IntegerType for DECIMAL(9). But in my opinion, that would end up complicated and error-prone.) Note: I think the above problems were introduced as part of https://github.com/apache/spark/pull/14377/files The main purpose of that PR seems to be converting Spark types to correct Oracle types, and that part seems good to me. But it also adds the inverse conversions. As it turns out in the above examples, that is not possible. > Incorrect handling of Oracle's decimal types via JDBC > - > > Key: SPARK-20555 > URL: https://issues.apache.org/jira/browse/SPARK-20555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Gabor Feher > > When querying an Oracle database, Spark maps some Oracle numeric data types > to incorrect Catalyst data types: > 1. DECIMAL(1) becomes BooleanType > In Oracle, a DECIMAL(1) can have values from -9 to 9. > In Spark now, values larger than 1 become the boolean value true. > 2. DECIMAL(3,2) becomes IntegerType > In Oracle, a DECIMAL(3,2) can have values like 1.23 > In Spark now, digits after the decimal point are dropped. > 3. DECIMAL(10) becomes IntegerType > In Oracle, a DECIMAL(10) can have the value 9999999999 (ten nines), which is > more than 2^31 > Spark throws an exception: "java.sql.SQLException: Numeric Overflow" > I think the best solution is to always keep Oracle's decimal types. (In > theory we could introduce a FloatType in some case of #2, and fix #3 by only > introducing IntegerType for DECIMAL(9). But in my opinion, that would end up > complicated and error-prone.) 
> Note: I think the above problems were introduced as part of > https://github.com/apache/spark/pull/14377 > The main purpose of that PR seems to be converting Spark types to correct > Oracle types, and that part seems good to me. But it also adds the inverse > conversions. As it turns out in the above examples, that is not possible. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
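To make point 3 concrete, here is a small standalone sketch (plain Java, no Spark or Oracle involved) of why a ten-digit Oracle DECIMAL cannot be mapped to Catalyst's 32-bit IntegerType:

```java
import java.math.BigDecimal;

public class DecimalOverflowDemo {
    public static void main(String[] args) {
        // Largest value an Oracle DECIMAL(10) column can hold: ten nines.
        BigDecimal tenNines = new BigDecimal("9999999999");

        // It exceeds Integer.MAX_VALUE (2^31 - 1 = 2147483647)...
        System.out.println(tenNines.compareTo(BigDecimal.valueOf(Integer.MAX_VALUE)) > 0);

        // ...so a checked narrowing conversion to int must fail, which is the
        // same class of failure that surfaces from JDBC as "Numeric Overflow".
        try {
            tenNines.intValueExact();
            System.out.println("fits in int");
        } catch (ArithmeticException e) {
            System.out.println("overflow");
        }
    }
}
```

This also illustrates the reporter's aside: only DECIMAL(9) or narrower is guaranteed to fit in an int, since 999999999 < 2^31 - 1.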
[jira] [Updated] (SPARK-20555) Incorrect handling of Oracle's decimal types via JDBC
[ https://issues.apache.org/jira/browse/SPARK-20555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Feher updated SPARK-20555: Description: When querying an Oracle database, Spark maps some Oracle numeric data types to incorrect Catalyst data types: 1. DECIMAL(1) becomes BooleanType In Oracle, a DECIMAL(1) can have values from -9 to 9. In Spark now, values larger than 1 become the boolean value true. 2. DECIMAL(3,2) becomes IntegerType In Oracle, a DECIMAL(3,2) can have values like 1.23 In Spark now, digits after the decimal point are dropped. 3. DECIMAL(10) becomes IntegerType In Oracle, a DECIMAL(10) can have the value 9999999999 (ten nines), which is more than 2^31 Spark throws an exception: "java.sql.SQLException: Numeric Overflow" I think the best solution is to always keep Oracle's decimal types. (In theory we could introduce a FloatType in some case of #2, and fix #3 by only introducing IntegerType for DECIMAL(9). But in my opinion, that would end up complicated and error-prone.) Note: I think the above problems were introduced as part of https://github.com/apache/spark/pull/14377/files The main purpose of that PR seems to be converting Spark types to correct Oracle types, and that part seems good to me. But it also adds the inverse conversions. As it turns out in the above examples, that is not possible. was: When querying an Oracle database, Spark maps some Oracle numeric data types to incorrect Catalyst data types: 1. DECIMAL(1) becomes BooleanType In Oracle, a DECIMAL(1) can have values from -9 to 9. In Spark now, values larger than 1 become the boolean value true. 2. DECIMAL(3,2) becomes IntegerType In Oracle, a DECIMAL(3,2) can have values like 1.23 In Spark now, digits after the decimal point are dropped. 3. 
DECIMAL(10) becomes IntegerType In Oracle, a DECIMAL(10) can have the value 9999999999 (ten nines), which is more than 2^31 Spark throws an exception: "java.sql.SQLException: Numeric Overflow" I think the best solution is to always keep Oracle's decimal types. (In theory we could introduce a FloatType in some case of #2, and fix #3 by only introducing IntegerType for DECIMAL(9). But in my opinion, that would end up complicated and error-prone.) Note: I think the above problems were introduced as part of The main purpose of that PR seems to be converting Spark types to correct Oracle types, and that part seems good to me. But it also adds the inverse conversions. As it turns out in the above examples, that is not possible. > Incorrect handling of Oracle's decimal types via JDBC > - > > Key: SPARK-20555 > URL: https://issues.apache.org/jira/browse/SPARK-20555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Gabor Feher > > When querying an Oracle database, Spark maps some Oracle numeric data types > to incorrect Catalyst data types: > 1. DECIMAL(1) becomes BooleanType > In Oracle, a DECIMAL(1) can have values from -9 to 9. > In Spark now, values larger than 1 become the boolean value true. > 2. DECIMAL(3,2) becomes IntegerType > In Oracle, a DECIMAL(3,2) can have values like 1.23 > In Spark now, digits after the decimal point are dropped. > 3. DECIMAL(10) becomes IntegerType > In Oracle, a DECIMAL(10) can have the value 9999999999 (ten nines), which is > more than 2^31 > Spark throws an exception: "java.sql.SQLException: Numeric Overflow" > I think the best solution is to always keep Oracle's decimal types. (In > theory we could introduce a FloatType in some case of #2, and fix #3 by only > introducing IntegerType for DECIMAL(9). But in my opinion, that would end up > complicated and error-prone.) 
> Note: I think the above problems were introduced as part of > https://github.com/apache/spark/pull/14377/files > The main purpose of that PR seems to be converting Spark types to correct > Oracle types, and that part seems good to me. But it also adds the inverse > conversions. As it turns out in the above examples, that is not possible. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20558) clear InheritableThreadLocal variables in SparkContext when stopping it
[ https://issues.apache.org/jira/browse/SPARK-20558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20558: Assignee: Apache Spark (was: Wenchen Fan) > clear InheritableThreadLocal variables in SparkContext when stopping it > --- > > Key: SPARK-20558 > URL: https://issues.apache.org/jira/browse/SPARK-20558 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.2, 2.1.0, 2.2.0 >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20558) clear InheritableThreadLocal variables in SparkContext when stopping it
[ https://issues.apache.org/jira/browse/SPARK-20558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992941#comment-15992941 ] Apache Spark commented on SPARK-20558: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/17833 > clear InheritableThreadLocal variables in SparkContext when stopping it > --- > > Key: SPARK-20558 > URL: https://issues.apache.org/jira/browse/SPARK-20558 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.2, 2.1.0, 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20558) clear InheritableThreadLocal variables in SparkContext when stopping it
[ https://issues.apache.org/jira/browse/SPARK-20558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20558: Assignee: Wenchen Fan (was: Apache Spark) > clear InheritableThreadLocal variables in SparkContext when stopping it > --- > > Key: SPARK-20558 > URL: https://issues.apache.org/jira/browse/SPARK-20558 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.2, 2.1.0, 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20557) JdbcUtils doesn't support java.sql.Types.TIMESTAMP_WITH_TIMEZONE
[ https://issues.apache.org/jira/browse/SPARK-20557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992922#comment-15992922 ] Apache Spark commented on SPARK-20557: -- User 'JannikArndt' has created a pull request for this issue: https://github.com/apache/spark/pull/17832 > JdbcUtils doesn't support java.sql.Types.TIMESTAMP_WITH_TIMEZONE > > > Key: SPARK-20557 > URL: https://issues.apache.org/jira/browse/SPARK-20557 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0, 2.3.0 >Reporter: Jannik Arndt > Labels: easyfix, jdbc, oracle, sql, timestamp > Original Estimate: 2h > Remaining Estimate: 2h > > Reading from an Oracle DB table with a column of type TIMESTAMP WITH TIME > ZONE via jdbc ({{spark.sqlContext.read.format("jdbc").option(...).load()}}) > results in an error: > {{Unsupported type -101}} > {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$getCatalystType(JdbcUtils.scala:209)}} > {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$5.apply(JdbcUtils.scala:246)}} > {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$5.apply(JdbcUtils.scala:246)}} > That is because the type > {{[java.sql.Types.TIMESTAMP_WITH_TIMEZONE|https://docs.oracle.com/javase/8/docs/api/java/sql/Types.html#TIMESTAMP_WITH_TIMEZONE]}} > (in Java since 1.8) is missing in > {{[JdbcUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L225]}} > > This is similar to SPARK-7039. > I created a pull request with a fix. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20557) JdbcUtils doesn't support java.sql.Types.TIMESTAMP_WITH_TIMEZONE
[ https://issues.apache.org/jira/browse/SPARK-20557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20557: Assignee: (was: Apache Spark) > JdbcUtils doesn't support java.sql.Types.TIMESTAMP_WITH_TIMEZONE > > > Key: SPARK-20557 > URL: https://issues.apache.org/jira/browse/SPARK-20557 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0, 2.3.0 >Reporter: Jannik Arndt > Labels: easyfix, jdbc, oracle, sql, timestamp > Original Estimate: 2h > Remaining Estimate: 2h > > Reading from an Oracle DB table with a column of type TIMESTAMP WITH TIME > ZONE via jdbc ({{spark.sqlContext.read.format("jdbc").option(...).load()}}) > results in an error: > {{Unsupported type -101}} > {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$getCatalystType(JdbcUtils.scala:209)}} > {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$5.apply(JdbcUtils.scala:246)}} > {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$5.apply(JdbcUtils.scala:246)}} > That is because the type > {{[java.sql.Types.TIMESTAMP_WITH_TIMEZONE|https://docs.oracle.com/javase/8/docs/api/java/sql/Types.html#TIMESTAMP_WITH_TIMEZONE]}} > (in Java since 1.8) is missing in > {{[JdbcUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L225]}} > > This is similar to SPARK-7039. > I created a pull request with a fix. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20557) JdbcUtils doesn't support java.sql.Types.TIMESTAMP_WITH_TIMEZONE
[ https://issues.apache.org/jira/browse/SPARK-20557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20557: Assignee: Apache Spark > JdbcUtils doesn't support java.sql.Types.TIMESTAMP_WITH_TIMEZONE > > > Key: SPARK-20557 > URL: https://issues.apache.org/jira/browse/SPARK-20557 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0, 2.3.0 >Reporter: Jannik Arndt >Assignee: Apache Spark > Labels: easyfix, jdbc, oracle, sql, timestamp > Original Estimate: 2h > Remaining Estimate: 2h > > Reading from an Oracle DB table with a column of type TIMESTAMP WITH TIME > ZONE via jdbc ({{spark.sqlContext.read.format("jdbc").option(...).load()}}) > results in an error: > {{Unsupported type -101}} > {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$getCatalystType(JdbcUtils.scala:209)}} > {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$5.apply(JdbcUtils.scala:246)}} > {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$5.apply(JdbcUtils.scala:246)}} > That is because the type > {{[java.sql.Types.TIMESTAMP_WITH_TIMEZONE|https://docs.oracle.com/javase/8/docs/api/java/sql/Types.html#TIMESTAMP_WITH_TIMEZONE]}} > (in Java since 1.8) is missing in > {{[JdbcUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L225]}} > > This is similar to SPARK-7039. > I created a pull request with a fix. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20558) clear InheritableThreadLocal variables in SparkContext when stopping it
Wenchen Fan created SPARK-20558: --- Summary: clear InheritableThreadLocal variables in SparkContext when stopping it Key: SPARK-20558 URL: https://issues.apache.org/jira/browse/SPARK-20558 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.0, 2.0.2, 2.2.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20546) spark-class gets syntax error in posix mode
[ https://issues.apache.org/jira/browse/SPARK-20546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992886#comment-15992886 ] Jessie Yu commented on SPARK-20546: --- Given the current code relies on posix mode being off, turning it off explicitly shouldn't affect other behavior, especially since the change is confined to just the spark-class subshell. > spark-class gets syntax error in posix mode > --- > > Key: SPARK-20546 > URL: https://issues.apache.org/jira/browse/SPARK-20546 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.0.2 >Reporter: Jessie Yu >Priority: Minor > > spark-class gets the following error when running in posix mode: > {code} > spark-class: line 78: syntax error near unexpected token `<' > spark-class: line 78: `done < <(build_command "$@")' > {code} > \\ > It appears to be complaining about the process substitution: > {code} > CMD=() > while IFS= read -d '' -r ARG; do > CMD+=("$ARG") > done < <(build_command "$@") > {code} > \\ > This can be reproduced by first turning on allexport then posix mode: > {code}set -a -o posix {code} > then run something like spark-shell which calls spark-class. > \\ > The simplest fix is probably to always turn off posix mode in spark-class > before the while loop. > \\ > This was previously reported in > [SPARK-8417|https://issues.apache.org/jira/browse/SPARK-8417] which closed > with cannot reproduce. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
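The failing construct and the proposed guard can be reproduced in a few lines of bash. This is a sketch, not the actual spark-class code path: `printf` stands in for `build_command "$@"`, which emits NUL-terminated arguments.

```shell
#!/bin/bash
# Process substitution `< <(...)` is a bash extension that posix mode
# rejects, producing the "syntax error near unexpected token `<'" above.
# The proposed fix: turn posix mode off defensively before the loop.
set +o posix

CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(printf 'arg-one\0arg-two\0')   # stand-in for: build_command "$@"

echo "${#CMD[@]}"   # prints 2
```

Since spark-class runs in its own subshell, `set +o posix` there does not leak into the caller's shell, which matches the commenter's point that the change is confined.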
[jira] [Created] (SPARK-20557) JdbcUtils doesn't support java.sql.Types.TIMESTAMP_WITH_TIMEZONE
Jannik Arndt created SPARK-20557: Summary: JdbcUtils doesn't support java.sql.Types.TIMESTAMP_WITH_TIMEZONE Key: SPARK-20557 URL: https://issues.apache.org/jira/browse/SPARK-20557 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0, 2.3.0 Reporter: Jannik Arndt Reading from an Oracle DB table with a column of type TIMESTAMP WITH TIME ZONE via jdbc ({{spark.sqlContext.read.format("jdbc").option(...).load()}}) results in an error: {{Unsupported type -101}} {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$getCatalystType(JdbcUtils.scala:209)}} {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$5.apply(JdbcUtils.scala:246)}} {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$5.apply(JdbcUtils.scala:246)}} That is because the type {{[java.sql.Types.TIMESTAMP_WITH_TIMEZONE|https://docs.oracle.com/javase/8/docs/api/java/sql/Types.html#TIMESTAMP_WITH_TIMEZONE]}} (in Java since 1.8) is missing in {{[JdbcUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L225]}} This is similar to SPARK-7039. I created a pull request with a fix. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
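For reference, the standard JDBC type code involved can be checked in plain Java. Note the -101 in the error message is an Oracle driver-specific vendor code as reported above, not a `java.sql.Types` constant:

```java
import java.sql.Types;

public class TimestampTzCode {
    public static void main(String[] args) {
        // JDBC 4.2 (Java 8) added this standard code for TIMESTAMP WITH TIME ZONE.
        // A fix along the lines of the linked PR would need to map a code like
        // this to Catalyst's TimestampType in JdbcUtils.getCatalystType.
        System.out.println(Types.TIMESTAMP_WITH_TIMEZONE);  // 2014
    }
}
```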
[jira] [Commented] (SPARK-20395) Upgrade to Scala 2.11.11
[ https://issues.apache.org/jira/browse/SPARK-20395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992814#comment-15992814 ] Sean Owen commented on SPARK-20395: --- [~jeremyrsmith] I tried updating to Scala 2.11.11 today and it worked fine, including generating docs, with unidoc + genjavadoc 0.10. What issue should I be looking for? If not, this could go into master for 2.3. > Upgrade to Scala 2.11.11 > > > Key: SPARK-20395 > URL: https://issues.apache.org/jira/browse/SPARK-20395 > Project: Spark > Issue Type: Dependency upgrade > Components: Spark Core >Affects Versions: 2.0.2, 2.1.0 >Reporter: Jeremy Smith >Priority: Minor > Labels: dependencies, scala > > Update Scala to 2.11.11, which was released yesterday: > https://github.com/scala/scala/releases/tag/v2.11.11 > Since it's a patch version upgrade and binary compatibility is guaranteed, > impact should be minimal. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20556) codehaus fails to generate code because of unescaped strings
[ https://issues.apache.org/jira/browse/SPARK-20556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992813#comment-15992813 ] Sean Owen commented on SPARK-20556: --- Do you have a reproduction? > codehaus fails to generate code because of unescaped strings > > > Key: SPARK-20556 > URL: https://issues.apache.org/jira/browse/SPARK-20556 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.1.0 >Reporter: Volodymyr Lyubinets > > I guess somewhere along the way Spark uses codehaus to generate optimized > code, but if it fails to do so, it falls back to an alternative way. Here's a > log string that I see when executing one command on dataframes: > 17/05/02 12:00:14 ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 93, Column 13: ')' expected instead of 'type' > ... > /* 088 */ private double loadFactor = 0.5; > /* 089 */ private int numBuckets = (int) (capacity / loadFactor); > /* 090 */ private int maxSteps = 2; > /* 091 */ private int numRows = 0; > /* 092 */ private org.apache.spark.sql.types.StructType keySchema = new > org.apache.spark.sql.types.StructType().add("taxonomyPayload", > org.apache.spark.sql.types.DataTypes.StringType) > /* 093 */ .add("{"type":"nontemporal"}", > org.apache.spark.sql.types.DataTypes.StringType) > /* 094 */ .add("spatialPayload", > org.apache.spark.sql.types.DataTypes.StringType); > /* 095 */ private org.apache.spark.sql.types.StructType valueSchema = new > org.apache.spark.sql.types.StructType().add("sum", > org.apache.spark.sql.types.DataTypes.DoubleType); > /* 096 */ private Object emptyVBase; > /* 097 */ private long emptyVOff; > /* 098 */ private int emptyVLen; > /* 099 */ private boolean isBatchFull = false; > /* 100 */ > It looks like on line 93 it failed to escape that string (that happened to be > in my code). I'm not sure how critical this is, but seems like there's > escaping missing somewhere. 
> Stack trace that happens afterwards: https://pastebin.com/NmgTfwN0 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
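A hedged sketch of the escaping problem in plain Java (the helper below is illustrative, not Spark's actual codegen): splicing a field name that contains quotes directly into generated source reproduces the invalid line 93 above, while escaping quotes and backslashes first keeps the generated string literal well-formed.

```java
public class CodegenEscapeDemo {
    // Illustrative helper: escape backslashes and double quotes so a value
    // can be embedded safely inside a generated Java string literal.
    static String escapeForJavaLiteral(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        // The field name from the report, containing embedded double quotes.
        String column = "{\"type\":\"nontemporal\"}";

        // Naive splicing reproduces the broken generated code:
        //   .add("{"type":"nontemporal"}", ...)   <- not valid Java
        String bad = ".add(\"" + column + "\", ...)";
        System.out.println(bad);

        // Escaping first keeps the literal intact:
        //   .add("{\"type\":\"nontemporal\"}", ...)
        String good = ".add(\"" + escapeForJavaLiteral(column) + "\", ...)";
        System.out.println(good);
    }
}
```

Any column name containing a quote, backslash, or newline would trigger the same compile failure, so a reproduction should only need an aggregation over a DataFrame with such a field name.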
[jira] [Assigned] (SPARK-18777) Return UDF objects when registering from Python
[ https://issues.apache.org/jira/browse/SPARK-18777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18777: Assignee: (was: Apache Spark) > Return UDF objects when registering from Python > --- > > Key: SPARK-18777 > URL: https://issues.apache.org/jira/browse/SPARK-18777 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Reporter: holdenk > > In Scala when registering a UDF it gives you back a UDF object that you can > use in the Dataset/DataFrame API as well as with SQL expressions. We can do > the same in Python, for both Python UDFs and Java UDFs registered from Python. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18777) Return UDF objects when registering from Python
[ https://issues.apache.org/jira/browse/SPARK-18777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18777: Assignee: Apache Spark > Return UDF objects when registering from Python > --- > > Key: SPARK-18777 > URL: https://issues.apache.org/jira/browse/SPARK-18777 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Reporter: holdenk >Assignee: Apache Spark > > In Scala when registering a UDF it gives you back a UDF object that you can > use in the Dataset/DataFrame API as well as with SQL expressions. We can do > the same in Python, for both Python UDFs and Java UDFs registered from Python. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18777) Return UDF objects when registering from Python
[ https://issues.apache.org/jira/browse/SPARK-18777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992812#comment-15992812 ] Apache Spark commented on SPARK-18777: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/17831 > Return UDF objects when registering from Python > --- > > Key: SPARK-18777 > URL: https://issues.apache.org/jira/browse/SPARK-18777 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Reporter: holdenk > > In Scala when registering a UDF it gives you back a UDF object that you can > use in the Dataset/DataFrame API as well as with SQL expressions. We can do > the same in Python, for both Python UDFs and Java UDFs registered from Python. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20556) codehaus fails to generate code because of unescaped strings
[ https://issues.apache.org/jira/browse/SPARK-20556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Volodymyr Lyubinets updated SPARK-20556: Description: I guess somewhere along the way Spark uses codehaus to generate optimized code, but if it fails to do so, it falls back to an alternative way. Here's a log string that I see when executing one command on dataframes: 17/05/02 12:00:14 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 93, Column 13: ')' expected instead of 'type' ... /* 088 */ private double loadFactor = 0.5; /* 089 */ private int numBuckets = (int) (capacity / loadFactor); /* 090 */ private int maxSteps = 2; /* 091 */ private int numRows = 0; /* 092 */ private org.apache.spark.sql.types.StructType keySchema = new org.apache.spark.sql.types.StructType().add("taxonomyPayload", org.apache.spark.sql.types.DataTypes.StringType) /* 093 */ .add("{"type":"nontemporal"}", org.apache.spark.sql.types.DataTypes.StringType) /* 094 */ .add("spatialPayload", org.apache.spark.sql.types.DataTypes.StringType); /* 095 */ private org.apache.spark.sql.types.StructType valueSchema = new org.apache.spark.sql.types.StructType().add("sum", org.apache.spark.sql.types.DataTypes.DoubleType); /* 096 */ private Object emptyVBase; /* 097 */ private long emptyVOff; /* 098 */ private int emptyVLen; /* 099 */ private boolean isBatchFull = false; /* 100 */ It looks like on line 93 it failed to escape that string (that happened to be in my code). I'm not sure how critical this is, but seems like there's escaping missing somewhere. Stack trace that happens afterwards: https://pastebin.com/NmgTfwN0 was: I guess somewhere along the way Spark uses codehaus to generate optimized code, but if it fails to do so, it falls back to an alternative way. 
Here's a log string that I see when executing one command on dataframes: 17/05/02 12:00:14 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 93, Column 13: ')' expected instead of 'type' ... /* 088 */ private double loadFactor = 0.5; /* 089 */ private int numBuckets = (int) (capacity / loadFactor); /* 090 */ private int maxSteps = 2; /* 091 */ private int numRows = 0; /* 092 */ private org.apache.spark.sql.types.StructType keySchema = new org.apache.spark.sql.types.StructType().add("taxonomyPayload", org.apache.spark.sql.types.DataTypes.StringType) /* 093 */ .add("{"type":"nontemporal"}", org.apache.spark.sql.types.DataTypes.StringType) /* 094 */ .add("spatialPayload", org.apache.spark.sql.types.DataTypes.StringType); /* 095 */ private org.apache.spark.sql.types.StructType valueSchema = new org.apache.spark.sql.types.StructType().add("sum", org.apache.spark.sql.types.DataTypes.DoubleType); /* 096 */ private Object emptyVBase; /* 097 */ private long emptyVOff; /* 098 */ private int emptyVLen; /* 099 */ private boolean isBatchFull = false; /* 100 */ It looks like on line 93 it failed to escape that string (that happened to be in my code). I'm not sure how critical this is, but seems like there's escaping missing somewhere. > codehaus fails to generate code because of unescaped strings > > > Key: SPARK-20556 > URL: https://issues.apache.org/jira/browse/SPARK-20556 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.1.0 >Reporter: Volodymyr Lyubinets > > I guess somewhere along the way Spark uses codehaus to generate optimized > code, but if it fails to do so, it falls back to an alternative way. Here's a > log string that I see when executing one command on dataframes: > 17/05/02 12:00:14 ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 93, Column 13: ')' expected instead of 'type' > ... 
> /* 088 */ private double loadFactor = 0.5; > /* 089 */ private int numBuckets = (int) (capacity / loadFactor); > /* 090 */ private int maxSteps = 2; > /* 091 */ private int numRows = 0; > /* 092 */ private org.apache.spark.sql.types.StructType keySchema = new > org.apache.spark.sql.types.StructType().add("taxonomyPayload", > org.apache.spark.sql.types.DataTypes.StringType) > /* 093 */ .add("{"type":"nontemporal"}", > org.apache.spark.sql.types.DataTypes.StringType) > /* 094 */ .add("spatialPayload", > org.apache.spark.sql.types.DataTypes.StringType); > /* 095 */ private org.apache.spark.sql.types.StructType valueSchema = new > org.apache.spark.sql.types.StructType().add("sum", > org.apache.spark.sql.types.DataTypes.DoubleType); > /* 096 */ private Object emptyVBase; > /* 097 */ private
[jira] [Created] (SPARK-20556) codehaus fails to generate code because of unescaped strings
Volodymyr Lyubinets created SPARK-20556: --- Summary: codehaus fails to generate code because of unescaped strings Key: SPARK-20556 URL: https://issues.apache.org/jira/browse/SPARK-20556 Project: Spark Issue Type: Bug Components: Optimizer Affects Versions: 2.1.0 Reporter: Volodymyr Lyubinets I guess somewhere along the way Spark uses codehaus to generate optimized code, but if it fails to do so, it falls back to an alternative way. Here's a log string that I see when executing one command on dataframes: 17/05/02 12:00:14 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 93, Column 13: ')' expected instead of 'type' ... /* 088 */ private double loadFactor = 0.5; /* 089 */ private int numBuckets = (int) (capacity / loadFactor); /* 090 */ private int maxSteps = 2; /* 091 */ private int numRows = 0; /* 092 */ private org.apache.spark.sql.types.StructType keySchema = new org.apache.spark.sql.types.StructType().add("taxonomyPayload", org.apache.spark.sql.types.DataTypes.StringType) /* 093 */ .add("{"type":"nontemporal"}", org.apache.spark.sql.types.DataTypes.StringType) /* 094 */ .add("spatialPayload", org.apache.spark.sql.types.DataTypes.StringType); /* 095 */ private org.apache.spark.sql.types.StructType valueSchema = new org.apache.spark.sql.types.StructType().add("sum", org.apache.spark.sql.types.DataTypes.DoubleType); /* 096 */ private Object emptyVBase; /* 097 */ private long emptyVOff; /* 098 */ private int emptyVLen; /* 099 */ private boolean isBatchFull = false; /* 100 */ It looks like on line 93 it failed to escape that string (that happened to be in my code). I'm not sure how critical this is, but seems like there's escaping missing somewhere. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
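The root cause is that the generated Java embeds the column name verbatim inside a string literal, so the quotes in {{"{"type":"nontemporal"}"}} terminate the literal early. A minimal sketch of the escaping the codegen needs (illustrative Python, not Spark's actual codegen helper):

```python
def escape_java_string(s: str) -> str:
    # Escape characters that would terminate or corrupt a Java string
    # literal when splicing a value into generated source.
    # Backslashes must be escaped first, then quotes and newlines.
    return (s.replace("\\", "\\\\")
             .replace('"', '\\"')
             .replace("\n", "\\n")
             .replace("\r", "\\r"))

field_name = '{"type":"nontemporal"}'
# Unescaped, this value produces the invalid line 93 from the report;
# escaped, the generated string literal stays intact:
code_line = '.add("%s", DataTypes.StringType)' % escape_java_string(field_name)
```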
[jira] [Commented] (SPARK-3528) Reading data from file:/// should be called NODE_LOCAL not PROCESS_LOCAL
[ https://issues.apache.org/jira/browse/SPARK-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992745#comment-15992745 ] Ramgopal N commented on SPARK-3528: --- I have Spark running on Mesos. Mesos agents are running on node1,node2,node3 and datanodes on node4,node5 and node6. I see 3 executors running, one on each of the 3 Mesos agents. So in my case PROCESS_LOCAL and NODE_LOCAL are the same, I believe. Basically I am trying to check Spark SQL performance when there is no data locality. When I execute Spark SQL, all the tasks are showing as PROCESS_LOCAL. What is the importance of "spark.locality.wait.process" for Spark on Mesos? Is this configuration applicable for standalone Spark? > Reading data from file:/// should be called NODE_LOCAL not PROCESS_LOCAL > > > Key: SPARK-3528 > URL: https://issues.apache.org/jira/browse/SPARK-3528 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Ash >Priority: Critical > > Note that reading from {{file:///.../pom.xml}} is called a PROCESS_LOCAL task > {noformat} > spark> sc.textFile("pom.xml").count > ...
> 14/09/15 00:59:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, > localhost, PROCESS_LOCAL, 1191 bytes) > 14/09/15 00:59:13 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, > localhost, PROCESS_LOCAL, 1191 bytes) > 14/09/15 00:59:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) > 14/09/15 00:59:13 INFO Executor: Running task 1.0 in stage 0.0 (TID 1) > 14/09/15 00:59:13 INFO HadoopRDD: Input split: > file:/Users/aash/git/spark/pom.xml:20862+20863 > 14/09/15 00:59:13 INFO HadoopRDD: Input split: > file:/Users/aash/git/spark/pom.xml:0+20862 > {noformat} > There is an outstanding TODO in {{HadoopRDD.scala}} that may be related: > {noformat} > override def getPreferredLocations(split: Partition): Seq[String] = { > // TODO: Filtering out "localhost" in case of file:// URLs > val hadoopSplit = split.asInstanceOf[HadoopPartition] > hadoopSplit.inputSplit.value.getLocations.filter(_ != "localhost") > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19019) PySpark does not work with Python 3.6.0
[ https://issues.apache.org/jira/browse/SPARK-19019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992699#comment-15992699 ] Hyukjin Kwon commented on SPARK-19019: -- To solve this problem fully, I had to port a cloudpickle change too in the PR. Only fixing the hijacked one described above does not fully solve this issue. Please refer to the discussion in the PR and the change. > PySpark does not work with Python 3.6.0 > --- > > Key: SPARK-19019 > URL: https://issues.apache.org/jira/browse/SPARK-19019 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > Fix For: 1.6.4, 2.0.3, 2.1.1, 2.2.0 > > > Currently, PySpark does not work with Python 3.6.0. > Running {{./bin/pyspark}} simply throws the error as below: > {code} > Traceback (most recent call last): > File ".../spark/python/pyspark/shell.py", line 30, in > import pyspark > File ".../spark/python/pyspark/__init__.py", line 46, in > from pyspark.context import SparkContext > File ".../spark/python/pyspark/context.py", line 36, in > from pyspark.java_gateway import launch_gateway > File ".../spark/python/pyspark/java_gateway.py", line 31, in > from py4j.java_gateway import java_import, JavaGateway, GatewayClient > File "", line 961, in _find_and_load > File "", line 950, in _find_and_load_unlocked > File "", line 646, in _load_unlocked > File "", line 616, in _load_backward_compatible > File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line > 18, in > File > "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pydoc.py", > line 62, in > import pkgutil > File > "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pkgutil.py", > line 22, in > ModuleInfo = namedtuple('ModuleInfo', 'module_finder name ispkg') > File ".../spark/python/pyspark/serializers.py", line 394, in namedtuple > cls = _old_namedtuple(*args, **kwargs) > TypeError: 
namedtuple() missing 3 required keyword-only arguments: 'verbose', > 'rename', and 'module' > {code} > The problem is in > https://github.com/apache/spark/blob/3c68944b229aaaeeaee3efcbae3e3be9a2914855/python/pyspark/serializers.py#L386-L394 > as the error says, and the cause seems to be that the arguments of > {{namedtuple}} are now completely keyword-only arguments from Python 3.6.0 > (See https://bugs.python.org/issue25628). > We currently copy this function via {{types.FunctionType}} which does not set > the default values of keyword-only arguments (meaning > {{namedtuple.__kwdefaults__}}) and this seems to cause internally missing > values in the function (non-bound arguments). > This ends up as below: > {code} > import types > import collections > def _copy_func(f): > return types.FunctionType(f.__code__, f.__globals__, f.__name__, > f.__defaults__, f.__closure__) > _old_namedtuple = _copy_func(collections.namedtuple) > {code} > If we call as below: > {code} > >>> _old_namedtuple("a", "b") > Traceback (most recent call last): > File "", line 1, in > TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', > 'rename', and 'module' > {code} > It throws an exception as above because {{__kwdefaults__}} for required > keyword arguments seems unset in the copied function. So, if we give explicit > values for these, > {code} > >>> _old_namedtuple("a", "b", verbose=False, rename=False, module=None) > > {code} > It works fine. > It seems now we should properly set these on the hijacked one. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
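The fix direction described in the issue can be sketched as follows: copy the function, then restore the keyword-only defaults that {{types.FunctionType}} drops (a sketch of the idea only; the actual patch in pyspark/serializers.py also ports a cloudpickle change, per the comment above):

```python
import types
import collections

def _copy_func(f):
    # FunctionType copies code, globals, name, positional defaults, and
    # closure -- but NOT __kwdefaults__, which namedtuple relies on from
    # Python 3.6 onward (its optional arguments became keyword-only).
    return types.FunctionType(f.__code__, f.__globals__, f.__name__,
                              f.__defaults__, f.__closure__)

_old_namedtuple = _copy_func(collections.namedtuple)
# Restore the keyword-only defaults so plain two-argument calls work again:
_old_namedtuple.__kwdefaults__ = collections.namedtuple.__kwdefaults__

Point = _old_namedtuple("Point", ["x", "y"])
```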
[jira] [Updated] (SPARK-19019) PySpark does not work with Python 3.6.0
[ https://issues.apache.org/jira/browse/SPARK-19019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-19019: - Fix Version/s: 2.0.3 1.6.4 > PySpark does not work with Python 3.6.0 > --- > > Key: SPARK-19019 > URL: https://issues.apache.org/jira/browse/SPARK-19019 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > Fix For: 1.6.4, 2.0.3, 2.1.1, 2.2.0 > > > Currently, PySpark does not work with Python 3.6.0. > Running {{./bin/pyspark}} simply throws the error as below: > {code} > Traceback (most recent call last): > File ".../spark/python/pyspark/shell.py", line 30, in > import pyspark > File ".../spark/python/pyspark/__init__.py", line 46, in > from pyspark.context import SparkContext > File ".../spark/python/pyspark/context.py", line 36, in > from pyspark.java_gateway import launch_gateway > File ".../spark/python/pyspark/java_gateway.py", line 31, in > from py4j.java_gateway import java_import, JavaGateway, GatewayClient > File "", line 961, in _find_and_load > File "", line 950, in _find_and_load_unlocked > File "", line 646, in _load_unlocked > File "", line 616, in _load_backward_compatible > File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line > 18, in > File > "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pydoc.py", > line 62, in > import pkgutil > File > "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pkgutil.py", > line 22, in > ModuleInfo = namedtuple('ModuleInfo', 'module_finder name ispkg') > File ".../spark/python/pyspark/serializers.py", line 394, in namedtuple > cls = _old_namedtuple(*args, **kwargs) > TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', > 'rename', and 'module' > {code} > The problem is in > 
https://github.com/apache/spark/blob/3c68944b229aaaeeaee3efcbae3e3be9a2914855/python/pyspark/serializers.py#L386-L394 > as the error says, and the cause seems to be that the arguments of > {{namedtuple}} are now completely keyword-only arguments from Python 3.6.0 > (See https://bugs.python.org/issue25628). > We currently copy this function via {{types.FunctionType}} which does not set > the default values of keyword-only arguments (meaning > {{namedtuple.__kwdefaults__}}) and this seems to cause internally missing > values in the function (non-bound arguments). > This ends up as below: > {code} > import types > import collections > def _copy_func(f): > return types.FunctionType(f.__code__, f.__globals__, f.__name__, > f.__defaults__, f.__closure__) > _old_namedtuple = _copy_func(collections.namedtuple) > {code} > If we call as below: > {code} > >>> _old_namedtuple("a", "b") > Traceback (most recent call last): > File "", line 1, in > TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', > 'rename', and 'module' > {code} > It throws an exception as above because {{__kwdefaults__}} for required > keyword arguments seems unset in the copied function. So, if we give explicit > values for these, > {code} > >>> _old_namedtuple("a", "b", verbose=False, rename=False, module=None) > > {code} > It works fine. > It seems now we should properly set these on the hijacked one. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20555) Incorrect handling of Oracle's decimal types via JDBC
[ https://issues.apache.org/jira/browse/SPARK-20555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20555: Assignee: (was: Apache Spark) > Incorrect handling of Oracle's decimal types via JDBC > - > > Key: SPARK-20555 > URL: https://issues.apache.org/jira/browse/SPARK-20555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Gabor Feher > > When querying an Oracle database, Spark maps some Oracle numeric data types > to incorrect Catalyst data types: > 1. DECIMAL(1) becomes BooleanType > In Oracle, a DECIMAL(1) can have values from -9 to 9. > In Spark now, values larger than 1 become the boolean value true. > 2. DECIMAL(3,2) becomes IntegerType > In Oracle, a DECIMAL(3,2) can have values like 1.23 > In Spark now, digits after the decimal point are dropped. > 3. DECIMAL(10) becomes IntegerType > In Oracle, a DECIMAL(10) can have the value 9999999999 (ten nines), which is > more than 2^31 > Spark throws an exception: "java.sql.SQLException: Numeric Overflow" > I think the best solution is to always keep Oracle's decimal types. (In > theory we could introduce a FloatType in some cases of #2, and fix #3 by only > introducing IntegerType for DECIMAL(9). But in my opinion, that would end up > complicated and error-prone.) > Note: I think the above problems were introduced as part of > The main purpose of that PR seems to be converting Spark types to correct > Oracle types, and that part seems good to me. But it also adds the inverse > conversions. As it turns out in the above examples, that is not possible. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20555) Incorrect handling of Oracle's decimal types via JDBC
[ https://issues.apache.org/jira/browse/SPARK-20555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20555: Assignee: Apache Spark > Incorrect handling of Oracle's decimal types via JDBC > - > > Key: SPARK-20555 > URL: https://issues.apache.org/jira/browse/SPARK-20555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Gabor Feher >Assignee: Apache Spark > > When querying an Oracle database, Spark maps some Oracle numeric data types > to incorrect Catalyst data types: > 1. DECIMAL(1) becomes BooleanType > In Oracle, a DECIMAL(1) can have values from -9 to 9. > In Spark now, values larger than 1 become the boolean value true. > 2. DECIMAL(3,2) becomes IntegerType > In Oracle, a DECIMAL(3,2) can have values like 1.23 > In Spark now, digits after the decimal point are dropped. > 3. DECIMAL(10) becomes IntegerType > In Oracle, a DECIMAL(10) can have the value 9999999999 (ten nines), which is > more than 2^31 > Spark throws an exception: "java.sql.SQLException: Numeric Overflow" > I think the best solution is to always keep Oracle's decimal types. (In > theory we could introduce a FloatType in some cases of #2, and fix #3 by only > introducing IntegerType for DECIMAL(9). But in my opinion, that would end up > complicated and error-prone.) > Note: I think the above problems were introduced as part of > The main purpose of that PR seems to be converting Spark types to correct > Oracle types, and that part seems good to me. But it also adds the inverse > conversions. As it turns out in the above examples, that is not possible. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20555) Incorrect handling of Oracle's decimal types via JDBC
[ https://issues.apache.org/jira/browse/SPARK-20555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992685#comment-15992685 ] Apache Spark commented on SPARK-20555: -- User 'gaborfeher' has created a pull request for this issue: https://github.com/apache/spark/pull/17830 > Incorrect handling of Oracle's decimal types via JDBC > - > > Key: SPARK-20555 > URL: https://issues.apache.org/jira/browse/SPARK-20555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Gabor Feher > > When querying an Oracle database, Spark maps some Oracle numeric data types > to incorrect Catalyst data types: > 1. DECIMAL(1) becomes BooleanType > In Oracle, a DECIMAL(1) can have values from -9 to 9. > In Spark now, values larger than 1 become the boolean value true. > 2. DECIMAL(3,2) becomes IntegerType > In Oracle, a DECIMAL(3,2) can have values like 1.23 > In Spark now, digits after the decimal point are dropped. > 3. DECIMAL(10) becomes IntegerType > In Oracle, a DECIMAL(10) can have the value 9999999999 (ten nines), which is > more than 2^31 > Spark throws an exception: "java.sql.SQLException: Numeric Overflow" > I think the best solution is to always keep Oracle's decimal types. (In > theory we could introduce a FloatType in some cases of #2, and fix #3 by only > introducing IntegerType for DECIMAL(9). But in my opinion, that would end up > complicated and error-prone.) > Note: I think the above problems were introduced as part of > The main purpose of that PR seems to be converting Spark types to correct > Oracle types, and that part seems good to me. But it also adds the inverse > conversions. As it turns out in the above examples, that is not possible. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20555) Incorrect handling of Oracle's decimal types via JDBC
Gabor Feher created SPARK-20555: --- Summary: Incorrect handling of Oracle's decimal types via JDBC Key: SPARK-20555 URL: https://issues.apache.org/jira/browse/SPARK-20555 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Gabor Feher When querying an Oracle database, Spark maps some Oracle numeric data types to incorrect Catalyst data types: 1. DECIMAL(1) becomes BooleanType In Oracle, a DECIMAL(1) can have values from -9 to 9. In Spark now, values larger than 1 become the boolean value true. 2. DECIMAL(3,2) becomes IntegerType In Oracle, a DECIMAL(3,2) can have values like 1.23 In Spark now, digits after the decimal point are dropped. 3. DECIMAL(10) becomes IntegerType In Oracle, a DECIMAL(10) can have the value 9999999999 (ten nines), which is more than 2^31 Spark throws an exception: "java.sql.SQLException: Numeric Overflow" I think the best solution is to always keep Oracle's decimal types. (In theory we could introduce a FloatType in some cases of #2, and fix #3 by only introducing IntegerType for DECIMAL(9). But in my opinion, that would end up complicated and error-prone.) Note: I think the above problems were introduced as part of The main purpose of that PR seems to be converting Spark types to correct Oracle types, and that part seems good to me. But it also adds the inverse conversions. As it turns out in the above examples, that is not possible. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
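The proposed resolution (always keep Oracle's decimal types) can be sketched as a simple mapping rule. This is illustrative Python standing in for the Scala JDBC dialect; the function name and string type names are assumptions for the sketch, not Spark's API:

```python
def oracle_number_to_catalyst(precision: int, scale: int) -> str:
    """Map an Oracle NUMBER/DECIMAL(precision, scale) to a Catalyst type name.

    The narrowing behavior reported in this issue lost information:
      (1, 0)  -> BooleanType  (but the column holds -9..9)
      (3, 2)  -> IntegerType  (drops the fraction digits)
      (10, 0) -> IntegerType  (9999999999 > 2**31 - 1, so it overflows)
    The proposal is simply to preserve the decimal type instead.
    """
    return f"DecimalType({precision},{scale})"
```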
[jira] [Commented] (SPARK-18891) Support for specific collection types
[ https://issues.apache.org/jira/browse/SPARK-18891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992672#comment-15992672 ] Nils Grabbert commented on SPARK-18891: --- Example in SPARK-19104 still not working. > Support for specific collection types > - > > Key: SPARK-18891 > URL: https://issues.apache.org/jira/browse/SPARK-18891 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.3, 2.1.0 >Reporter: Michael Armbrust >Priority: Critical > Fix For: 2.2.0 > > > Encoders treat all collections the same (i.e. {{Seq}} vs {{List}}) which > force users to only define classes with the most generic type. > An [example > error|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/2398463439880241/2840265927289860/latest.html]: > {code} > case class SpecificCollection(aList: List[Int]) > Seq(SpecificCollection(1 :: Nil)).toDS().collect() > {code} > {code} > java.lang.RuntimeException: Error while decoding: > java.util.concurrent.ExecutionException: java.lang.Exception: failed to > compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 98, Column 120: No applicable constructor/method found > for actual parameters "scala.collection.Seq"; candidates are: > "line29e7e4b1e36445baa3505b2e102aa86b29.$read$$iw$$iw$$iw$$iw$SpecificCollection(scala.collection.immutable.List)" > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20554) Remove usage of scala.language.reflectiveCalls
[ https://issues.apache.org/jira/browse/SPARK-20554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992614#comment-15992614 ] Umesh Chaudhary commented on SPARK-20554: - [~srowen] I can work on this. Currently seeing 15 occurrences of scala.language.reflectiveCalls imports. Will re-evaluate the warnings after removing the imports of reflectiveCalls. > Remove usage of scala.language.reflectiveCalls > -- > > Key: SPARK-20554 > URL: https://issues.apache.org/jira/browse/SPARK-20554 > Project: Spark > Issue Type: Improvement > Components: ML, Spark Core, SQL, Structured Streaming >Affects Versions: 2.1.0 >Reporter: Sean Owen >Priority: Minor > > In several parts of the code we have imported > {{scala.language.reflectiveCalls}} to suppress a warning about, well, > reflective calls. I know from cleaning up build warnings in 2.2 that > almost all cases of this are inadvertent and mask a type problem. > Example, in HiveDDLSuite: > {code} > val expectedTablePath = > if (dbPath.isEmpty) { > hiveContext.sessionState.catalog.defaultTablePath(tableIdentifier) > } else { > new Path(new Path(dbPath.get), tableIdentifier.table) > } > val filesystemPath = new Path(expectedTablePath.toString) > {code} > This shouldn't really work because one branch returns a URI and the other a > Path. In this case the code only needs an object with a toString method, and > Scala can make this work with structural types and reflection. > Obviously, the intent was to add ".toURI" to the second branch to make > both a URI! > I think we should probably clean this up by taking out all imports of > reflectiveCalls, and re-evaluating all of the warnings. There may be a few > legit usages. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20554) Remove usage of scala.language.reflectiveCalls
Sean Owen created SPARK-20554: - Summary: Remove usage of scala.language.reflectiveCalls Key: SPARK-20554 URL: https://issues.apache.org/jira/browse/SPARK-20554 Project: Spark Issue Type: Improvement Components: ML, Spark Core, SQL, Structured Streaming Affects Versions: 2.1.0 Reporter: Sean Owen Priority: Minor In several parts of the code we have imported {{scala.language.reflectiveCalls}} to suppress a warning about, well, reflective calls. I know from cleaning up build warnings in 2.2 that almost all cases of this are inadvertent and mask a type problem. Example, in HiveDDLSuite: {code} val expectedTablePath = if (dbPath.isEmpty) { hiveContext.sessionState.catalog.defaultTablePath(tableIdentifier) } else { new Path(new Path(dbPath.get), tableIdentifier.table) } val filesystemPath = new Path(expectedTablePath.toString) {code} This shouldn't really work because one branch returns a URI and the other a Path. In this case the code only needs an object with a toString method, and Scala can make this work with structural types and reflection. Obviously, the intent was to add ".toURI" to the second branch to make both a URI! I think we should probably clean this up by taking out all imports of reflectiveCalls, and re-evaluating all of the warnings. There may be a few legit usages. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
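The bug pattern in the HiveDDLSuite example translates outside Scala too: two branches of a conditional produce different types, and duck-typed/reflective access hides the mismatch. A hedged Python analogue of the suggested fix (normalize both branches to one type, mirroring the missing {{.toURI}}; paths and names here are illustrative):

```python
from pathlib import Path

def default_table_path(table: str) -> str:
    # Stand-in for catalog.defaultTablePath: already returns a URI string.
    return (Path("/warehouse") / table).as_uri()

def expected_table_path(db_path, table: str) -> str:
    if db_path is None:
        return default_table_path(table)      # URI string
    # Without normalization this branch would return a Path object and the
    # mismatch would only "work" via str(); .as_uri() plays the role of
    # Scala's .toURI, so both branches return the same type.
    return (Path(db_path) / table).as_uri()
```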
[jira] [Created] (SPARK-20553) Update ALS examples for ML to illustrate recommend all
Nick Pentreath created SPARK-20553: -- Summary: Update ALS examples for ML to illustrate recommend all Key: SPARK-20553 URL: https://issues.apache.org/jira/browse/SPARK-20553 Project: Spark Issue Type: Documentation Components: ML, PySpark Affects Versions: 2.2.0 Reporter: Nick Pentreath Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20300) Python API for ALSModel.recommendForAllUsers,Items
[ https://issues.apache.org/jira/browse/SPARK-20300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-20300. Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17622 [https://github.com/apache/spark/pull/17622] > Python API for ALSModel.recommendForAllUsers,Items > -- > > Key: SPARK-20300 > URL: https://issues.apache.org/jira/browse/SPARK-20300 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley >Assignee: Nick Pentreath > Fix For: 2.2.0 > > > Python API for ALSModel methods recommendForAllUsers, recommendForAllItems -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20411) New features for expression.scalalang.typed
[ https://issues.apache.org/jira/browse/SPARK-20411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992517#comment-15992517 ] Jason Moore commented on SPARK-20411: - And, ideally, anything else within org.apache.spark.sql.functions (e.g. countDistinct). We're looking to replace our use of DataFrames with Datasets, which means finding a replacement for all the aggregation functions that we use. If I end up putting together some functions myself, I'll pop back here to contribute them. > New features for expression.scalalang.typed > --- > > Key: SPARK-20411 > URL: https://issues.apache.org/jira/browse/SPARK-20411 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.1.0 >Reporter: Loic Descotte >Priority: Minor > > In Spark 2 it is possible to use typed expressions for aggregation methods: > {code} > import org.apache.spark.sql.expressions.scalalang._ > dataset.groupByKey(_.productId).agg(typed.sum[Token](_.score)).toDF("productId", > "sum").orderBy('productId).show > {code} > It seems that only avg, count and sum are defined : > https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/expressions/scalalang/typed.html > It is very nice to be able to use a typesafe DSL, but it would be good to > have more possibilities, like min and max functions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20443) The blockSize of MLLIB ALS should be setting by the User
[ https://issues.apache.org/jira/browse/SPARK-20443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992497#comment-15992497 ] Nick Pentreath commented on SPARK-20443: Interesting - though it appears to me that {{2048}} is the best setting for both data sizes. At the least I think we should adjust the default. > The blockSize of MLLIB ALS should be setting by the User > - > > Key: SPARK-20443 > URL: https://issues.apache.org/jira/browse/SPARK-20443 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.3.0 >Reporter: Peng Meng >Priority: Minor > > The blockSize of MLLIB ALS is very important for ALS performance. > In our test, when the blockSize is 128, the performance is about 4X comparing > with the blockSize is 4096 (default value). > The following are our test results: > BlockSize(recommendationForAll time) > 128(124s), 256(160s), 512(184s), 1024(244s), 2048(332s), 4096(488s), 8192(OOM) > The Test Environment: > 3 workers: each work 10 core, each work 30G memory, each work 1 executor. > The Data: User 480,000, and Item 17,000 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20443) The blockSize of MLLIB ALS should be setting by the User
[ https://issues.apache.org/jira/browse/SPARK-20443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992492#comment-15992492 ] Teng Jiang commented on SPARK-20443: All the tests above were done with SPARK-11968 / [PR #17742 | https://github.com/apache/spark/pull/17742]. The blockSize still makes sense considering the number of data fetches per iteration and the GC time. > The blockSize of MLLIB ALS should be setting by the User > - > > Key: SPARK-20443 > URL: https://issues.apache.org/jira/browse/SPARK-20443 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.3.0 >Reporter: Peng Meng >Priority: Minor > > The blockSize of MLLIB ALS is very important for ALS performance. > In our test, when the blockSize is 128, the performance is about 4X comparing > with the blockSize is 4096 (default value). > The following are our test results: > BlockSize(recommendationForAll time) > 128(124s), 256(160s), 512(184s), 1024(244s), 2048(332s), 4096(488s), 8192(OOM) > The Test Environment: > 3 workers: each work 10 core, each work 30G memory, each work 1 executor. > The Data: User 480,000, and Item 17,000 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20551) ImportError adding custom class from jar in pyspark
[ https://issues.apache.org/jira/browse/SPARK-20551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992491#comment-15992491 ] Nick Pentreath commented on SPARK-20551:

Yes, I agree that it appears you're trying to import Java or Scala classes in Python, which won't work. I suggest you post a question to the Spark user list asking for help: u...@spark.apache.org. Please indicate what you are trying to do and provide some example code (it appears that you're trying to read a custom Hadoop {{InputFormat}} in PySpark?).

> ImportError adding custom class from jar in pyspark
> ---
>
> Key: SPARK-20551
> URL: https://issues.apache.org/jira/browse/SPARK-20551
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Spark Shell
> Affects Versions: 2.1.0
> Reporter: Sergio Monteiro
>
> the following imports are failing in PySpark, even when I set the --jars or
> --driver-class-path:
> import net.ripe.hadoop.pcap.io.PcapInputFormat
> import net.ripe.hadoop.pcap.io.CombinePcapInputFormat
> import net.ripe.hadoop.pcap.packet.Packet
> Using Python version 2.7.12 (default, Nov 19 2016 06:48:10)
> SparkSession available as 'spark'.
> >>> import net.ripe.hadoop.pcap.io.PcapInputFormat
> Traceback (most recent call last):
> File "", line 1, in
> ImportError: No module named net.ripe.hadoop.pcap.io.PcapInputFormat
> >>> import net.ripe.hadoop.pcap.io.CombinePcapInputFormat
> Traceback (most recent call last):
> File "", line 1, in
> ImportError: No module named net.ripe.hadoop.pcap.io.CombinePcapInputFormat
> >>> import net.ripe.hadoop.pcap.packet.Packet
> Traceback (most recent call last):
> File "", line 1, in
> ImportError: No module named net.ripe.hadoop.pcap.packet.Packet
> >>>
> The same works great in spark-shell.
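The point in the comment above can be reproduced without a cluster: a JVM class name is syntactically a valid Python dotted import, but Python's import machinery only searches `sys.path` for Python packages, so the import fails before the JVM is ever consulted. A minimal sketch follows; the commented-out gateway lines are an assumption-laden illustration (they presume a running SparkSession named `spark` and the jar on the classpath), not a tested recipe:

```python
# A JVM class name parses as a Python import, but Python resolves it against
# sys.path, where no package named "net" exists -- hence the ImportError.
try:
    import net.ripe.hadoop.pcap.io.PcapInputFormat  # JVM class, not a Python module
    print("import succeeded")
except ImportError as err:
    print("ImportError:", err)

# JVM classes are instead reached through the Py4J gateway, or passed by
# fully-qualified name to the Hadoop input APIs (hypothetical sketch --
# requires a SparkSession `spark` and the jar on driver and executors):
#
#   fmt_class = spark._jvm.net.ripe.hadoop.pcap.io.PcapInputFormat
#   rdd = spark.sparkContext.newAPIHadoopFile(
#       "hdfs:///captures/*.pcap",
#       "net.ripe.hadoop.pcap.io.PcapInputFormat",
#       "org.apache.hadoop.io.LongWritable",
#       "org.apache.hadoop.io.ObjectWritable")
```

This is why the same names work in spark-shell: the Scala REPL compiles against the JVM classpath directly, while Python never sees JVM classes through `import`.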
[jira] [Closed] (SPARK-20551) ImportError adding custom class from jar in pyspark
[ https://issues.apache.org/jira/browse/SPARK-20551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath closed SPARK-20551.
--
Resolution: Not A Problem
[jira] [Resolved] (SPARK-20552) Add isNotDistinctFrom/isDistinctFrom for column APIs in Scala/Java and Python
[ https://issues.apache.org/jira/browse/SPARK-20552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-20552.
--
Resolution: Won't Fix

I am resolving this. Please refer to the discussion in the PR.

> Add isNotDistinctFrom/isDistinctFrom for column APIs in Scala/Java and Python
> --
>
> Key: SPARK-20552
> URL: https://issues.apache.org/jira/browse/SPARK-20552
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, SQL
> Affects Versions: 2.2.0
> Reporter: Hyukjin Kwon
> Priority: Minor
>
> After SPARK-20463, we are able to use {{IS [NOT] DISTINCT FROM}} in Spark SQL.
> It looks like we should add {{isNotDistinctFrom}} (as an alias for {{eqNullSafe}})
> and {{isDistinctFrom}} (for a negated {{eqNullSafe}}) in both Scala/Java and
> Python in the Column APIs.
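For context on the semantics being aliased: {{eqNullSafe}} / SQL {{IS NOT DISTINCT FROM}} behaves like ordinary equality except that two NULLs compare equal and a NULL never equals a value, so the result is always a boolean rather than NULL. A plain-Python model of the truth table (illustrative only, not Spark API; `None` stands in for SQL NULL):

```python
def is_not_distinct_from(a, b):
    """Model of SQL's IS NOT DISTINCT FROM (Spark's Column.eqNullSafe):
    null-safe equality that always yields True/False, never NULL."""
    if a is None and b is None:
        return True   # NULL IS NOT DISTINCT FROM NULL -> true
    if a is None or b is None:
        return False  # NULL vs. a value -> false (plain "=" would yield NULL)
    return a == b

print(is_not_distinct_from(None, None))  # True
print(is_not_distinct_from(None, 1))     # False
print(is_not_distinct_from(1, 1))        # True
```

`isDistinctFrom` would simply be the negation, which is why the proposal framed both as thin aliases over the existing `eqNullSafe`.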
[jira] [Commented] (SPARK-20443) The blockSize of MLLIB ALS should be setting by the User
[ https://issues.apache.org/jira/browse/SPARK-20443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992475#comment-15992475 ] Nick Pentreath commented on SPARK-20443:

Were these tests run against current master? SPARK-11968 / [PR #17742|https://github.com/apache/spark/pull/17742] should make block size less relevant - we should of course re-test this once that PR is merged, to see if it's worth exposing the parameter.
[jira] [Commented] (SPARK-20443) The blockSize of MLLIB ALS should be setting by the User
[ https://issues.apache.org/jira/browse/SPARK-20443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992471#comment-15992471 ] Teng Jiang commented on SPARK-20443:

I did some tests on the blockSize.

The test environment:
3 workers; each worker: 40 cores, 180G memory, 1 executor.
The data: 3,290,000 users and 208,000 items.

The results:

blockSize | rank=10 | rank=100
128 | 67.32min | 127.66min
256 | 46.68min | 87.67min
512 | 35.66min | 63.46min
1024 | 28.49min | 41.61min
2048 | 22.83min | 34.76min
4096 | 22.39min | 54.43min
8192 | 23.35min | 71.09min

Another dataset, with 480,000 users and 17,000 items (rank = 10):

blockSize | 128 | 256 | 512 | 1024 | 2048 | 4096 | 8192
time (s) | 98.2 | 70.4 | 52.7 | 45.3 | 45.0 | 60.5 | 67.3

For both datasets, as the blockSize grows from 128 to 8192, the recommendation time first decreases and then increases. Therefore, the optimal blockSize differs between datasets.
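The second measurement series in the comment above can be read off programmatically; per dataset, the best setting is just the argmin over the measured block sizes (numbers copied verbatim from the comment: 480,000 users / 17,000 items, rank 10):

```python
# recommend-for-all wall time in seconds, keyed by ALS blockSize,
# as reported in the comment above
times_s = {128: 98.2, 256: 70.4, 512: 52.7, 1024: 45.3,
           2048: 45.0, 4096: 60.5, 8192: 67.3}

best = min(times_s, key=times_s.get)
print(best, times_s[best])  # 2048 45.0
```

This is consistent with the earlier observation that {{2048}} looks like the best default for both data sizes, even though the exact optimum shifts per dataset.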
[jira] [Resolved] (SPARK-20549) java.io.CharConversionException: Invalid UTF-32 in JsonToStructs
[ https://issues.apache.org/jira/browse/SPARK-20549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20549.
-
Resolution: Fixed
Assignee: Burak Yavuz
Fix Version/s: 2.3.0, 2.2.1

> java.io.CharConversionException: Invalid UTF-32 in JsonToStructs
>
> Key: SPARK-20549
> URL: https://issues.apache.org/jira/browse/SPARK-20549
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Burak Yavuz
> Assignee: Burak Yavuz
> Fix For: 2.2.1, 2.3.0
>
> The same fix as for SPARK-16548 needs to be applied to JsonToStructs.
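A plausible way to picture the failure mode (a simplified sketch of RFC 4627-style JSON encoding auto-detection, not Jackson's or Spark's actual code): when a JSON parser is handed raw bytes, it sniffs the encoding from the leading NUL-byte pattern, so a corrupt or binary string column value that happens to start with NUL bytes gets misdetected as UTF-32 and the subsequent decode throws {{CharConversionException}}. Parsing from an already-decoded string, as in the SPARK-16548 fix, sidesteps the detection entirely.

```python
def sniff_json_encoding(data: bytes) -> str:
    """Simplified RFC 4627-style JSON encoding detection (illustrative
    only). Valid JSON text begins with two ASCII characters, so the
    placement of NUL bytes in the first four bytes reveals the encoding."""
    if len(data) >= 4:
        if data[0] == 0 and data[1] == 0 and data[2] == 0 and data[3] != 0:
            return "UTF-32BE"
        if data[0] != 0 and data[1] == 0 and data[2] == 0 and data[3] == 0:
            return "UTF-32LE"
    if len(data) >= 2:
        if data[0] == 0 and data[1] != 0:
            return "UTF-16BE"
        if data[0] != 0 and data[1] == 0:
            return "UTF-16LE"
    return "UTF-8"

print(sniff_json_encoding(b'{"a": 1}'))      # UTF-8
# Garbage bytes that start with three NULs look like UTF-32BE to the
# sniffer; decoding the rest as UTF-32 then fails with an exception.
print(sniff_json_encoding(b"\x00\x00\x00{"))  # UTF-32BE
```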