[jira] [Created] (SPARK-25693) Fix the multiple "manager" classes in org.apache.spark.deploy.security
Marcelo Vanzin created SPARK-25693: -- Summary: Fix the multiple "manager" classes in org.apache.spark.deploy.security Key: SPARK-25693 URL: https://issues.apache.org/jira/browse/SPARK-25693 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Marcelo Vanzin In SPARK-23781 I did some refactoring that introduces an {{AbstractCredentialManager}} class. That name clashes somewhat with the existing {{HadoopDelegationTokenManager}}. Since the latter doesn't really manage anything (it just orchestrates fetching the tokens), we could rename it to something more generic, or even clean up some of that class hierarchy. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite
[ https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25692: Priority: Blocker (was: Major) > Flaky test: ChunkFetchIntegrationSuite > -- > > Key: SPARK-25692 > URL: https://issues.apache.org/jira/browse/SPARK-25692 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Shixiong Zhu >Priority: Blocker > > Looks like the whole test suite is pretty flaky. See: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/ > This may be a regression in 2.4 as this didn't happen before.
[jira] [Updated] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite
[ https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-25692: - Description: Looks like the whole test suite is pretty flaky. See: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/ This may be a regression in 3.0 as this didn't happen in 2.4 branch. was: Looks like the whole test suite is pretty flaky. See: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/ This may be a regression in 2.4 as this didn't happen before. > Flaky test: ChunkFetchIntegrationSuite > -- > > Key: SPARK-25692 > URL: https://issues.apache.org/jira/browse/SPARK-25692 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Shixiong Zhu >Priority: Blocker > > Looks like the whole test suite is pretty flaky. See: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/ > This may be a regression in 3.0 as this didn't happen in 2.4 branch.
[jira] [Updated] (SPARK-25694) URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection issue
[ https://issues.apache.org/jira/browse/SPARK-25694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bo Yang updated SPARK-25694: Summary: URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection issue (was: URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection) > URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection issue > --- > > Key: SPARK-25694 > URL: https://issues.apache.org/jira/browse/SPARK-25694 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2 >Reporter: Bo Yang >Priority: Major > > URL.setURLStreamHandlerFactory() in SharedState causes URL.openConnection() > to return an FsUrlConnection object, which is not compatible with > HttpURLConnection. This causes an exception when using some third-party HTTP > libraries (e.g. scalaj.http). > The following code in Spark 2.3.0 introduced the issue: > sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala: > {code} > object SharedState extends Logging { ... > URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()) ... > } > {code} > Example exception when using scalaj.http in Spark: > StackTrace: scala.MatchError: > org.apache.hadoop.fs.FsUrlConnection:[http://.example.com|http://.example.com/] > (of class org.apache.hadoop.fs.FsUrlConnection) > at > scalaj.http.HttpRequest.scalaj$http$HttpRequest$$doConnection(Http.scala:343) > at scalaj.http.HttpRequest.exec(Http.scala:335) > at scalaj.http.HttpRequest.asString(Http.scala:455) > > One option to fix the issue is to return null in > URLStreamHandlerFactory.createURLStreamHandler when the protocol is > http/https, so it will use the default behavior and be compatible with > scalaj.http. 
The following is a code example: > {code} > class SparkUrlStreamHandlerFactory extends URLStreamHandlerFactory with > Logging { > private val fsUrlStreamHandlerFactory = new FsUrlStreamHandlerFactory() > override def createURLStreamHandler(protocol: String): URLStreamHandler = { > val handler = fsUrlStreamHandlerFactory.createURLStreamHandler(protocol) > if (handler == null) { > return null > } > if (protocol != null && > (protocol.equalsIgnoreCase("http") > || protocol.equalsIgnoreCase("https"))) { > // return null to use the system default URLStreamHandler > null > } else { > handler > } > } > } > {code}
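The workaround above can be sketched in plain Java as well, since the contract lives in `java.net`: returning null from `URLStreamHandlerFactory.createURLStreamHandler` makes `URL` fall back to the JVM's built-in handler for that protocol. The wrapped factory below is a hypothetical stand-in for Hadoop's `FsUrlStreamHandlerFactory`, so this illustrates only the null-returning contract, not the actual patch:

```java
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;

public class HttpSafeFactoryDemo {

    // Wrap an underlying factory, but return null for http/https so
    // java.net.URL keeps the JVM's default HttpURLConnection handler.
    // "delegate" is a stand-in for FsUrlStreamHandlerFactory (assumption).
    static class HttpSafeUrlStreamHandlerFactory implements URLStreamHandlerFactory {
        private final URLStreamHandlerFactory delegate;

        HttpSafeUrlStreamHandlerFactory(URLStreamHandlerFactory delegate) {
            this.delegate = delegate;
        }

        @Override
        public URLStreamHandler createURLStreamHandler(String protocol) {
            if (protocol != null
                    && (protocol.equalsIgnoreCase("http") || protocol.equalsIgnoreCase("https"))) {
                return null;  // null => URL falls back to the built-in handler
            }
            return delegate.createURLStreamHandler(protocol);
        }
    }

    public static void main(String[] args) {
        // Stand-in for FsUrlStreamHandlerFactory: claims every protocol.
        URLStreamHandler fake = new URLStreamHandler() {
            @Override
            protected URLConnection openConnection(java.net.URL u) {
                throw new UnsupportedOperationException();
            }
        };
        URLStreamHandlerFactory factory = new HttpSafeUrlStreamHandlerFactory(p -> fake);
        System.out.println(factory.createURLStreamHandler("http"));        // null
        System.out.println(factory.createURLStreamHandler("hdfs") == fake); // true
    }
}
```

The key design point is that the http/https check happens before delegating, so the Hadoop factory never gets a chance to claim those protocols.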
[jira] [Commented] (SPARK-25682) Docker images generated from dev build and from dist tarball are different
[ https://issues.apache.org/jira/browse/SPARK-25682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16644270#comment-16644270 ] Apache Spark commented on SPARK-25682: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/22681 > Docker images generated from dev build and from dist tarball are different > -- > > Key: SPARK-25682 > URL: https://issues.apache.org/jira/browse/SPARK-25682 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Minor > > There's at least one difference I noticed, because of this line: > {noformat} > COPY examples /opt/spark/examples > {noformat} > In a dev build, "examples" contains your usual source code and maven-style > directories, whereas in the dist version, it's this: > {code} > cp "$SPARK_HOME"/examples/target/scala*/jars/* "$DISTDIR/examples/jars" > {code} > So the path to the actual jar files ends up being different depending on how > you built the image.
[jira] [Updated] (SPARK-25640) Clarify/Improve EvalType for grouped aggregate and window aggregate
[ https://issues.apache.org/jira/browse/SPARK-25640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25640: Target Version/s: 3.0.0 > Clarify/Improve EvalType for grouped aggregate and window aggregate > --- > > Key: SPARK-25640 > URL: https://issues.apache.org/jira/browse/SPARK-25640 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Li Jin >Priority: Major > > Currently, grouped aggregate and window aggregate use different EvalTypes; > however, they map to the same user-facing type PandasUDFType.GROUPED_MAP. > It makes sense to have one user-facing type because it > (PandasUDFType.GROUPED_MAP) can be used in both groupby and window operations. > However, the mismatch between PandasUDFType and EvalType can be confusing > to developers. We should clarify and/or improve this. > See discussion at: > https://github.com/apache/spark/pull/22620#discussion_r222452544
[jira] [Created] (SPARK-25695) Spark history server event log store problem
Si Chen created SPARK-25695: --- Summary: Spark history server event log store problem Key: SPARK-25695 URL: https://issues.apache.org/jira/browse/SPARK-25695 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.0.0 Reporter: Si Chen Environment: Spark 2.0.0, Hadoop 2.7.3 spark-default.conf || spark.eventLog.dir {color:#d04437}file:/home/hdfs/event{color} spark.eventLog.enabled true spark.executor.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64 spark.history.fs.logDirectory {color:#d04437}file:/home/hdfs/event{color} spark.history.kerberos.keytab none spark.history.kerberos.principal none spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider spark.history.ui.port 18081 spark.yarn.historyServer.address slave6.htdata.com:18081 spark.yarn.queue default|| I want to save the event log to local disk. When I submit a Spark job in client deploy mode, the event log can be written to local disk, but when I use cluster mode, the following problem arises. I am sure all servers have this path. !image-2018-10-10-13-51-45-452.png!
[jira] [Updated] (SPARK-25695) Spark history server event log store problem
[ https://issues.apache.org/jira/browse/SPARK-25695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Si Chen updated SPARK-25695: Description: Environment: Spark 2.0.0, Hadoop 2.7.3 spark-default.conf {code:java} spark.driver.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64 spark.eventLog.dir file:/home/hdfs/event spark.eventLog.enabled true spark.executor.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64 spark.history.fs.logDirectory file:/home/hdfs/event spark.history.kerberos.keytab none spark.history.kerberos.principal none spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider spark.history.ui.port 18081 spark.yarn.historyServer.address slave6.htdata.com:18081 spark.yarn.queue default {code} I want to save the event log to local disk. When I submit a Spark job in client deploy mode, the event log can be written to local disk, but when I use cluster mode, the following problem arises. I am sure all servers have this path. {code:java} 18/10/10 13:10:13 INFO cluster.SchedulerExtensionServices: Starting Yarn extension services with app application_1538963194112_0033 and attemptId Some(appattempt_1538963194112_0033_01) 18/10/10 13:10:13 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 63016. 
18/10/10 13:10:13 INFO netty.NettyBlockTransferService: Server created on 192.168.0.78:63016 18/10/10 13:10:13 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.0.78, 63016) 18/10/10 13:10:13 INFO storage.BlockManagerMasterEndpoint: Registering block manager 192.168.0.78:63016 with 366.3 MB RAM, BlockManagerId(driver, 192.168.0.78, 63016) 18/10/10 13:10:13 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.0.78, 63016) 18/10/10 13:10:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@32844dd3{/metrics/json,null,AVAILABLE} 18/10/10 13:10:13 ERROR spark.SparkContext: Error initializing SparkContext. java.io.FileNotFoundException: File file:/home/hdfs/event does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:624) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:850) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:614) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:422) at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:93) at org.apache.spark.SparkContext.<init>(SparkContext.scala:516) at org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:836) at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:84) at com.iiot.stream.spark.HTMonitorContext$.main(HTMonitorContext.scala:23) at com.iiot.stream.spark.HTMonitorContext.main(HTMonitorContext.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:627) 18/10/10 13:10:13 INFO 
server.ServerConnector: Stopped ServerConnector@4fd401cf{HTTP/1.1}{0.0.0.0:0} 18/10/10 13:10:13 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@73fe17d4{/stages/stage/kill,null,UNAVAILABLE} 18/10/10 13:10:13 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@11e5f9d4{/api,null,UNAVAILABLE} 18/10/10 13:10:13 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@5d6fa266{/,null,UNAVAILABLE} 18/10/10 13:10:13 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@5920363{/static,null,UNAVAILABLE} 18/10/10 13:10:13 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@4494dda9{/executors/threadDump/json,null,UNAVAILABLE} 18/10/10 13:10:13 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@5a32de89{/executors/threadDump,null,UNAVAILABLE} 18/10/10 13:10:13 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@7c448c23{/executors/json,null,UNAVAILABLE} 18/10/10 13:10:13 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@b0949d9{/executors,null,UNAVAILABLE} 18/10/10 13:10:13 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@6a3e90c6{/environment/json,null,UNAVAILABLE} 18/10/10 13:10:13 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@6979e53f{/environment,null,UNAVAILABLE} {code} was:
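A hedged note on the stack trace above: in cluster mode the event log is written by the driver, which runs inside the YARN ApplicationMaster, so a `file:` path must exist and be accessible to that container's user on whichever node the driver lands on. The usual way to sidestep per-node local paths is to point both settings at a shared filesystem; a sketch for spark-defaults.conf, with a hypothetical HDFS directory:

```
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///spark-history/event
spark.history.fs.logDirectory    hdfs:///spark-history/event
```

The directory must be created up front (e.g. `hdfs dfs -mkdir -p /spark-history/event`); Spark does not create it for you.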
[jira] [Commented] (SPARK-24291) Data source table is not displaying records when files are uploaded to table location
[ https://issues.apache.org/jira/browse/SPARK-24291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16644435#comment-16644435 ] sandeep katta commented on SPARK-24291: --- [~cloud_fan] [~srowen] Is this a valid use case? Normally the LOAD command is used to load data into a table; I don't see a valid scenario where the file would be manually copied to the HDFS location. > Data source table is not displaying records when files are uploaded to table > location > - > > Key: SPARK-24291 > URL: https://issues.apache.org/jira/browse/SPARK-24291 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: OS: SUSE11 > Spark Version: 2.3 >Reporter: Sushanta Sen >Priority: Major > > Precondition: > 1.Already one .orc file exists in the /tmp/orcdata/ location > # Launch Spark-sql > # spark-sql> CREATE TABLE os_orc (name string, version string, other string) > USING ORC OPTIONS (path '/tmp/orcdata/'); > # spark-sql> select * from os_orc; > Spark 2.3.0 Apache > Time taken: 2.538 seconds, Fetched 1 row(s) > # pc1:/opt/# *./hadoop dfs -ls /tmp/orcdata* > Found 1 items > -rw-r--r-- 3 spark hadoop 475 2018-05-09 18:21 > /tmp/orcdata/part-0-d488121b-e9fd-4269-a6ea-842c631722ee-c000.snappy.orc > pc1:/opt/# *./hadoop fs -copyFromLocal > /opt/OS/loaddata/orcdata/part-1-d488121b-e9fd-4269-a6ea-842c631722ee-c000.snappy.orc > /tmp/orcdata/data2.orc* > pc1:/opt/# *./hadoop dfs -ls /tmp/orcdata* > Found *2* items > -rw-r--r-- 3 spark hadoop 475 2018-05-15 14:59 /tmp/orcdata/data2.orc > -rw-r--r-- 3 spark hadoop 475 2018-05-09 18:21 > /tmp/orcdata/part-0-d488121b-e9fd-4269-a6ea-842c631722ee-c000.snappy.orc > pc1:/opt/# ** > 5. 
Again execute the select command on the table os_orc > spark-sql> select * from os_orc; > Spark 2.3.0 Apache > Time taken: 1.528 seconds, Fetched {color:#FF}1 row(s){color} > Actual Result: On executing the select command, it does not display all the > records that exist in the data source table location > Expected Result: All the records should be fetched and displayed for the data > source table from the location > NB: > 1.On exiting and relaunching the spark-sql session, the select command fetches > the correct # of records. > 2.This issue is valid for all the data source tables created with 'USING'. > I came across this use case in Spark 2.2.1 when I tried to reproduce a customer > site observation.
[jira] [Commented] (SPARK-24291) Data source table is not displaying records when files are uploaded to table location
[ https://issues.apache.org/jira/browse/SPARK-24291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16644459#comment-16644459 ] Sean Owen commented on SPARK-24291: --- I don't know. I would not expect this to necessarily update. What are you comparing it to that suggests it should? > Data source table is not displaying records when files are uploaded to table > location > - > > Key: SPARK-24291 > URL: https://issues.apache.org/jira/browse/SPARK-24291 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: OS: SUSE11 > Spark Version: 2.3 >Reporter: Sushanta Sen >Priority: Major > > Precondition: > 1.Already one .orc file exists in the /tmp/orcdata/ location > # Launch Spark-sql > # spark-sql> CREATE TABLE os_orc (name string, version string, other string) > USING ORC OPTIONS (path '/tmp/orcdata/'); > # spark-sql> select * from os_orc; > Spark 2.3.0 Apache > Time taken: 2.538 seconds, Fetched 1 row(s) > # pc1:/opt/# *./hadoop dfs -ls /tmp/orcdata* > Found 1 items > -rw-r--r-- 3 spark hadoop 475 2018-05-09 18:21 > /tmp/orcdata/part-0-d488121b-e9fd-4269-a6ea-842c631722ee-c000.snappy.orc > pc1:/opt/# *./hadoop fs -copyFromLocal > /opt/OS/loaddata/orcdata/part-1-d488121b-e9fd-4269-a6ea-842c631722ee-c000.snappy.orc > /tmp/orcdata/data2.orc* > pc1:/opt/# *./hadoop dfs -ls /tmp/orcdata* > Found *2* items > -rw-r--r-- 3 spark hadoop 475 2018-05-15 14:59 /tmp/orcdata/data2.orc > -rw-r--r-- 3 spark hadoop 475 2018-05-09 18:21 > /tmp/orcdata/part-0-d488121b-e9fd-4269-a6ea-842c631722ee-c000.snappy.orc > pc1:/opt/# ** > 5. 
Again execute the select command on the table os_orc > spark-sql> select * from os_orc; > Spark 2.3.0 Apache > Time taken: 1.528 seconds, Fetched {color:#FF}1 row(s){color} > Actual Result: On executing the select command, it does not display all the > records that exist in the data source table location > Expected Result: All the records should be fetched and displayed for the data > source table from the location > NB: > 1.On exiting and relaunching the spark-sql session, the select command fetches > the correct # of records. > 2.This issue is valid for all the data source tables created with 'USING'. > I came across this use case in Spark 2.2.1 when I tried to reproduce a customer > site observation.
[jira] [Commented] (SPARK-21569) Internal Spark class needs to be kryo-registered
[ https://issues.apache.org/jira/browse/SPARK-21569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16644394#comment-16644394 ] Lijun Cao commented on SPARK-21569: --- Register *org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage* in SparkConf manually. Note that the name is registered without *.class* at the end. Refer to KYLIN-3272 for details. > Internal Spark class needs to be kryo-registered > > > Key: SPARK-21569 > URL: https://issues.apache.org/jira/browse/SPARK-21569 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Ryan Williams >Priority: Major > > [Full repro here|https://github.com/ryan-williams/spark-bugs/tree/hf] > As of 2.2.0, {{saveAsNewAPIHadoopFile}} jobs fail (when > {{spark.kryo.registrationRequired=true}}) with: > {code} > java.lang.IllegalArgumentException: Class is not registered: > org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage > Note: To register this class use: > kryo.register(org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage.class); > at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:458) > at > com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79) > at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:488) > at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:593) > at > org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:315) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:383) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > This internal Spark class should be kryo-registered by Spark by default. > This was not a problem in 2.1.1. 
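The manual registration described in the comment above can be expressed directly in spark-defaults.conf via the standard `spark.kryo.classesToRegister` setting; a minimal sketch (note the `$` inner-class separator and, as the comment says, no trailing `.class`):

```
spark.kryo.registrationRequired  true
spark.kryo.classesToRegister     org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage
```

Equivalently, application code can call `SparkConf.registerKryoClasses` with the corresponding `Class` object before creating the context.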
[jira] [Assigned] (SPARK-25682) Docker images generated from dev build and from dist tarball are different
[ https://issues.apache.org/jira/browse/SPARK-25682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25682: Assignee: Apache Spark > Docker images generated from dev build and from dist tarball are different > -- > > Key: SPARK-25682 > URL: https://issues.apache.org/jira/browse/SPARK-25682 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Minor > > There's at least one difference I noticed, because of this line: > {noformat} > COPY examples /opt/spark/examples > {noformat} > In a dev build, "examples" contains your usual source code and maven-style > directories, whereas in the dist version, it's this: > {code} > cp "$SPARK_HOME"/examples/target/scala*/jars/* "$DISTDIR/examples/jars" > {code} > So the path to the actual jar files ends up being different depending on how > you built the image.
[jira] [Assigned] (SPARK-25682) Docker images generated from dev build and from dist tarball are different
[ https://issues.apache.org/jira/browse/SPARK-25682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25682: Assignee: (was: Apache Spark) > Docker images generated from dev build and from dist tarball are different > -- > > Key: SPARK-25682 > URL: https://issues.apache.org/jira/browse/SPARK-25682 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Minor > > There's at least one difference I noticed, because of this line: > {noformat} > COPY examples /opt/spark/examples > {noformat} > In a dev build, "examples" contains your usual source code and maven-style > directories, whereas in the dist version, it's this: > {code} > cp "$SPARK_HOME"/examples/target/scala*/jars/* "$DISTDIR/examples/jars" > {code} > So the path to the actual jar files ends up being different depending on how > you built the image.
[jira] [Commented] (SPARK-25682) Docker images generated from dev build and from dist tarball are different
[ https://issues.apache.org/jira/browse/SPARK-25682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16644269#comment-16644269 ] Apache Spark commented on SPARK-25682: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/22681 > Docker images generated from dev build and from dist tarball are different > -- > > Key: SPARK-25682 > URL: https://issues.apache.org/jira/browse/SPARK-25682 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Minor > > There's at least one difference I noticed, because of this line: > {noformat} > COPY examples /opt/spark/examples > {noformat} > In a dev build, "examples" contains your usual source code and maven-style > directories, whereas in the dist version, it's this: > {code} > cp "$SPARK_HOME"/examples/target/scala*/jars/* "$DISTDIR/examples/jars" > {code} > So the path to the actual jar files ends up being different depending on how > you built the image.
[jira] [Assigned] (SPARK-25685) Allow regression testing in enterprise Jenkins
[ https://issues.apache.org/jira/browse/SPARK-25685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25685: Assignee: Apache Spark > Allow regression testing in enterprise Jenkins > -- > > Key: SPARK-25685 > URL: https://issues.apache.org/jira/browse/SPARK-25685 > Project: Spark > Issue Type: Bug > Components: Build, Tests >Affects Versions: 2.3.2 >Reporter: Lantao Jin >Assignee: Apache Spark >Priority: Minor > > Add some environment variables to allow regression testing in enterprise > Jenkins instead of default Spark repository in GitHub.
[jira] [Commented] (SPARK-25684) Organize header related codes in CSV datasource
[ https://issues.apache.org/jira/browse/SPARK-25684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16642922#comment-16642922 ] Apache Spark commented on SPARK-25684: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/22676 > Organize header related codes in CSV datasource > --- > > Key: SPARK-25684 > URL: https://issues.apache.org/jira/browse/SPARK-25684 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > JIRAs like SPARK-23786 and SPARK-25134 added some codes related with headers. > This made review difficult and left a somewhat convoluted code path.
[jira] [Commented] (SPARK-25683) Make AsyncEventQueue.lastReportTimestamp initial value as the currentTime instead of 0
[ https://issues.apache.org/jira/browse/SPARK-25683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16642936#comment-16642936 ] Apache Spark commented on SPARK-25683: -- User 'shivusondur' has created a pull request for this issue: https://github.com/apache/spark/pull/22677 > Make AsyncEventQueue.lastReportTimestamp initial value as the currentTime > instead of 0 > - > > Key: SPARK-25683 > URL: https://issues.apache.org/jira/browse/SPARK-25683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: Devaraj K >Priority: Trivial > > {code:xml} > 18/10/08 17:51:40 ERROR AsyncEventQueue: Dropping event from queue eventLog. > This likely means one of the listeners is too slow and cannot keep up with > the rate at which tasks are being started by the scheduler. > 18/10/08 17:51:40 WARN AsyncEventQueue: Dropped 1 events from eventLog since > Wed Dec 31 16:00:00 PST 1969. > 18/10/08 17:52:40 WARN AsyncEventQueue: Dropped 144853 events from eventLog > since Mon Oct 08 17:51:40 PDT 2018. > {code} > Here the first log line shows the time as Wed Dec 31 16:00:00 PST 1969; I > think it would be better to show the initialization time here.
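The `Wed Dec 31 16:00:00 PST 1969` in the log above is simply the field's default value `0L` (the Unix epoch) rendered in the Pacific timezone. A small plain-Java illustration (the class and format pattern here are mine, not Spark's):

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class EpochDemo {
    public static void main(String[] args) {
        // lastReportTimestamp defaults to 0L, i.e. the Unix epoch
        ZonedDateTime t = ZonedDateTime.ofInstant(
                Instant.ofEpochMilli(0L), ZoneId.of("America/Los_Angeles"));
        String s = t.format(
                DateTimeFormatter.ofPattern("EEE MMM dd HH:mm:ss zzz yyyy", Locale.US));
        // Matches the "since Wed Dec 31 16:00:00 PST 1969" seen in the log
        System.out.println(s);
    }
}
```

Initializing the field to the construction time, as the ticket proposes, makes the first "Dropped N events since …" message report a meaningful interval instead of an epoch date.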
[jira] [Assigned] (SPARK-25683) Make AsyncEventQueue.lastReportTimestamp initial value as the currentTime instead of 0
[ https://issues.apache.org/jira/browse/SPARK-25683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25683: Assignee: Apache Spark > Make AsyncEventQueue.lastReportTimestamp initial value as the currentTime > instead of 0 > - > > Key: SPARK-25683 > URL: https://issues.apache.org/jira/browse/SPARK-25683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: Devaraj K >Assignee: Apache Spark >Priority: Trivial > > {code:xml} > 18/10/08 17:51:40 ERROR AsyncEventQueue: Dropping event from queue eventLog. > This likely means one of the listeners is too slow and cannot keep up with > the rate at which tasks are being started by the scheduler. > 18/10/08 17:51:40 WARN AsyncEventQueue: Dropped 1 events from eventLog since > Wed Dec 31 16:00:00 PST 1969. > 18/10/08 17:52:40 WARN AsyncEventQueue: Dropped 144853 events from eventLog > since Mon Oct 08 17:51:40 PDT 2018. > {code} > Here the first log line shows the time as Wed Dec 31 16:00:00 PST 1969; I > think it would be better to show the initialization time here.
[jira] [Assigned] (SPARK-25683) Make AsyncEventQueue.lastReportTimestamp initial value as the currentTime instead of 0
[ https://issues.apache.org/jira/browse/SPARK-25683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25683: Assignee: (was: Apache Spark) > Make AsyncEventQueue.lastReportTimestamp initial value as the currentTime > instead of 0 > - > > Key: SPARK-25683 > URL: https://issues.apache.org/jira/browse/SPARK-25683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: Devaraj K >Priority: Trivial > > {code:xml} > 18/10/08 17:51:40 ERROR AsyncEventQueue: Dropping event from queue eventLog. > This likely means one of the listeners is too slow and cannot keep up with > the rate at which tasks are being started by the scheduler. > 18/10/08 17:51:40 WARN AsyncEventQueue: Dropped 1 events from eventLog since > Wed Dec 31 16:00:00 PST 1969. > 18/10/08 17:52:40 WARN AsyncEventQueue: Dropped 144853 events from eventLog > since Mon Oct 08 17:51:40 PDT 2018. > {code} > Here the first log line shows the time as Wed Dec 31 16:00:00 PST 1969; I > think it would be better to show the initialization time here.
[jira] [Commented] (SPARK-25683) Make AsyncEventQueue.lastReportTimestamp initial value as the currentTime instead of 0
[ https://issues.apache.org/jira/browse/SPARK-25683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16642874#comment-16642874 ] shivusondur commented on SPARK-25683: - I am working on it. > Make AsyncEventQueue.lastReportTimestamp initial value as the currentTime > instead of 0 > - > > Key: SPARK-25683 > URL: https://issues.apache.org/jira/browse/SPARK-25683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: Devaraj K >Priority: Trivial > > {code:xml} > 18/10/08 17:51:40 ERROR AsyncEventQueue: Dropping event from queue eventLog. > This likely means one of the listeners is too slow and cannot keep up with > the rate at which tasks are being started by the scheduler. > 18/10/08 17:51:40 WARN AsyncEventQueue: Dropped 1 events from eventLog since > Wed Dec 31 16:00:00 PST 1969. > 18/10/08 17:52:40 WARN AsyncEventQueue: Dropped 144853 events from eventLog > since Mon Oct 08 17:51:40 PDT 2018. > {code} > Here it shows the time as Wed Dec 31 16:00:00 PST 1969 for the first log; I > think it would be better to show the initialization time here. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
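The change proposed in SPARK-25683 can be sketched outside of Spark. The following is a hypothetical Java model (the class and method names are invented, not Spark's AsyncEventQueue internals): seeding the "last reported" timestamp with the current time instead of 0 keeps the first dropped-events message from printing the Unix epoch.

```java
// Hypothetical sketch of the fix: initialize the "last reported" timestamp
// to construction time rather than 0, so the first "Dropped N events since"
// message never renders as "Wed Dec 31 16:00:00 PST 1969".
import java.util.Date;
import java.util.concurrent.atomic.AtomicLong;

class DropCounter {
    private final AtomicLong droppedEvents = new AtomicLong(0);
    // 0 would render as the Unix epoch; start the window from "now" instead.
    private volatile long lastReportTimestamp = System.currentTimeMillis();

    void onDrop() {
        droppedEvents.incrementAndGet();
    }

    /** Returns the log line that would be emitted, then resets the window. */
    String report() {
        long since = lastReportTimestamp;
        long count = droppedEvents.getAndSet(0);
        lastReportTimestamp = System.currentTimeMillis();
        return "Dropped " + count + " events since " + new Date(since);
    }
}
```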
[jira] [Assigned] (SPARK-25497) limit operation within whole stage codegen should not consume all the inputs
[ https://issues.apache.org/jira/browse/SPARK-25497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki reassigned SPARK-25497: Assignee: Wenchen Fan > limit operation within whole stage codegen should not consume all the inputs > > > Key: SPARK-25497 > URL: https://issues.apache.org/jira/browse/SPARK-25497 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > > This issue was discovered during https://github.com/apache/spark/pull/21738 . > It turns out that limit is not whole-stage-codegened correctly and always > consume all the inputs -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25684) Organize header related codes in CSV datasource
[ https://issues.apache.org/jira/browse/SPARK-25684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25684: Assignee: Hyukjin Kwon (was: Apache Spark) > Organize header related codes in CSV datasource > --- > > Key: SPARK-25684 > URL: https://issues.apache.org/jira/browse/SPARK-25684 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > JIRAs like SPARK-23786 and SPARK-25134 added some codes related with headers. > This ended up with difficult review and a bit convoluted code path. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25684) Organize header related codes in CSV datasource
[ https://issues.apache.org/jira/browse/SPARK-25684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25684: Assignee: Apache Spark (was: Hyukjin Kwon) > Organize header related codes in CSV datasource > --- > > Key: SPARK-25684 > URL: https://issues.apache.org/jira/browse/SPARK-25684 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > JIRAs like SPARK-23786 and SPARK-25134 added some codes related with headers. > This ended up with difficult review and a bit convoluted code path. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25669) Check CSV header only when it exists
[ https://issues.apache.org/jira/browse/SPARK-25669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25669. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22656 [https://github.com/apache/spark/pull/22656] > Check CSV header only when it exists > > > Key: SPARK-25669 > URL: https://issues.apache.org/jira/browse/SPARK-25669 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 2.4.0 > > > Currently, Spark compares the header in CSV files to the field names in the provided > or inferred schema. The check is bypassed if the header doesn't exist and > CSV content is read from files. In the case when the input CSV comes as a dataset > of strings, Spark always compares the first row to the user-specified or > inferred schema. For example, parsing the following dataset: > {code:scala} > val input = Seq("1,2").toDS() > spark.read.option("enforceSchema", false).csv(input) > {code} > throws the exception: > {code:java} > java.lang.IllegalArgumentException: CSV header does not conform to the schema. > Header: 1, 2 > Schema: _c0, _c1 > Expected: _c0 but found: 1 > {code} > Need to prevent comparison of the first row (if it is not a header) to > the specified or inferred schema. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25669) Check CSV header only when it exists
[ https://issues.apache.org/jira/browse/SPARK-25669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-25669: - Fix Version/s: 3.0.0 > Check CSV header only when it exists > > > Key: SPARK-25669 > URL: https://issues.apache.org/jira/browse/SPARK-25669 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 2.4.0, 3.0.0 > > > Currently, Spark checks the header in CSV files to fields names in provided > or inferred schema. The check is bypassed if the header doesn't exists and > CSV content is read from files. In the case, when input CSV comes as dataset > of strings, Spark always compares the first row to the user specified or > inferred schema. For example, parsing the following dataset: > {code:scala} > val input = Seq("1,2").toDS() > spark.read.option("enforceSchema", false).csv(input) > {code} > throws the exception: > {code:java} > java.lang.IllegalArgumentException: CSV header does not conform to the schema. > Header: 1, 2 > Schema: _c0, _c1 > Expected: _c0 but found: 1 > {code} > Need to prevent comparison of the first row (if it is not a header) to > specific or inferred schema. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25669) Check CSV header only when it exists
[ https://issues.apache.org/jira/browse/SPARK-25669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-25669: Assignee: Maxim Gekk > Check CSV header only when it exists > > > Key: SPARK-25669 > URL: https://issues.apache.org/jira/browse/SPARK-25669 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 2.4.0, 3.0.0 > > > Currently, Spark checks the header in CSV files to fields names in provided > or inferred schema. The check is bypassed if the header doesn't exists and > CSV content is read from files. In the case, when input CSV comes as dataset > of strings, Spark always compares the first row to the user specified or > inferred schema. For example, parsing the following dataset: > {code:scala} > val input = Seq("1,2").toDS() > spark.read.option("enforceSchema", false).csv(input) > {code} > throws the exception: > {code:java} > java.lang.IllegalArgumentException: CSV header does not conform to the schema. > Header: 1, 2 > Schema: _c0, _c1 > Expected: _c0 but found: 1 > {code} > Need to prevent comparison of the first row (if it is not a header) to > specific or inferred schema. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
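The guard SPARK-25669 describes can be illustrated with a small, hypothetical Java sketch (this is not Spark's actual CSV code path, and enforceSchema handling is simplified): the first row is compared to the schema only when a header row is actually expected, so a data row like "1,2" is no longer rejected against `_c0, _c1`.

```java
// Simplified, illustrative model of the header check discussed above.
// Assumptions: hasHeader mirrors the "header" option; enforceSchema=true
// means the user's schema wins and no name comparison is needed.
import java.util.List;

class CsvHeaderChecker {
    static void checkHeader(boolean hasHeader, boolean enforceSchema,
                            List<String> firstRow, List<String> schemaFields) {
        if (!hasHeader || enforceSchema) {
            return; // first row is data, or schema is forced: nothing to validate
        }
        if (!firstRow.equals(schemaFields)) {
            throw new IllegalArgumentException(
                "CSV header does not conform to the schema. Header: " + firstRow
                + " Schema: " + schemaFields);
        }
    }
}
```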
[jira] [Created] (SPARK-25684) Organize header related codes in CSV datasource
Hyukjin Kwon created SPARK-25684: Summary: Organize header related codes in CSV datasource Key: SPARK-25684 URL: https://issues.apache.org/jira/browse/SPARK-25684 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Hyukjin Kwon Assignee: Hyukjin Kwon JIRAs like SPARK-23786 and SPARK-25134 added code related to headers. This made reviews difficult and left a somewhat convoluted code path. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25497) limit operation within whole stage codegen should not consume all the inputs
[ https://issues.apache.org/jira/browse/SPARK-25497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki resolved SPARK-25497. -- Resolution: Fixed Fix Version/s: 3.0.0 > limit operation within whole stage codegen should not consume all the inputs > > > Key: SPARK-25497 > URL: https://issues.apache.org/jira/browse/SPARK-25497 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > > This issue was discovered during https://github.com/apache/spark/pull/21738 . > It turns out that limit is not whole-stage-codegened correctly and always > consumes all the inputs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
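The behavior SPARK-25497 restores can be shown with a minimal iterator sketch (illustrative names, not Spark's generated code): a limit operator should stop pulling rows from its upstream once N rows have been produced, instead of draining the whole input.

```java
// A limit that short-circuits: once `limit` rows have been handed out,
// hasNext() returns false without touching the upstream iterator again.
import java.util.Iterator;
import java.util.NoSuchElementException;

class LimitIterator<T> implements Iterator<T> {
    private final Iterator<T> upstream;
    private final int limit;
    private int consumed = 0;

    LimitIterator(Iterator<T> upstream, int limit) {
        this.upstream = upstream;
        this.limit = limit;
    }

    @Override
    public boolean hasNext() {
        // Check the limit before consulting upstream, so no extra rows are pulled.
        return consumed < limit && upstream.hasNext();
    }

    @Override
    public T next() {
        if (!hasNext()) throw new NoSuchElementException();
        consumed++;
        return upstream.next();
    }
}
```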
[jira] [Assigned] (SPARK-25685) Allow regression testing in enterprise Jenkins
[ https://issues.apache.org/jira/browse/SPARK-25685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25685: Assignee: (was: Apache Spark) > Allow regression testing in enterprise Jenkins > -- > > Key: SPARK-25685 > URL: https://issues.apache.org/jira/browse/SPARK-25685 > Project: Spark > Issue Type: Bug > Components: Build, Tests >Affects Versions: 2.3.2 >Reporter: Lantao Jin >Priority: Minor > > Add some environment variables to allow regression testing in enterprise > Jenkins instead of default Spark repository in GitHub. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25685) Allow regression testing in enterprise Jenkins
[ https://issues.apache.org/jira/browse/SPARK-25685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16642963#comment-16642963 ] Apache Spark commented on SPARK-25685: -- User 'LantaoJin' has created a pull request for this issue: https://github.com/apache/spark/pull/22678 > Allow regression testing in enterprise Jenkins > -- > > Key: SPARK-25685 > URL: https://issues.apache.org/jira/browse/SPARK-25685 > Project: Spark > Issue Type: Bug > Components: Build, Tests >Affects Versions: 2.3.2 >Reporter: Lantao Jin >Priority: Minor > > Add some environment variables to allow regression testing in enterprise > Jenkins instead of default Spark repository in GitHub. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25685) Allow regression testing in enterprise Jenkins
[ https://issues.apache.org/jira/browse/SPARK-25685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16642966#comment-16642966 ] Apache Spark commented on SPARK-25685: -- User 'LantaoJin' has created a pull request for this issue: https://github.com/apache/spark/pull/22678 > Allow regression testing in enterprise Jenkins > -- > > Key: SPARK-25685 > URL: https://issues.apache.org/jira/browse/SPARK-25685 > Project: Spark > Issue Type: Bug > Components: Build, Tests >Affects Versions: 2.3.2 >Reporter: Lantao Jin >Priority: Minor > > Add some environment variables to allow regression testing in enterprise > Jenkins instead of default Spark repository in GitHub. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25685) Allow regression testing in enterprise Jenkins
Lantao Jin created SPARK-25685: -- Summary: Allow regression testing in enterprise Jenkins Key: SPARK-25685 URL: https://issues.apache.org/jira/browse/SPARK-25685 Project: Spark Issue Type: Bug Components: Build, Tests Affects Versions: 2.3.2 Reporter: Lantao Jin Add some environment variables to allow regression testing in enterprise Jenkins instead of default Spark repository in GitHub. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25685) Allow running tests in Jenkins in enterprise Git repository
[ https://issues.apache.org/jira/browse/SPARK-25685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-25685: --- Summary: Allow running tests in Jenkins in enterprise Git repository (was: Allow regression testing in enterprise Jenkins) > Allow running tests in Jenkins in enterprise Git repository > --- > > Key: SPARK-25685 > URL: https://issues.apache.org/jira/browse/SPARK-25685 > Project: Spark > Issue Type: Bug > Components: Build, Tests >Affects Versions: 2.3.2 >Reporter: Lantao Jin >Priority: Minor > > Add some environment variables to allow regression testing in enterprise > Jenkins instead of default Spark repository in GitHub. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20780) Spark Kafka10 Consumer Hangs
[ https://issues.apache.org/jira/browse/SPARK-20780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16643235#comment-16643235 ] Mary Scott commented on SPARK-20780: Hello [~jayadeepj]. I am struggling with the identical issue. Have you found any workaround/solution for the problem? > Spark Kafka10 Consumer Hangs > > > Key: SPARK-20780 > URL: https://issues.apache.org/jira/browse/SPARK-20780 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.0 > Environment: Spark 2.1.0 > Spark Streaming Kafka 010 > Yarn - Cluster Mode > CDH 5.8.4 > CentOS Linux release 7.2 >Reporter: jayadeepj >Priority: Major > Attachments: streaming_1.png, streaming_2.png, tasks_timing_out_3.png > > > We have recently upgraded our Streaming App with Direct Stream to Spark 2 > (spark-streaming-kafka-0-10 - 2.1.0) with Kafka version (0.10.0.0) & Consumer > 10 . We find abnormal delays after the application has run for a couple of > hours or completed consumption of approx. ~ 5 million records. > See screenshot 1 & 2 > There is a sudden dip in the processing time from ~15 seconds (usual for this > app) to ~3 minutes & from then on the processing time keeps degrading > throughout. > We have seen that the delay is due to certain tasks taking the exact time > duration of the configured Kafka Consumer 'request.timeout.ms' . We have > tested this by varying timeout property to different values. > See screenshot 3. > I think the get(offset: Long, timeout: Long): ConsumerRecord[K, V] method & > subsequent poll(timeout) method in CachedKafkaConsumer.scala is actually > timing out on some of the partitions without reading data. But the executor > logs it as successfully completed after the exact timeout duration. Note that > most other tasks are completing successfully with millisecond duration. The > timeout is most likely from the > org.apache.kafka.clients.consumer.KafkaConsumer & we did not observe any > network latency difference. 
> We have observed this across multiple clusters & multiple apps with & without > TLS/SSL. Spark 1.6 with 0-8 consumer seems to be fine with consistent > performance > 17/05/17 10:30:06 INFO executor.CoarseGrainedExecutorBackend: Got assigned > task 446288 > 17/05/17 10:30:06 INFO executor.Executor: Running task 11.0 in stage 5663.0 > (TID 446288) > 17/05/17 10:30:06 INFO kafka010.KafkaRDD: Computing topic XX-XXX-XX, > partition 0 offsets 776843 -> 779591 > 17/05/17 10:30:06 INFO kafka010.CachedKafkaConsumer: Initial fetch for > spark-executor-default1 XX-XXX-XX 0 776843 > 17/05/17 10:30:56 INFO executor.Executor: Finished task 11.0 in stage 5663.0 > (TID 446288). 1699 bytes result sent to driver > 17/05/17 10:30:56 INFO executor.CoarseGrainedExecutorBackend: Got assigned > task 446329 > 17/05/17 10:30:56 INFO executor.Executor: Running task 0.0 in stage 5667.0 > (TID 446329) > 17/05/17 10:30:56 INFO spark.MapOutputTrackerWorker: Updating epoch to 3116 > and clearing cache > 17/05/17 10:30:56 INFO broadcast.TorrentBroadcast: Started reading broadcast > variable 6807 > 17/05/17 10:30:56 INFO memory.MemoryStore: Block broadcast_6807_piece0 stored > as bytes in memory (estimated size 13.1 KB, free 4.1 GB) > 17/05/17 10:30:56 INFO broadcast.TorrentBroadcast: Reading broadcast variable > 6807 took 4 ms > 17/05/17 10:30:56 INFO memory.MemoryStore: Block broadcast_6807 stored as > values in m > We can see that the log statement differ with the exact timeout duration. > Our consumer config is below. 
> 17/05/17 12:33:13 INFO dstream.ForEachDStream: Initialized and validated > org.apache.spark.streaming.dstream.ForEachDStream@1171dde4 > 17/05/17 12:33:13 INFO consumer.ConsumerConfig: ConsumerConfig values: > metric.reporters = [] > metadata.max.age.ms = 30 > partition.assignment.strategy = > [org.apache.kafka.clients.consumer.RangeAssignor] > reconnect.backoff.ms = 50 > sasl.kerberos.ticket.renew.window.factor = 0.8 > max.partition.fetch.bytes = 1048576 > bootstrap.servers = [x.xxx.xxx:9092] > ssl.keystore.type = JKS > enable.auto.commit = true > sasl.mechanism = GSSAPI > interceptor.classes = null > exclude.internal.topics = true > ssl.truststore.password = null > client.id = > ssl.endpoint.identification.algorithm = null > max.poll.records = 2147483647 > check.crcs = true > request.timeout.ms = 5 > heartbeat.interval.ms = 3000 > auto.commit.interval.ms = 5000 > receive.buffer.bytes = 65536 > ssl.truststore.type = JKS > ssl.truststore.location = null > ssl.keystore.password = null > fetch.min.bytes = 1 > send.buffer.bytes = 131072 > value.deserializer = class >
[jira] [Commented] (SPARK-25344) Break large tests.py files into smaller files
[ https://issues.apache.org/jira/browse/SPARK-25344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16643663#comment-16643663 ] Imran Rashid commented on SPARK-25344: -- [~hyukjin.kwon] go for it! I dunno if [~bryanc] has stronger feelings about how the code should be organized. Again, if this turns into a headache to do it one go, I think it would be easier to first get all new tests to go into separate files, and then make this change piecemeal, as long as we know the organization we're aiming for. > Break large tests.py files into smaller files > - > > Key: SPARK-25344 > URL: https://issues.apache.org/jira/browse/SPARK-25344 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > > We've got a ton of tests in one humongous tests.py file, rather than breaking > it out into smaller files. > Having one huge file doesn't seem great for code organization, and it also > makes the test parallelization in run-tests.py not work as well. On my > laptop, tests.py takes 150s, and the next longest test file takes only 20s. > There are similarly large files in other pyspark modules, eg. sql/tests.py, > ml/tests.py, mllib/tests.py, streaming/tests.py. > It seems that at least for some of these files, its already broken into > independent test classes, so it shouldn't be too hard to just move them into > their own files. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18660) Parquet complains "Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImp
[ https://issues.apache.org/jira/browse/SPARK-18660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16643418#comment-16643418 ] Paul Praet commented on SPARK-18660: It's really polluting our logs. Any workaround? > Parquet complains "Can not initialize counter due to context is not a > instance of TaskInputOutputContext, but is > org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl " > -- > > Key: SPARK-18660 > URL: https://issues.apache.org/jira/browse/SPARK-18660 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Major > > Parquet record reader always complains "Can not initialize counter due to > context is not a instance of TaskInputOutputContext, but is > org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl". Looks like we > always create TaskAttemptContextImpl > (https://github.com/apache/spark/blob/2f7461f31331cfc37f6cfa3586b7bbefb3af5547/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L368). > But, Parquet wants to use TaskInputOutputContext, which is a subclass of > TaskAttemptContextImpl. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
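A common workaround for noisy, harmless warnings like the one asked about above is to raise the threshold for the offending logger. This is a hedged sketch assuming a log4j.properties-based setup and that the message is emitted under the org.apache.parquet logger hierarchy; the exact logger name may differ between Parquet versions, so check the prefix printed with the warning in your own logs.

```properties
# Hypothetical log4j.properties fragment: silence the repeated counter
# warning by only letting ERROR-level messages through from Parquet.
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
```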
[jira] [Assigned] (SPARK-24851) Map a Stage ID to it's Associated Job ID in UI
[ https://issues.apache.org/jira/browse/SPARK-24851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-24851: - Assignee: Parth Gandhi > Map a Stage ID to it's Associated Job ID in UI > -- > > Key: SPARK-24851 > URL: https://issues.apache.org/jira/browse/SPARK-24851 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0, 2.3.1 >Reporter: Parth Gandhi >Assignee: Parth Gandhi >Priority: Trivial > > It would be nice to have a field in Stage Page UI which would show mapping of > the current stage id to the job id's to which that stage belongs to. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25686) date_trunc Spark SQL function silently returns null if parameters are swapped
Zack Behringer created SPARK-25686: -- Summary: date_trunc Spark SQL function silently returns null if parameters are swapped Key: SPARK-25686 URL: https://issues.apache.org/jira/browse/SPARK-25686 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.1, 2.3.0 Reporter: Zack Behringer date_trunc(a_timestamp, 'minute') returns null, while date_trunc('minute', a_timestamp) returns a valid timestamp. It would be nice to have a runtime error to help catch the problem. This was not helped by the fact that the doc examples had the arguments swapped, but yes, I should have tested our use of it more thoroughly. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
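To make the failure mode above concrete, here is a toy Java model (not Spark's implementation): date_trunc's first argument is the truncation level, and an unrecognized level yields null rather than an error, which is exactly why swapping the arguments fails silently.

```java
// Toy date_trunc: the FIRST argument names the truncation level.
// Passing a timestamp there falls through to the default case and
// silently returns null, mirroring the behavior reported above.
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

class DateTruncDemo {
    static LocalDateTime dateTrunc(String level, LocalDateTime ts) {
        switch (level.toLowerCase()) {
            case "minute": return ts.truncatedTo(ChronoUnit.MINUTES);
            case "hour":   return ts.truncatedTo(ChronoUnit.HOURS);
            case "day":    return ts.truncatedTo(ChronoUnit.DAYS);
            default:       return null; // unrecognized level: silent null
        }
    }
}
```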
[jira] [Assigned] (SPARK-25535) Work around bad error checking in commons-crypto
[ https://issues.apache.org/jira/browse/SPARK-25535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-25535: Assignee: Marcelo Vanzin > Work around bad error checking in commons-crypto > > > Key: SPARK-25535 > URL: https://issues.apache.org/jira/browse/SPARK-25535 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.2 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Major > > The commons-crypto library used for encryption can get confused when certain > errors happen; that can lead to crashes since the Java side thinks the > ciphers are still valid while the native side has already cleaned up the > ciphers. > We can work around that in Spark by doing some error checking at a higher > level. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24851) Map a Stage ID to it's Associated Job ID in UI
[ https://issues.apache.org/jira/browse/SPARK-24851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-24851. --- Resolution: Fixed Fix Version/s: 3.0.0 > Map a Stage ID to it's Associated Job ID in UI > -- > > Key: SPARK-24851 > URL: https://issues.apache.org/jira/browse/SPARK-24851 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0, 2.3.1 >Reporter: Parth Gandhi >Assignee: Parth Gandhi >Priority: Trivial > Fix For: 3.0.0 > > > It would be nice to have a field in Stage Page UI which would show mapping of > the current stage id to the job id's to which that stage belongs to. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25535) Work around bad error checking in commons-crypto
[ https://issues.apache.org/jira/browse/SPARK-25535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-25535. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 22557 [https://github.com/apache/spark/pull/22557] > Work around bad error checking in commons-crypto > > > Key: SPARK-25535 > URL: https://issues.apache.org/jira/browse/SPARK-25535 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.2 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Major > Fix For: 3.0.0 > > > The commons-crypto library used for encryption can get confused when certain > errors happen; that can lead to crashes since the Java side thinks the > ciphers are still valid while the native side has already cleaned up the > ciphers. > We can work around that in Spark by doing some error checking at a higher > level. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
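The "error checking at a higher level" idea from SPARK-25535 can be sketched generically. This is a hypothetical wrapper (CheckedTransform is an invented name, and the delegate stands in for a native-backed cipher, not commons-crypto's actual API): once any operation fails, every later call fails fast on the Java side instead of touching native state that may already have been cleaned up.

```java
// Once the delegate throws, mark the wrapper broken so callers get a clear
// Java-side error instead of re-entering a possibly-freed native cipher.
import java.util.function.UnaryOperator;

class CheckedTransform {
    private final UnaryOperator<byte[]> delegate;
    private volatile boolean broken = false;

    CheckedTransform(UnaryOperator<byte[]> delegate) { this.delegate = delegate; }

    byte[] apply(byte[] in) {
        if (broken) {
            throw new IllegalStateException("transform previously failed; refusing to reuse it");
        }
        try {
            return delegate.apply(in);
        } catch (RuntimeException e) {
            broken = true; // poison the wrapper on the first failure
            throw e;
        }
    }
}
```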
[jira] [Commented] (SPARK-22388) Limit push down
[ https://issues.apache.org/jira/browse/SPARK-22388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16643478#comment-16643478 ] Aleksey Zinoviev commented on SPARK-22388: -- [~cloud_fan] Could you please add any description here? > Limit push down > --- > > Key: SPARK-22388 > URL: https://issues.apache.org/jira/browse/SPARK-22388 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25687) A dataset can store a column as sequence of Vectors but not directly vectors
Francisco Orchard created SPARK-25687: - Summary: A dataset can store a column as sequence of Vectors but not directly vectors Key: SPARK-25687 URL: https://issues.apache.org/jira/browse/SPARK-25687 Project: Spark Issue Type: Bug Components: ML, SQL Affects Versions: 2.3.1 Reporter: Francisco Orchard A dataset can store an array of vectors but not a vector. This is inconsistent. To reproduce:
{code:scala}
import org.apache.spark.sql.Row
import org.apache.spark.ml.linalg.{Vectors, DenseVector, Vector}
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
import org.apache.spark.sql.types._
import spark.implicits._

val rdd = sc.parallelize(Seq(Row(Seq(Vectors.dense(Array(1.0, 2.0)).toSparse))))
val arrayOfVectorsDS = spark.createDataFrame(rowRDD = rdd, schema = new StructType(Array(StructField(name = "value", dataType = ArrayType(elementType = VectorType))))).as[Seq[Vector]]
// val vectorsDS = arrayOfVectorsDS.flatMap(a => a)
arrayOfVectorsDS.show
{code}
If the line before ".show" is uncommented, this code will throw the well known error: error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25344) Break large tests.py files into smaller files
[ https://issues.apache.org/jira/browse/SPARK-25344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16643762#comment-16643762 ] Bryan Cutler commented on SPARK-25344: -- No I don't have strong feelings, my only preference was to put the tests in a subdir. I'm sure it's quite a lot of work, so however you see fit [~hyukjin.kwon] is fine! If you are planning on breaking this up into tasks, I can try to help out as well. > Break large tests.py files into smaller files > - > > Key: SPARK-25344 > URL: https://issues.apache.org/jira/browse/SPARK-25344 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > > We've got a ton of tests in one humongous tests.py file, rather than breaking > it out into smaller files. > Having one huge file doesn't seem great for code organization, and it also > makes the test parallelization in run-tests.py not work as well. On my > laptop, tests.py takes 150s, and the next longest test file takes only 20s. > There are similarly large files in other pyspark modules, eg. sql/tests.py, > ml/tests.py, mllib/tests.py, streaming/tests.py. > It seems that at least for some of these files, its already broken into > independent test classes, so it shouldn't be too hard to just move them into > their own files. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25559) Just remove the unsupported predicates in Parquet
[ https://issues.apache.org/jira/browse/SPARK-25559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16643765#comment-16643765 ] Apache Spark commented on SPARK-25559: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/22679 > Just remove the unsupported predicates in Parquet > - > > Key: SPARK-25559 > URL: https://issues.apache.org/jira/browse/SPARK-25559 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > Fix For: 3.0.0 > > > Currently, in *ParquetFilters*, if one of the children predicates is not > supported by Parquet, the entire predicates will be thrown away. In fact, if > the unsupported predicate is in the top level *And* condition or in the child > before hitting *Not* or *Or* condition, it's safe to just remove the > unsupported one as unhandled filters. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25559) Just remove the unsupported predicates in Parquet
[ https://issues.apache.org/jira/browse/SPARK-25559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16643761#comment-16643761 ] Apache Spark commented on SPARK-25559: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/22679 > Just remove the unsupported predicates in Parquet > - > > Key: SPARK-25559 > URL: https://issues.apache.org/jira/browse/SPARK-25559 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > Fix For: 3.0.0 > > > Currently, in *ParquetFilters*, if one of the children predicates is not > supported by Parquet, the entire predicates will be thrown away. In fact, if > the unsupported predicate is in the top level *And* condition or in the child > before hitting *Not* or *Or* condition, it's safe to just remove the > unsupported one as unhandled filters. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
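The pruning rule described in SPARK-25559 above can be illustrated with a small sketch. The types and methods below are hypothetical, not Spark's actual ParquetFilters code; the point is only the safety argument: under a top-level And, a child that Parquet cannot handle may simply be dropped (And(a, b) implies a on its own, so filtering by a alone is still sound), while anything under an Or or Not must be fully convertible or nothing can be pushed.

```java
import java.util.Optional;

// Hypothetical mini predicate tree, standing in for Catalyst's sources.Filter.
abstract class Pred {}
class Leaf extends Pred {
    final boolean supported;          // can Parquet handle this leaf?
    Leaf(boolean supported) { this.supported = supported; }
}
class And extends Pred { final Pred l, r; And(Pred l, Pred r) { this.l = l; this.r = r; } }
class Or extends Pred { final Pred l, r; Or(Pred l, Pred r) { this.l = l; this.r = r; } }
class Not extends Pred { final Pred c; Not(Pred c) { this.c = c; } }

class PredicatePruning {
    // True if every leaf under p can be converted to a Parquet filter.
    static boolean fullySupported(Pred p) {
        if (p instanceof Leaf) return ((Leaf) p).supported;
        if (p instanceof And) return fullySupported(((And) p).l) && fullySupported(((And) p).r);
        if (p instanceof Or)  return fullySupported(((Or) p).l) && fullySupported(((Or) p).r);
        if (p instanceof Not) return fullySupported(((Not) p).c);
        return false;
    }

    // Returns the pushable part of p, or empty if nothing can be pushed down.
    // The dropped part would be reported back as an unhandled filter.
    static Optional<Pred> prune(Pred p) {
        if (p instanceof And) {
            Optional<Pred> l = prune(((And) p).l), r = prune(((And) p).r);
            if (l.isPresent() && r.isPresent()) return Optional.of(new And(l.get(), r.get()));
            if (l.isPresent()) return l;  // safe: And implies each conjunct on its own
            return r;                     // r may itself be empty, which is also correct
        }
        // Below Or or Not, dropping a child changes the predicate's meaning,
        // so the whole subtree must convert or nothing is pushed.
        return fullySupported(p) ? Optional.of(p) : Optional.empty();
    }
}
```

Spark's row-level evaluation re-applies the full predicate afterwards, so pushing only the sound conjuncts changes performance, not results.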
[jira] [Commented] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite
[ https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16644162#comment-16644162 ] Shixiong Zhu commented on SPARK-25692: -- It may be caused by https://github.com/apache/spark/pull/22173 > Flaky test: ChunkFetchIntegrationSuite > -- > > Key: SPARK-25692 > URL: https://issues.apache.org/jira/browse/SPARK-25692 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Shixiong Zhu >Priority: Blocker > > Looks like the whole test suite is pretty flaky. See: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/ > This may be a regression in 3.0 as this didn't happen in 2.4 branch. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25694) URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection
Bo Yang created SPARK-25694: --- Summary: URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection Key: SPARK-25694 URL: https://issues.apache.org/jira/browse/SPARK-25694 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 2.3.2, 2.3.1, 2.3.0 Reporter: Bo Yang
[jira] [Updated] (SPARK-25694) URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection
[ https://issues.apache.org/jira/browse/SPARK-25694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bo Yang updated SPARK-25694: Description: URL.setURLStreamHandlerFactory() in SharedState causes URL.openConnection() returns FsUrlConnection object, which is not compatible with HttpURLConnection. This will cause exception when using some third party http library (e.g. scalaj.http). The following code in Spark 2.3.0 introduced the issue: sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala {quote}object SharedState extends Logging { ... URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()) ... } {quote} Example exception when using scalaj.http in Spark: StackTrace: scala.MatchError: org.apache.hadoop.fs.FsUrlConnection:http://.example.com (of class org.apache.hadoop.fs.FsUrlConnection) at scalaj.http.HttpRequest.scalaj$http$HttpRequest$$doConnection(Http.scala:343) at scalaj.http.HttpRequest.exec(Http.scala:335) at scalaj.http.HttpRequest.asString(Http.scala:455) > URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection > - > > Key: SPARK-25694 > URL: https://issues.apache.org/jira/browse/SPARK-25694 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2 >Reporter: Bo Yang >Priority: Major > > URL.setURLStreamHandlerFactory() in SharedState causes URL.openConnection() > returns FsUrlConnection object, which is not compatible with > HttpURLConnection. This will cause exception when using some third party http > library (e.g. scalaj.http). > The following code in Spark 2.3.0 introduced the issue: > sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala > {quote}object SharedState extends Logging { > ... > URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()) > ... 
> } > {quote} > > Example exception when using scalaj.http in Spark: > StackTrace: scala.MatchError: > org.apache.hadoop.fs.FsUrlConnection:http://.example.com (of class > org.apache.hadoop.fs.FsUrlConnection) > at > scalaj.http.HttpRequest.scalaj$http$HttpRequest$$doConnection(Http.scala:343) > at scalaj.http.HttpRequest.exec(Http.scala:335) > at scalaj.http.HttpRequest.asString(Http.scala:455) > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25694) URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection issue
[ https://issues.apache.org/jira/browse/SPARK-25694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bo Yang updated SPARK-25694: Description: URL.setURLStreamHandlerFactory() in SharedState causes URL.openConnection() returns FsUrlConnection object, which is not compatible with HttpURLConnection. This will cause exception when using some third party http library (e.g. scalaj.http). The following code in Spark 2.3.0 introduced the issue: sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala: {code} object SharedState extends Logging { ... URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()) ... } {code} Here is the example exception when using scalaj.http in Spark: {code} StackTrace: scala.MatchError: org.apache.hadoop.fs.FsUrlConnection:[http://.example.com|http://.example.com/] (of class org.apache.hadoop.fs.FsUrlConnection) at scalaj.http.HttpRequest.scalaj$http$HttpRequest$$doConnection(Http.scala:343) at scalaj.http.HttpRequest.exec(Http.scala:335) at scalaj.http.HttpRequest.asString(Http.scala:455) {code} One option to fix the issue is to return null in URLStreamHandlerFactory.createURLStreamHandler when the protocol is http/https, so it will use the default behavior and be compatible with scalaj.http. Following is the code example: {code} class SparkUrlStreamHandlerFactory extends URLStreamHandlerFactory with Logging { private val fsUrlStreamHandlerFactory = new FsUrlStreamHandlerFactory() override def createURLStreamHandler(protocol: String): URLStreamHandler = { val handler = fsUrlStreamHandlerFactory.createURLStreamHandler(protocol) if (handler == null) { return null } if (protocol != null && (protocol.equalsIgnoreCase("http") || protocol.equalsIgnoreCase("https"))) { // return null to use system default URLStreamHandler null } else { handler } } } {code} I would like to get some discussion here before submitting a pull request. 
was: URL.setURLStreamHandlerFactory() in SharedState causes URL.openConnection() returns FsUrlConnection object, which is not compatible with HttpURLConnection. This will cause exception when using some third party http library (e.g. scalaj.http). The following code in Spark 2.3.0 introduced the issue: sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala: {code} object SharedState extends Logging { ... URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()) ... } {code} Here is the example exception when using scalaj.http in Spark: {code} StackTrace: scala.MatchError: org.apache.hadoop.fs.FsUrlConnection:[http://.example.com|http://.example.com/] (of class org.apache.hadoop.fs.FsUrlConnection) at scalaj.http.HttpRequest.scalaj$http$HttpRequest$$doConnection(Http.scala:343) at scalaj.http.HttpRequest.exec(Http.scala:335) at scalaj.http.HttpRequest.asString(Http.scala:455) {code} One option to fix the issue is to return null in URLStreamHandlerFactory.createURLStreamHandler when the protocol is http/https, so it will use the default behavior and be compatible with scalaj.http. 
Following is the code example: {code} class SparkUrlStreamHandlerFactory extends URLStreamHandlerFactory with Logging { private val fsUrlStreamHandlerFactory = new FsUrlStreamHandlerFactory() override def createURLStreamHandler(protocol: String): URLStreamHandler = { val handler = fsUrlStreamHandlerFactory.createURLStreamHandler(protocol) if (handler == null) { return null } if (protocol != null && (protocol.equalsIgnoreCase("http") || protocol.equalsIgnoreCase("https"))) { // return null to use system default URLStreamHandler null } else { handler } } } {code} > URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection issue > --- > > Key: SPARK-25694 > URL: https://issues.apache.org/jira/browse/SPARK-25694 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2 >Reporter: Bo Yang >Priority: Minor > > URL.setURLStreamHandlerFactory() in SharedState causes URL.openConnection() > returns FsUrlConnection object, which is not compatible with > HttpURLConnection. This will cause exception when using some third party http > library (e.g. scalaj.http). > The following code in Spark 2.3.0 introduced the issue: > sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala: > {code} > object SharedState extends Logging { ... > URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()) ... > } > {code} > Here is the example exception when using scalaj.http in Spark: > {code} > StackTrace: scala.MatchError: >
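The delegation pattern proposed above can be written as a self-contained sketch in plain Java. The real fix would wrap org.apache.hadoop.fs.FsUrlStreamHandlerFactory; here a hypothetical StubHadoopFactory stands in for it so the example runs on its own. Returning null from createURLStreamHandler makes java.net.URL fall back to the JDK's built-in handler, so http/https URLs keep producing HttpURLConnection:

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;

// Hypothetical stand-in for Hadoop's FsUrlStreamHandlerFactory, which
// (like the real one) claims http in addition to filesystem schemes.
class StubHadoopFactory implements URLStreamHandlerFactory {
    @Override
    public URLStreamHandler createURLStreamHandler(String protocol) {
        if ("hdfs".equals(protocol) || "http".equals(protocol)) {
            return new URLStreamHandler() {
                @Override
                protected URLConnection openConnection(URL u) throws IOException {
                    throw new IOException("stub: not a real handler");
                }
            };
        }
        return null;
    }
}

// The proposed workaround: delegate to the Hadoop factory, but answer null
// for http/https so the system default handler (HttpURLConnection) is used.
class DelegatingFactory implements URLStreamHandlerFactory {
    private final URLStreamHandlerFactory delegate = new StubHadoopFactory();

    @Override
    public URLStreamHandler createURLStreamHandler(String protocol) {
        URLStreamHandler handler = delegate.createURLStreamHandler(protocol);
        if (handler == null) return null;
        if ("http".equalsIgnoreCase(protocol) || "https".equalsIgnoreCase(protocol)) {
            return null; // fall back to the JDK default URLStreamHandler
        }
        return handler;
    }
}
```

URL.setURLStreamHandlerFactory(new DelegatingFactory()) could then replace the direct registration of the Hadoop factory in SharedState; note the JDK allows that factory to be set only once per JVM, which is why the choice of factory matters so much here.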
[jira] [Commented] (SPARK-25682) Docker images generated from dev build and from dist tarball are different
[ https://issues.apache.org/jira/browse/SPARK-25682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16644198#comment-16644198 ] Yinan Li commented on SPARK-25682: -- Cool, thanks! > Docker images generated from dev build and from dist tarball are different > -- > > Key: SPARK-25682 > URL: https://issues.apache.org/jira/browse/SPARK-25682 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Minor > > There's at least one difference I noticed, because of this line: > {noformat} > COPY examples /opt/spark/examples > {noformat} > In a dev build, "examples" contains your usual source code and maven-style > directories, whereas in the dist version, it's this: > {code} > cp "$SPARK_HOME"/examples/target/scala*/jars/* "$DISTDIR/examples/jars" > {code} > So the path to the actual jar files ends up being different depending on how > you built the image. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25682) Docker images generated from dev build and from dist tarball are different
[ https://issues.apache.org/jira/browse/SPARK-25682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16644157#comment-16644157 ] Yinan Li commented on SPARK-25682: -- That looks like to me the only difference. {{bin}}, {{sbin}}, and {{data}} are also hard-coded but they appear to be the same between the source and a distribution. Are you working on a fix? > Docker images generated from dev build and from dist tarball are different > -- > > Key: SPARK-25682 > URL: https://issues.apache.org/jira/browse/SPARK-25682 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Minor > > There's at least one difference I noticed, because of this line: > {noformat} > COPY examples /opt/spark/examples > {noformat} > In a dev build, "examples" contains your usual source code and maven-style > directories, whereas in the dist version, it's this: > {code} > cp "$SPARK_HOME"/examples/target/scala*/jars/* "$DISTDIR/examples/jars" > {code} > So the path to the actual jar files ends up being different depending on how > you built the image. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25694) URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection
[ https://issues.apache.org/jira/browse/SPARK-25694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bo Yang updated SPARK-25694: Description: URL.setURLStreamHandlerFactory() in SharedState causes URL.openConnection() returns FsUrlConnection object, which is not compatible with HttpURLConnection. This will cause exception when using some third party http library (e.g. scalaj.http). The following code in Spark 2.3.0 introduced the issue: sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala {quote}object SharedState extends Logging Unknown macro: \{ ... URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()) ... }{quote} Example exception when using scalaj.http in Spark: StackTrace: scala.MatchError: org.apache.hadoop.fs.FsUrlConnection:[http://.example.com|http://.example.com/] (of class org.apache.hadoop.fs.FsUrlConnection) at scalaj.http.HttpRequest.scalaj$http$HttpRequest$$doConnection(Http.scala:343) at scalaj.http.HttpRequest.exec(Http.scala:335) at scalaj.http.HttpRequest.asString(Http.scala:455) One option to fix the issue is: {quote}class SparkUrlStreamHandlerFactory extends URLStreamHandlerFactory with Logging { private val fsUrlStreamHandlerFactory = new FsUrlStreamHandlerFactory() override def createURLStreamHandler(protocol: String): URLStreamHandler = { val handler = fsUrlStreamHandlerFactory.createURLStreamHandler(protocol) if (handler == null) { return null } if (protocol != null && (protocol.equalsIgnoreCase("http") || protocol.equalsIgnoreCase("https"))) { // return null to use system default URLStreamHandler logDebug("Use system default URLStreamHandler for " + protocol) null } else { handler } } }{quote} was: URL.setURLStreamHandlerFactory() in SharedState causes URL.openConnection() returns FsUrlConnection object, which is not compatible with HttpURLConnection. This will cause exception when using some third party http library (e.g. scalaj.http). 
The following code in Spark 2.3.0 introduced the issue: sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala {quote}object SharedState extends Logging { ... URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()) ... } {quote} Example exception when using scalaj.http in Spark: StackTrace: scala.MatchError: org.apache.hadoop.fs.FsUrlConnection:http://.example.com (of class org.apache.hadoop.fs.FsUrlConnection) at scalaj.http.HttpRequest.scalaj$http$HttpRequest$$doConnection(Http.scala:343) at scalaj.http.HttpRequest.exec(Http.scala:335) at scalaj.http.HttpRequest.asString(Http.scala:455) > URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection > - > > Key: SPARK-25694 > URL: https://issues.apache.org/jira/browse/SPARK-25694 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2 >Reporter: Bo Yang >Priority: Major > > URL.setURLStreamHandlerFactory() in SharedState causes URL.openConnection() > returns FsUrlConnection object, which is not compatible with > HttpURLConnection. This will cause exception when using some third party http > library (e.g. scalaj.http). > The following code in Spark 2.3.0 introduced the issue: > sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala > {quote}object SharedState extends Logging > Unknown macro: \{ ... URL.setURLStreamHandlerFactory(new > FsUrlStreamHandlerFactory()) ... 
}{quote} > > Example exception when using scalaj.http in Spark: > StackTrace: scala.MatchError: > org.apache.hadoop.fs.FsUrlConnection:[http://.example.com|http://.example.com/] > (of class org.apache.hadoop.fs.FsUrlConnection) > at > scalaj.http.HttpRequest.scalaj$http$HttpRequest$$doConnection(Http.scala:343) > at scalaj.http.HttpRequest.exec(Http.scala:335) > at scalaj.http.HttpRequest.asString(Http.scala:455) > > > One option to fix the issue is: > > {quote}class SparkUrlStreamHandlerFactory extends URLStreamHandlerFactory > with Logging { > private val fsUrlStreamHandlerFactory = new FsUrlStreamHandlerFactory() > override def createURLStreamHandler(protocol: String): URLStreamHandler = { > val handler = fsUrlStreamHandlerFactory.createURLStreamHandler(protocol) > if (handler == null) { > return null > } > if (protocol != null && > (protocol.equalsIgnoreCase("http") > || protocol.equalsIgnoreCase("https"))) { > // return null to use system default URLStreamHandler > logDebug("Use system default URLStreamHandler for " + protocol) > null > } else { > handler > } > } > }{quote} > > -- This message was sent by
[jira] [Commented] (SPARK-25682) Docker images generated from dev build and from dist tarball are different
[ https://issues.apache.org/jira/browse/SPARK-25682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16644194#comment-16644194 ] Marcelo Vanzin commented on SPARK-25682: I was busy with other things but I can send a patch later today. > Docker images generated from dev build and from dist tarball are different > -- > > Key: SPARK-25682 > URL: https://issues.apache.org/jira/browse/SPARK-25682 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Minor > > There's at least one difference I noticed, because of this line: > {noformat} > COPY examples /opt/spark/examples > {noformat} > In a dev build, "examples" contains your usual source code and maven-style > directories, whereas in the dist version, it's this: > {code} > cp "$SPARK_HOME"/examples/target/scala*/jars/* "$DISTDIR/examples/jars" > {code} > So the path to the actual jar files ends up being different depending on how > you built the image. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
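One way to remove the divergence described above would be to normalize the dev tree before the image build, so the Dockerfile's COPY sees the same layout either way. A minimal shell sketch, with hypothetical paths and an illustrative Scala version (the actual patch may well take a different approach):

```shell
set -e
# Stand-in source tree for a dev build (the scala-2.12 path is illustrative).
workdir=$(mktemp -d)
mkdir -p "$workdir/examples/target/scala-2.12/jars"
touch "$workdir/examples/target/scala-2.12/jars/spark-examples.jar"

# If the jars sit under maven-style target/ dirs (dev build), flatten them
# into examples/jars, the layout a dist tarball already has.
if ls "$workdir"/examples/target/scala*/jars/*.jar >/dev/null 2>&1; then
  mkdir -p "$workdir/examples/jars"
  cp "$workdir"/examples/target/scala*/jars/*.jar "$workdir/examples/jars/"
fi

ls "$workdir/examples/jars"
```

After this step, `COPY examples /opt/spark/examples` puts the jars at /opt/spark/examples/jars regardless of how the image was built.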
[jira] [Created] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite
Shixiong Zhu created SPARK-25692: Summary: Flaky test: ChunkFetchIntegrationSuite Key: SPARK-25692 URL: https://issues.apache.org/jira/browse/SPARK-25692 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.0 Reporter: Shixiong Zhu Looks like the whole test suite is pretty flaky. See: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/ This may be a regression in 2.4 as this didn't happen before.
[jira] [Updated] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite
[ https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25692: Affects Version/s: (was: 2.4.0) 3.0.0 > Flaky test: ChunkFetchIntegrationSuite > -- > > Key: SPARK-25692 > URL: https://issues.apache.org/jira/browse/SPARK-25692 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Shixiong Zhu >Priority: Blocker > > Looks like the whole test suite is pretty flaky. See: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/ > This may be a regression in 2.4 as this didn't happen before. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25694) URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection issue
[ https://issues.apache.org/jira/browse/SPARK-25694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bo Yang updated SPARK-25694: Priority: Minor (was: Major) > URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection issue > --- > > Key: SPARK-25694 > URL: https://issues.apache.org/jira/browse/SPARK-25694 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2 >Reporter: Bo Yang >Priority: Minor > > URL.setURLStreamHandlerFactory() in SharedState causes URL.openConnection() > returns FsUrlConnection object, which is not compatible with > HttpURLConnection. This will cause exception when using some third party http > library (e.g. scalaj.http). > The following code in Spark 2.3.0 introduced the issue: > sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala: > {code} > object SharedState extends Logging { ... > URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()) ... > } > {code} > Example exception when using scalaj.http in Spark: > StackTrace: scala.MatchError: > org.apache.hadoop.fs.FsUrlConnection:[http://.example.com|http://.example.com/] > (of class org.apache.hadoop.fs.FsUrlConnection) > at > scalaj.http.HttpRequest.scalaj$http$HttpRequest$$doConnection(Http.scala:343) > at scalaj.http.HttpRequest.exec(Http.scala:335) > at scalaj.http.HttpRequest.asString(Http.scala:455) > > One option to fix the issue is to return null in > URLStreamHandlerFactory.createURLStreamHandler when the protocol is > http/https, so it will use the default behavior and be compatible with > scalaj.http. 
Following is the code example: > {code} > class SparkUrlStreamHandlerFactory extends URLStreamHandlerFactory with > Logging { > private val fsUrlStreamHandlerFactory = new FsUrlStreamHandlerFactory() > override def createURLStreamHandler(protocol: String): URLStreamHandler = { > val handler = fsUrlStreamHandlerFactory.createURLStreamHandler(protocol) > if (handler == null) { > return null > } > if (protocol != null && > (protocol.equalsIgnoreCase("http") > || protocol.equalsIgnoreCase("https"))) { > // return null to use system default URLStreamHandler > null > } else { > handler > } > } > } > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25694) URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection
[ https://issues.apache.org/jira/browse/SPARK-25694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bo Yang updated SPARK-25694: Description: URL.setURLStreamHandlerFactory() in SharedState causes URL.openConnection() returns FsUrlConnection object, which is not compatible with HttpURLConnection. This will cause exception when using some third party http library (e.g. scalaj.http). The following code in Spark 2.3.0 introduced the issue: sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala: {code} object SharedState extends Logging { ... URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()) ... } {code} Example exception when using scalaj.http in Spark: StackTrace: scala.MatchError: org.apache.hadoop.fs.FsUrlConnection:[http://.example.com|http://.example.com/] (of class org.apache.hadoop.fs.FsUrlConnection) at scalaj.http.HttpRequest.scalaj$http$HttpRequest$$doConnection(Http.scala:343) at scalaj.http.HttpRequest.exec(Http.scala:335) at scalaj.http.HttpRequest.asString(Http.scala:455) One option to fix the issue is to return null in URLStreamHandlerFactory.createURLStreamHandler when the protocol is http/https, so it will use the default behavior and be compatible with scalaj.http. Following is the code example: {code} class SparkUrlStreamHandlerFactory extends URLStreamHandlerFactory with Logging { private val fsUrlStreamHandlerFactory = new FsUrlStreamHandlerFactory() override def createURLStreamHandler(protocol: String): URLStreamHandler = { val handler = fsUrlStreamHandlerFactory.createURLStreamHandler(protocol) if (handler == null) { return null } if (protocol != null && (protocol.equalsIgnoreCase("http") || protocol.equalsIgnoreCase("https"))) { // return null to use system default URLStreamHandler null } else { handler } } } {code} was: URL.setURLStreamHandlerFactory() in SharedState causes URL.openConnection() returns FsUrlConnection object, which is not compatible with HttpURLConnection. 
This will cause exception when using some third party http library (e.g. scalaj.http). The following code in Spark 2.3.0 introduced the issue: sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala {quote}object SharedState extends Logging Unknown macro: \{ ... URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()) ... }{quote} Example exception when using scalaj.http in Spark: StackTrace: scala.MatchError: org.apache.hadoop.fs.FsUrlConnection:[http://.example.com|http://.example.com/] (of class org.apache.hadoop.fs.FsUrlConnection) at scalaj.http.HttpRequest.scalaj$http$HttpRequest$$doConnection(Http.scala:343) at scalaj.http.HttpRequest.exec(Http.scala:335) at scalaj.http.HttpRequest.asString(Http.scala:455) One option to fix the issue is: {quote}class SparkUrlStreamHandlerFactory extends URLStreamHandlerFactory with Logging { private val fsUrlStreamHandlerFactory = new FsUrlStreamHandlerFactory() override def createURLStreamHandler(protocol: String): URLStreamHandler = { val handler = fsUrlStreamHandlerFactory.createURLStreamHandler(protocol) if (handler == null) { return null } if (protocol != null && (protocol.equalsIgnoreCase("http") || protocol.equalsIgnoreCase("https"))) { // return null to use system default URLStreamHandler logDebug("Use system default URLStreamHandler for " + protocol) null } else { handler } } }{quote} > URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection > - > > Key: SPARK-25694 > URL: https://issues.apache.org/jira/browse/SPARK-25694 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2 >Reporter: Bo Yang >Priority: Major > > URL.setURLStreamHandlerFactory() in SharedState causes URL.openConnection() > returns FsUrlConnection object, which is not compatible with > HttpURLConnection. This will cause exception when using some third party http > library (e.g. scalaj.http). 
> The following code in Spark 2.3.0 introduced the issue: > sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala: > {code} > object SharedState extends Logging { ... > URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()) ... > } > {code} > Example exception when using scalaj.http in Spark: > StackTrace: scala.MatchError: > org.apache.hadoop.fs.FsUrlConnection:[http://.example.com|http://.example.com/] > (of class org.apache.hadoop.fs.FsUrlConnection) > at > scalaj.http.HttpRequest.scalaj$http$HttpRequest$$doConnection(Http.scala:343) > at scalaj.http.HttpRequest.exec(Http.scala:335)
[jira] [Updated] (SPARK-25694) URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection issue
[ https://issues.apache.org/jira/browse/SPARK-25694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bo Yang updated SPARK-25694: Description: URL.setURLStreamHandlerFactory() in SharedState causes URL.openConnection() returns FsUrlConnection object, which is not compatible with HttpURLConnection. This will cause exception when using some third party http library (e.g. scalaj.http). The following code in Spark 2.3.0 introduced the issue: sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala: {code} object SharedState extends Logging { ... URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()) ... } {code} Here is the example exception when using scalaj.http in Spark: {code} StackTrace: scala.MatchError: org.apache.hadoop.fs.FsUrlConnection:[http://.example.com|http://.example.com/] (of class org.apache.hadoop.fs.FsUrlConnection) at scalaj.http.HttpRequest.scalaj$http$HttpRequest$$doConnection(Http.scala:343) at scalaj.http.HttpRequest.exec(Http.scala:335) at scalaj.http.HttpRequest.asString(Http.scala:455) {code} One option to fix the issue is to return null in URLStreamHandlerFactory.createURLStreamHandler when the protocol is http/https, so it will use the default behavior and be compatible with scalaj.http. 
Following is the code example: {code} class SparkUrlStreamHandlerFactory extends URLStreamHandlerFactory with Logging { private val fsUrlStreamHandlerFactory = new FsUrlStreamHandlerFactory() override def createURLStreamHandler(protocol: String): URLStreamHandler = { val handler = fsUrlStreamHandlerFactory.createURLStreamHandler(protocol) if (handler == null) { return null } if (protocol != null && (protocol.equalsIgnoreCase("http") || protocol.equalsIgnoreCase("https"))) { // return null to use system default URLStreamHandler null } else { handler } } } {code} was: URL.setURLStreamHandlerFactory() in SharedState causes URL.openConnection() returns FsUrlConnection object, which is not compatible with HttpURLConnection. This will cause exception when using some third party http library (e.g. scalaj.http). The following code in Spark 2.3.0 introduced the issue: sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala: {code} object SharedState extends Logging { ... URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()) ... } {code} Example exception when using scalaj.http in Spark: StackTrace: scala.MatchError: org.apache.hadoop.fs.FsUrlConnection:[http://.example.com|http://.example.com/] (of class org.apache.hadoop.fs.FsUrlConnection) at scalaj.http.HttpRequest.scalaj$http$HttpRequest$$doConnection(Http.scala:343) at scalaj.http.HttpRequest.exec(Http.scala:335) at scalaj.http.HttpRequest.asString(Http.scala:455) One option to fix the issue is to return null in URLStreamHandlerFactory.createURLStreamHandler when the protocol is http/https, so it will use the default behavior and be compatible with scalaj.http. 
Following is the code example: {code} class SparkUrlStreamHandlerFactory extends URLStreamHandlerFactory with Logging { private val fsUrlStreamHandlerFactory = new FsUrlStreamHandlerFactory() override def createURLStreamHandler(protocol: String): URLStreamHandler = { val handler = fsUrlStreamHandlerFactory.createURLStreamHandler(protocol) if (handler == null) { return null } if (protocol != null && (protocol.equalsIgnoreCase("http") || protocol.equalsIgnoreCase("https"))) { // return null to use system default URLStreamHandler null } else { handler } } } {code} > URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection issue > --- > > Key: SPARK-25694 > URL: https://issues.apache.org/jira/browse/SPARK-25694 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2 >Reporter: Bo Yang >Priority: Minor > > URL.setURLStreamHandlerFactory() in SharedState causes URL.openConnection() > returns FsUrlConnection object, which is not compatible with > HttpURLConnection. This will cause exception when using some third party http > library (e.g. scalaj.http). > The following code in Spark 2.3.0 introduced the issue: > sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala: > {code} > object SharedState extends Logging { ... > URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()) ... > } > {code} > Here is the example exception when using scalaj.http in Spark: > {code} > StackTrace: scala.MatchError: > org.apache.hadoop.fs.FsUrlConnection:[http://.example.com|http://.example.com/] > (of class
[jira] [Commented] (SPARK-25690) Analyzer rule "HandleNullInputsForUDF" does not stabilize and can be applied infinitely
[ https://issues.apache.org/jira/browse/SPARK-25690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16644273#comment-16644273 ] Xiao Li commented on SPARK-25690: - The changes made in https://issues.apache.org/jira/browse/SPARK-25044 broke the rule HandleNullInputsForUDF. It is not idempotent any more. Since AnalysisBarrier is removed in the 2.4 release, this is not a blocker based on my evaluation. However, we should still fix it. cc [~cloud_fan] [~rxin] > Analyzer rule "HandleNullInputsForUDF" does not stabilize and can be applied > infinitely > --- > > Key: SPARK-25690 > URL: https://issues.apache.org/jira/browse/SPARK-25690 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 2.4.0 >Reporter: Maryann Xue >Priority: Major > > This was fixed in SPARK-24891 and was then broken by SPARK-25044. > The tests added in SPARK-24891 were not good enough and the expected failures > were shadowed by SPARK-24865. For more details, please refer to SPARK-25650. > Code changes and tests in > [https://github.com/apache/spark/pull/22060/files#diff-f70523b948b7af21abddfa3ab7e1d7d6R72] > can help reproduce the issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
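For context, "idempotent" here means that applying the rule to its own output changes nothing further, so the analyzer reaches a fixed point. A generic sketch of such a check (the names are illustrative, not Spark's actual test harness):

```java
import java.util.function.UnaryOperator;

// Illustrative fixed-point check: a rewrite rule is idempotent on an input
// when a second application produces exactly the first application's result.
class RuleCheck {
    static <T> boolean isIdempotentOn(UnaryOperator<T> rule, T input) {
        T once = rule.apply(input);
        return once.equals(rule.apply(once));
    }
}
```

A rule that fails this check keeps rewriting the plan on every analyzer pass, which is what "does not stabilize and can be applied infinitely" refers to.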
[jira] [Commented] (SPARK-19256) Hive bucketing support
[ https://issues.apache.org/jira/browse/SPARK-19256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16643828#comment-16643828 ] Shay Elbaz commented on SPARK-19256: +1 [~tejasp] is this still in progress? > Hive bucketing support > -- > > Key: SPARK-19256 > URL: https://issues.apache.org/jira/browse/SPARK-19256 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.1.0 >Reporter: Tejas Patil >Priority: Minor > > JIRA to track design discussions and tasks related to Hive bucketing support > in Spark. > Proposal : > https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing
[jira] [Updated] (SPARK-25688) org.apache.spark.sql.FileBasedDataSourceSuite never pass
[ https://issues.apache.org/jira/browse/SPARK-25688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25688: Priority: Blocker (was: Major) > org.apache.spark.sql.FileBasedDataSourceSuite never pass > > > Key: SPARK-25688 > URL: https://issues.apache.org/jira/browse/SPARK-25688 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Xiao Li >Assignee: Dongjoon Hyun >Priority: Blocker > > http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.FileBasedDataSourceSuite_name=%28It+is+not+a+test+it+is+a+sbt.testing.SuiteSelector%29 > > All the test failures are caused by ORC internals. If we are still unable to > find the root cause in ORC, I would suggest removing these ORC test cases. > {code} > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 15 times over 10.019369471 > seconds. Last failure message: There are 1 possibly leaked file streams.. > sbt.ForkMain$ForkError: > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 15 times over 10.019369471 > seconds. Last failure message: There are 1 possibly leaked file streams.. 
> at > org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421) > at > org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439) > at > org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:37) > at > org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:308) > at > org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:37) > at > org.apache.spark.sql.test.SharedSparkSession$class.afterEach(SharedSparkSession.scala:132) > at > org.apache.spark.sql.FileBasedDataSourceSuite.afterEach(FileBasedDataSourceSuite.scala:37) > at > org.scalatest.BeforeAndAfterEach$$anonfun$1.apply$mcV$sp(BeforeAndAfterEach.scala:234) > at > org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:379) > at > org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:375) > at org.scalatest.SucceededStatus$.whenCompleted(Status.scala:454) > at org.scalatest.Status$class.withAfterEffect(Status.scala:375) > at org.scalatest.SucceededStatus$.withAfterEffect(Status.scala:426) > at > org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:232) > at > org.apache.spark.sql.FileBasedDataSourceSuite.runTest(FileBasedDataSourceSuite.scala:37) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384) > at scala.collection.immutable.List.foreach(List.scala:392) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229) > at 
org.scalatest.FunSuite.runTests(FunSuite.scala:1560) > at org.scalatest.Suite$class.run(Suite.scala:1147) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) > at org.scalatest.SuperEngine.runImpl(Engine.scala:521) > at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233) > at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:52) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52) > at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480) > at sbt.ForkMain$Run$2.call(ForkMain.java:296) > at sbt.ForkMain$Run$2.call(ForkMain.java:286) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at >
[jira] [Updated] (SPARK-25688) Potential resource leak in ORC
[ https://issues.apache.org/jira/browse/SPARK-25688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25688: Summary: Potential resource leak in ORC (was: org.apache.spark.sql.FileBasedDataSourceSuite never pass) > Potential resource leak in ORC > -- > > Key: SPARK-25688 > URL: https://issues.apache.org/jira/browse/SPARK-25688 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Xiao Li >Assignee: Dongjoon Hyun >Priority: Critical > > http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.FileBasedDataSourceSuite_name=%28It+is+not+a+test+it+is+a+sbt.testing.SuiteSelector%29 > > All the test failures are caused by ORC internals. > {code} > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 15 times over 10.019369471 > seconds. Last failure message: There are 1 possibly leaked file streams.. > sbt.ForkMain$ForkError: > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 15 times over 10.019369471 > seconds. Last failure message: There are 1 possibly leaked file streams.. 
[jira] [Commented] (SPARK-25688) Potential resource leak in ORC
[ https://issues.apache.org/jira/browse/SPARK-25688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16643926#comment-16643926 ] Dongjoon Hyun commented on SPARK-25688: --- [~smilegator], this is a duplicate of SPARK-23390; please see that issue. We also observed `parquet` failures. The reason we see ORC errors more frequently is that ORC is tested first, which hides similar Parquet failures. {code:java} private val allFileBasedDataSources = Seq("orc", "parquet", "csv", "json", "text"){code} > Potential resource leak in ORC > -- > > Key: SPARK-25688 > URL: https://issues.apache.org/jira/browse/SPARK-25688 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Xiao Li >Assignee: Dongjoon Hyun >Priority: Critical > > http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.FileBasedDataSourceSuite_name=%28It+is+not+a+test+it+is+a+sbt.testing.SuiteSelector%29 > > All the test failures are caused by ORC internals. > {code} > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 15 times over 10.019369471 > seconds. Last failure message: There are 1 possibly leaked file streams.. > sbt.ForkMain$ForkError: > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 15 times over 10.019369471 > seconds. Last failure message: There are 1 possibly leaked file streams.. 
[jira] [Created] (SPARK-25689) Move token renewal logic to driver in yarn-client mode
Marcelo Vanzin created SPARK-25689: -- Summary: Move token renewal logic to driver in yarn-client mode Key: SPARK-25689 URL: https://issues.apache.org/jira/browse/SPARK-25689 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 2.4.0 Reporter: Marcelo Vanzin Currently, both in yarn-cluster and yarn-client mode, the YARN AM is responsible for renewing delegation tokens. That differs from other RMs (Mesos and later k8s when it supports this functionality), and is one of the roadblocks towards fully sharing the same delegation token-related code. We should look at keeping the renewal logic within the driver in yarn-client mode. That would also remove the need to distribute the user's keytab to the AM when running in that particular mode.
[jira] [Resolved] (SPARK-25688) Potential resource leak in ORC
[ https://issues.apache.org/jira/browse/SPARK-25688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-25688. --- Resolution: Duplicate Sorry, but let's gather the same error reports into one JIRA. > Potential resource leak in ORC > -- > > Key: SPARK-25688 > URL: https://issues.apache.org/jira/browse/SPARK-25688 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Xiao Li >Assignee: Dongjoon Hyun >Priority: Critical > > http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.FileBasedDataSourceSuite_name=%28It+is+not+a+test+it+is+a+sbt.testing.SuiteSelector%29 > > All the test failures are caused by ORC internals. > {code} > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 15 times over 10.019369471 > seconds. Last failure message: There are 1 possibly leaked file streams.. > sbt.ForkMain$ForkError: > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 15 times over 10.019369471 > seconds. Last failure message: There are 1 possibly leaked file streams.. 
[jira] [Updated] (SPARK-23390) Flaky test: FileBasedDataSourceSuite
[ https://issues.apache.org/jira/browse/SPARK-23390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23390: -- Summary: Flaky test: FileBasedDataSourceSuite (was: Flaky test: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7) > Flaky test: FileBasedDataSourceSuite > > > Key: SPARK-23390 > URL: https://issues.apache.org/jira/browse/SPARK-23390 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Sameer Agarwal >Assignee: Wenchen Fan >Priority: Major > > We're seeing multiple failures in {{FileBasedDataSourceSuite}} in > {{spark-branch-2.3-test-sbt-hadoop-2.7}}: > {code:java} > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 15 times over > 10.01215805999 seconds. Last failure message: There are 1 possibly leaked > file streams.. > {code} > Here's the full history: > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/] > From a very quick look, these failures seem to be correlated with > [https://github.com/apache/spark/pull/20479] (cc [~dongjoon]) as evident from > the following stack trace (full logs > [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]): > {code:java} > [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds) > 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in > stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled) > 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem > connection created at: > java.lang.Throwable > at > org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36) > at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70) > at 
org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769) > at > org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173) > at > org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:254) > at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138) > {code} > Also, while this might be just a false correlation, the frequency of these > test failures has increased considerably in > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/] > after [https://github.com/apache/spark/pull/20562] (cc > [~feng...@databricks.com]) was merged. > The following is a Parquet leak: > {code:java} > Caused by: sbt.ForkMain$ForkError: java.lang.Throwable: null > at > org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36) > at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769) > at > org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:538) > at > org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:149) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:400) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:125) > at > 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106) > {code} > - > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/322/] > (May 3rd) > - > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/331/] > (May 9th) > - [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90536] > (May 11th) > - > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/342/] > (May 16th) > - >
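The "possibly leaked file streams" message above comes from Spark's DebugFilesystem, which records a Throwable when each stream is opened and forgets it on close, so leaks can be reported with the offending call site. A stripped-down sketch of that bookkeeping idea (class and method names are illustrative, not Spark's actual implementation):

```java
import java.io.Closeable;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative open-stream tracker: remember where each stream was opened so
// that anything still registered after a test is a "possibly leaked" stream.
class LeakTracker {
    private static final Map<Closeable, Throwable> OPEN = new ConcurrentHashMap<>();

    static <T extends Closeable> T track(T stream) {
        OPEN.put(stream, new Throwable("opened here")); // captures the open call site
        return stream;
    }

    static void untrack(Closeable stream) {
        OPEN.remove(stream);
    }

    static int possiblyLeaked() {
        return OPEN.size();
    }
}
```

In Spark's tests, a nonzero count after a test run is what produces the "There are N possibly leaked file streams" failure, and the recorded Throwable supplies the "Leaked filesystem connection created at:" stack trace.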
[jira] [Commented] (SPARK-25493) CRLF Line Separators don't work in multiline CSVs
[ https://issues.apache.org/jira/browse/SPARK-25493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16643805#comment-16643805 ] Apache Spark commented on SPARK-25493: -- User 'justinuang' has created a pull request for this issue: https://github.com/apache/spark/pull/22680 > CRLF Line Separators don't work in multiline CSVs > - > > Key: SPARK-25493 > URL: https://issues.apache.org/jira/browse/SPARK-25493 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Justin Uang >Priority: Major > > CSVs with Windows-style CRLF (carriage return line feed) line separators don't > work in multiline mode. They work fine in single-line mode because the line > separation is done by Hadoop, which can handle all the different types of > line separators. In multiline mode, the Univocity parser is used to also > handle splitting of records.
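The failure mode is easy to reproduce outside Spark: a record splitter that only understands '\n' leaves a stray '\r' on every CRLF-terminated record, while one that accepts the full separator set does not. A minimal illustration (these helpers are hypothetical, not univocity's or Hadoop's actual code):

```java
// Illustrative only: contrast a '\n'-only splitter with one that accepts
// CRLF ("\r\n"), bare CR, and bare LF, as robust record splitting must.
class LineSep {
    static String[] naiveSplit(String text) {
        return text.split("\n");         // leaves '\r' attached on CRLF input
    }

    static String[] robustSplit(String text) {
        return text.split("\r\n|\r|\n"); // order matters: match "\r\n" first
    }
}
```

Hadoop's line reader behaves like robustSplit, which is why single-line mode works; the bug report is about multiline mode, where the parser doing the splitting did not.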
[jira] [Created] (SPARK-25688) org.apache.spark.sql.FileBasedDataSourceSuite never pass
Xiao Li created SPARK-25688: --- Summary: org.apache.spark.sql.FileBasedDataSourceSuite never pass Key: SPARK-25688 URL: https://issues.apache.org/jira/browse/SPARK-25688 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Xiao Li Assignee: Dongjoon Hyun http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.FileBasedDataSourceSuite_name=%28It+is+not+a+test+it+is+a+sbt.testing.SuiteSelector%29 All the test failures are caused by ORC internals. If we are still unable to find the root cause in ORC, I would suggest removing these ORC test cases. {code} org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to eventually never returned normally. Attempted 15 times over 10.019369471 seconds. Last failure message: There are 1 possibly leaked file streams.. sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to eventually never returned normally. Attempted 15 times over 10.019369471 seconds. Last failure message: There are 1 possibly leaked file streams.. 
at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439) at org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:37) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:308) at org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:37) at org.apache.spark.sql.test.SharedSparkSession$class.afterEach(SharedSparkSession.scala:132) at org.apache.spark.sql.FileBasedDataSourceSuite.afterEach(FileBasedDataSourceSuite.scala:37) at org.scalatest.BeforeAndAfterEach$$anonfun$1.apply$mcV$sp(BeforeAndAfterEach.scala:234) at org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:379) at org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:375) at org.scalatest.SucceededStatus$.whenCompleted(Status.scala:454) at org.scalatest.Status$class.withAfterEffect(Status.scala:375) at org.scalatest.SucceededStatus$.withAfterEffect(Status.scala:426) at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:232) at org.apache.spark.sql.FileBasedDataSourceSuite.runTest(FileBasedDataSourceSuite.scala:37) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384) at scala.collection.immutable.List.foreach(List.scala:392) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229) at org.scalatest.FunSuite.runTests(FunSuite.scala:1560) at 
org.scalatest.Suite$class.run(Suite.scala:1147) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) at org.scalatest.SuperEngine.runImpl(Engine.scala:521) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233) at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:52) at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213) at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52) at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314) at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480) at sbt.ForkMain$Run$2.call(ForkMain.java:296) at sbt.ForkMain$Run$2.call(ForkMain.java:286) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: sbt.ForkMain$ForkError: java.lang.IllegalStateException: There are 1 possibly leaked file streams. at
[jira] [Updated] (SPARK-25688) org.apache.spark.sql.FileBasedDataSourceSuite never pass
[ https://issues.apache.org/jira/browse/SPARK-25688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25688: Priority: Critical (was: Blocker) > org.apache.spark.sql.FileBasedDataSourceSuite never pass > > > Key: SPARK-25688 > URL: https://issues.apache.org/jira/browse/SPARK-25688 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Xiao Li >Assignee: Dongjoon Hyun >Priority: Critical > > http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.FileBasedDataSourceSuite_name=%28It+is+not+a+test+it+is+a+sbt.testing.SuiteSelector%29 > > All of the test failures are caused by ORC internals. If we are still unable to > find the root cause in ORC, I would suggest removing these ORC test cases. > {code} > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 15 times over 10.019369471 > seconds. Last failure message: There are 1 possibly leaked file streams.. > sbt.ForkMain$ForkError: > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 15 times over 10.019369471 > seconds. Last failure message: There are 1 possibly leaked file streams.. 
[jira] [Commented] (SPARK-25688) Potential resource leak in ORC
[ https://issues.apache.org/jira/browse/SPARK-25688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16643870#comment-16643870 ] Xiao Li commented on SPARK-25688: - It sounds like ORC still has a resource leak even after the latest version upgrade. > Potential resource leak in ORC > -- > > Key: SPARK-25688 > URL: https://issues.apache.org/jira/browse/SPARK-25688 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Xiao Li >Assignee: Dongjoon Hyun >Priority: Critical > > http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.FileBasedDataSourceSuite_name=%28It+is+not+a+test+it+is+a+sbt.testing.SuiteSelector%29 > > All of the test failures are caused by ORC internals. > {code} > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 15 times over 10.019369471 > seconds. Last failure message: There are 1 possibly leaked file streams.. > sbt.ForkMain$ForkError: > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 15 times over 10.019369471 > seconds. Last failure message: There are 1 possibly leaked file streams.. 
[jira] [Commented] (SPARK-24523) InterruptedException when closing SparkContext
[ https://issues.apache.org/jira/browse/SPARK-24523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16643936#comment-16643936 ] Ankur Gupta commented on SPARK-24523: - Hi [~umayr_nuna] - just checking if you had a chance to try the above configurations. If you did, then did it help? > InterruptedException when closing SparkContext > -- > > Key: SPARK-24523 > URL: https://issues.apache.org/jira/browse/SPARK-24523 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.3.0, 2.3.1 > Environment: EMR 5.14.0, S3/HDFS inputs and outputs; EMR 5.17 > > > >Reporter: Umayr Hassan >Priority: Major > Attachments: spark-stop-jstack.log.1, spark-stop-jstack.log.2, > spark-stop-jstack.log.3 > > > I'm running a Scala application in EMR with the following properties: > {{--master yarn --deploy-mode cluster --driver-memory 13g --executor-memory > 30g --executor-cores 5 --conf spark.default.parallelism=400 --conf > spark.dynamicAllocation.enabled=true --conf > spark.dynamicAllocation.maxExecutors=20 --conf > spark.eventLog.dir=hdfs:///var/log/spark/apps --conf > spark.eventLog.enabled=true --conf > spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 --conf > spark.scheduler.listenerbus.eventqueue.capacity=2 --conf > spark.shuffle.service.enabled=true --conf spark.sql.shuffle.partitions=400 > --conf spark.yarn.maxAppAttempts=1}} > The application runs fine till SparkContext is (automatically) closed, at > which point the SparkContext object throws. 
> {{18/06/10 10:44:43 ERROR Utils: Uncaught exception in thread pool-4-thread-1 > java.lang.InterruptedException at java.lang.Object.wait(Native Method) at > java.lang.Thread.join(Thread.java:1252) at > java.lang.Thread.join(Thread.java:1326) at > org.apache.spark.scheduler.AsyncEventQueue.stop(AsyncEventQueue.scala:133) at > org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219) > at > org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) at > scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at > scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at > scala.collection.AbstractIterable.foreach(Iterable.scala:54) at > org.apache.spark.scheduler.LiveListenerBus.stop(LiveListenerBus.scala:219) at > org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1915) > at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357) at > org.apache.spark.SparkContext.stop(SparkContext.scala:1914) at > org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:572) > at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988) at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) > at > 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) > at scala.util.Try$.apply(Try.scala:192) at > org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748)}} > > I've not seen this behavior in Spark 2.0.2 and Spark 2.2.0 (for the same > application), so I'm not sure which change is causing Spark 2.3 to throw. Any > ideas? > best, > Umayr -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
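The trace above shows Thread.join inside AsyncEventQueue.stop receiving an interrupt while a shutdown hook is stopping the listener bus. The failure mode itself is easy to reproduce in isolation; this standalone sketch (not Spark code) shows that a thread blocked in join() surfaces an interrupt as InterruptedException:

```java
// Minimal reproduction of the reported failure mode: a thread blocked in
// Thread.join() is interrupted (here by a helper thread; in the Spark log,
// by the shutdown machinery) and join() throws InterruptedException.
class JoinInterruptDemo {
    static boolean demo() {
        // Stands in for the event-queue dispatcher thread being joined.
        Thread dispatcher = new Thread(() -> {
            try { Thread.sleep(60_000); } catch (InterruptedException ignored) { }
        });
        dispatcher.start();

        // Interrupt the joining (current) thread shortly after it blocks.
        final Thread joiner = Thread.currentThread();
        new Thread(() -> {
            try { Thread.sleep(100); } catch (InterruptedException ignored) { }
            joiner.interrupt();
        }).start();

        try {
            dispatcher.join();      // blocks here; the interrupt lands in join()
            return false;
        } catch (InterruptedException e) {
            dispatcher.interrupt(); // unblock the worker so the JVM can exit
            return true;            // join() threw, matching the reported log
        }
    }
}
```

Whether such an exception during stop() is harmful or merely noisy depends on whether the caller still needs the joined thread's work to complete, which is the crux of the configuration discussion above.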
[jira] [Updated] (SPARK-25688) org.apache.spark.sql.FileBasedDataSourceSuite never pass
[ https://issues.apache.org/jira/browse/SPARK-25688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25688: Description: http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.FileBasedDataSourceSuite_name=%28It+is+not+a+test+it+is+a+sbt.testing.SuiteSelector%29 All of the test failures are caused by ORC internals. {code} org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to eventually never returned normally. Attempted 15 times over 10.019369471 seconds. Last failure message: There are 1 possibly leaked file streams.. sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to eventually never returned normally. Attempted 15 times over 10.019369471 seconds. Last failure message: There are 1 possibly leaked file streams.. at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439) at org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:37) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:308) at org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:37) at org.apache.spark.sql.test.SharedSparkSession$class.afterEach(SharedSparkSession.scala:132) at org.apache.spark.sql.FileBasedDataSourceSuite.afterEach(FileBasedDataSourceSuite.scala:37) at org.scalatest.BeforeAndAfterEach$$anonfun$1.apply$mcV$sp(BeforeAndAfterEach.scala:234) at org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:379) at org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:375) at org.scalatest.SucceededStatus$.whenCompleted(Status.scala:454) at org.scalatest.Status$class.withAfterEffect(Status.scala:375) at org.scalatest.SucceededStatus$.withAfterEffect(Status.scala:426) at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:232) at 
org.apache.spark.sql.FileBasedDataSourceSuite.runTest(FileBasedDataSourceSuite.scala:37) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384) at scala.collection.immutable.List.foreach(List.scala:392) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229) at org.scalatest.FunSuite.runTests(FunSuite.scala:1560) at org.scalatest.Suite$class.run(Suite.scala:1147) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) at org.scalatest.SuperEngine.runImpl(Engine.scala:521) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233) at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:52) at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213) at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52) at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314) at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480) at sbt.ForkMain$Run$2.call(ForkMain.java:296) at sbt.ForkMain$Run$2.call(ForkMain.java:286) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: sbt.ForkMain$ForkError: java.lang.IllegalStateException: There are 1 possibly leaked file streams. at org.apache.spark.DebugFilesystem$.assertNoOpenStreams(DebugFilesystem.scala:54) at org.apache.spark.sql.test.SharedSparkSession$$anonfun$afterEach$1.apply$mcV$sp(SharedSparkSession.scala:133) at org.apache.spark.sql.test.SharedSparkSession$$anonfun$afterEach$1.apply(SharedSparkSession.scala:133) at
[jira] [Commented] (SPARK-25688) Potential resource leak in ORC
[ https://issues.apache.org/jira/browse/SPARK-25688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16643910#comment-16643910 ] Xiao Li commented on SPARK-25688: - https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/5015/consoleText Is this from `Enabling/disabling ignoreMissingFiles using orc`? > Potential resource leak in ORC > -- > > Key: SPARK-25688 > URL: https://issues.apache.org/jira/browse/SPARK-25688 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Xiao Li >Assignee: Dongjoon Hyun >Priority: Critical > > http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.FileBasedDataSourceSuite_name=%28It+is+not+a+test+it+is+a+sbt.testing.SuiteSelector%29 > > All of the test failures are caused by ORC internals. > {code} > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 15 times over 10.019369471 > seconds. Last failure message: There are 1 possibly leaked file streams.. > sbt.ForkMain$ForkError: > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 15 times over 10.019369471 > seconds. Last failure message: There are 1 possibly leaked file streams.. 
[jira] [Updated] (SPARK-23390) Flaky test: FileBasedDataSourceSuite
[ https://issues.apache.org/jira/browse/SPARK-23390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23390: -- Priority: Critical (was: Major) > Flaky test: FileBasedDataSourceSuite > > > Key: SPARK-23390 > URL: https://issues.apache.org/jira/browse/SPARK-23390 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Sameer Agarwal >Assignee: Wenchen Fan >Priority: Critical > > We're seeing multiple failures in {{FileBasedDataSourceSuite}} in > {{spark-branch-2.3-test-sbt-hadoop-2.7}}: > {code:java} > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 15 times over > 10.01215805999 seconds. Last failure message: There are 1 possibly leaked > file streams.. > {code} > Here's the full history: > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/] > From a very quick look, these failures seem to be correlated with > [https://github.com/apache/spark/pull/20479] (cc [~dongjoon]) as evident from > the following stack trace (full logs > [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]): > {code:java} > [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds) > 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in > stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled) > 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem > connection created at: > java.lang.Throwable > at > org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36) > at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769) > at > 
org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173) > at > org.apache.orc.impl.RecordReaderImpl.&lt;init&gt;(RecordReaderImpl.java:254) > at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138) > {code} > Also, while this might be just a false correlation, the frequency of these > test failures has increased considerably in > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/] > after [https://github.com/apache/spark/pull/20562] (cc > [~feng...@databricks.com]) was merged. > The following is Parquet leakage. > {code:java} > Caused by: sbt.ForkMain$ForkError: java.lang.Throwable: null > at > org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36) > at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769) > at > org.apache.parquet.hadoop.ParquetFileReader.&lt;init&gt;(ParquetFileReader.java:538) > at > org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:149) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:400) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:125) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179) > at > 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106) > {code} > - > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/322/] > (May 3rd) > - > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/331/] > (May 9th) > - [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90536] > (May 11th) > - > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/342/] > (May 16th) > - >
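Both the ORC and Parquet leaks quoted above follow the same shape: a reader opens a file stream, and nothing closes it when the consuming task stops early (note the TaskKilled (Stage cancelled) line in the log). A hedged sketch of the general remedy, tying the stream's lifetime to a try-with-resources scope so cancellation still closes it; the Reader type here is hypothetical, not the actual ORC or Parquet API:

```java
import java.io.Closeable;

// General fix pattern for the leaks above: scope the resource with
// try-with-resources (or a task-completion callback) so it is closed
// even when the consumer aborts mid-scan.
class LeakFreeScan {
    // Hypothetical stand-in for an ORC/Parquet record reader.
    static class Reader implements Closeable {
        boolean closed = false;
        int remaining = 3;
        int next() {
            if (remaining == 0) throw new IllegalStateException("exhausted");
            return remaining--;
        }
        @Override public void close() { closed = true; }
    }

    // Returns whether the reader got closed even though the scan aborted.
    static boolean scanAborted() {
        Reader r = new Reader();
        try (Reader reader = r) {
            reader.next();
            throw new RuntimeException("task killed"); // simulate cancellation
        } catch (RuntimeException e) {
            // try-with-resources already ran close() before this catch block
            return r.closed;
        }
    }
}
```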
[jira] [Commented] (SPARK-23390) Flaky test: FileBasedDataSourceSuite
[ https://issues.apache.org/jira/browse/SPARK-23390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16644036#comment-16644036 ] Shixiong Zhu commented on SPARK-23390: -- I didn't look at parquet. It may have a similar issue. > Flaky test: FileBasedDataSourceSuite > > > Key: SPARK-23390 > URL: https://issues.apache.org/jira/browse/SPARK-23390 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.0 >Reporter: Sameer Agarwal >Assignee: Wenchen Fan >Priority: Critical > > *RECENT HISTORY* > [http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.FileBasedDataSourceSuite_name=%28It+is+not+a+test+it+is+a+sbt.testing.SuiteSelector%29] > > > We're seeing multiple failures in {{FileBasedDataSourceSuite}} in > {{spark-branch-2.3-test-sbt-hadoop-2.7}}: > {code:java} > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 15 times over > 10.01215805999 seconds. Last failure message: There are 1 possibly leaked > file streams.. 
> {code}
[jira] [Commented] (SPARK-23390) Flaky test: FileBasedDataSourceSuite
[ https://issues.apache.org/jira/browse/SPARK-23390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16643960#comment-16643960 ] Dongjoon Hyun commented on SPARK-23390: --- As reported in SPARK-25688 by [~smilegator], all failures reported in the last 6 months occur in ORC data sources. I'll reinvestigate this further. > Flaky test: FileBasedDataSourceSuite > > > Key: SPARK-23390 > URL: https://issues.apache.org/jira/browse/SPARK-23390 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.0 >Reporter: Sameer Agarwal >Assignee: Wenchen Fan >Priority: Critical > > We're seeing multiple failures in {{FileBasedDataSourceSuite}} in > {{spark-branch-2.3-test-sbt-hadoop-2.7}}: > {code:java} > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 15 times over > 10.01215805999 seconds. Last failure message: There are 1 possibly leaked > file streams.. > {code} > Here's the full history: > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/] > From a very quick look, these failures seem to be correlated with > [https://github.com/apache/spark/pull/20479] (cc [~dongjoon]) as evident from > the following stack trace (full logs > [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]): > {code:java} > [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds) > 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in > stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled) > 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem > connection created at: > java.lang.Throwable > at > org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36) > at 
org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769) > at > org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173) > at > org.apache.orc.impl.RecordReaderImpl.(RecordReaderImpl.java:254) > at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138) > {code} > Also, while this might be just a false correlation but the frequency of these > test failures have increased considerably in > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/] > after [https://github.com/apache/spark/pull/20562] (cc > [~feng...@databricks.com]) was merged. > The following is Parquet leakage. > {code:java} > Caused by: sbt.ForkMain$ForkError: java.lang.Throwable: null > at > org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36) > at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769) > at > org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:538) > at > org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:149) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:400) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:125) 
> at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106) > {code} > - > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/322/] > (May 3rd) > - > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/331/] > (May 9th) > - [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90536] > (May 11th) > - >
[jira] [Created] (SPARK-25690) Analyzer rule "HandleNullInputsForUDF" does not stabilize and can be applied infinitely
Maryann Xue created SPARK-25690: --- Summary: Analyzer rule "HandleNullInputsForUDF" does not stabilize and can be applied infinitely Key: SPARK-25690 URL: https://issues.apache.org/jira/browse/SPARK-25690 Project: Spark Issue Type: Sub-task Components: Spark Core, SQL Affects Versions: 2.4.0 Reporter: Maryann Xue Assignee: Sean Owen Fix For: 2.4.0 A few SQL-related tests fail in Scala 2.12, such as UDFSuite's "SPARK-24891 Fix HandleNullInputsForUDF rule": {code:java} - SPARK-24891 Fix HandleNullInputsForUDF rule *** FAILED *** Results do not match for query: ... == Results == == Results == !== Correct Answer - 3 == == Spark Answer - 3 == !struct<> struct ![0,10,null] [0,10,0] ![1,12,null] [1,12,1] ![2,14,null] [2,14,2] (QueryTest.scala:163){code} You can kind of get what's going on reading the test: {code:java} test("SPARK-24891 Fix HandleNullInputsForUDF rule") { // assume(!ClosureCleanerSuite2.supportsLMFs) // This test won't test what it intends to in 2.12, as lambda metafactory closures // have arg types that are not primitive, but Object val udf1 = udf({(x: Int, y: Int) => x + y}) val df = spark.range(0, 3).toDF("a") .withColumn("b", udf1($"a", udf1($"a", lit(10)))) .withColumn("c", udf1($"a", lit(null))) val plan = spark.sessionState.executePlan(df.logicalPlan).analyzed comparePlans(df.logicalPlan, plan) checkAnswer( df, Seq( Row(0, 10, null), Row(1, 12, null), Row(2, 14, null))) }{code} It seems that the closure that is fed in as a UDF changes behavior, in a way that primitive-type arguments are handled differently. For example, an Int argument, when fed 'null', acts like 0. I'm sure it's a difference in the LMF closure and how its types are understood, but not exactly sure of the cause yet. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
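A side note on the behavior described above: a 2.12 lambda-metafactory closure exposes Object-typed parameters, so a null can reach the function body, whereas a true primitive signature can never observe null. The difference can be illustrated with plain Java reflection; the class and method names below are hypothetical, not Spark code:

```java
import java.lang.reflect.Method;

public class PrimitiveVsBoxed {
    // A "pre-2.12-style" signature: primitive args can never observe null.
    public static int addPrimitive(int x, int y) { return x + y; }

    // A "2.12 LMF-style" erased signature: null can flow in and must be handled.
    public static Object addBoxed(Object x, Object y) {
        if (x == null || y == null) return null;  // explicit null handling
        return (Integer) x + (Integer) y;
    }

    public static void main(String[] args) throws Exception {
        Method boxed = PrimitiveVsBoxed.class.getMethod("addBoxed", Object.class, Object.class);
        // null flows straight through the Object-typed signature:
        System.out.println(boxed.invoke(null, null, 10));  // prints "null"

        Method prim = PrimitiveVsBoxed.class.getMethod("addPrimitive", int.class, int.class);
        try {
            prim.invoke(null, null, 10);  // null cannot be unboxed into an int slot
        } catch (IllegalArgumentException e) {
            System.out.println("primitive signature rejects null: " + e.getMessage());
        }
    }
}
```

This only shows why the analyzer can no longer rely on inspecting primitive parameter types to decide where null checks are needed; the "null acts like 0" result itself comes from Spark's codegen, not from plain Java unboxing.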
[jira] [Updated] (SPARK-25690) Analyzer rule "HandleNullInputsForUDF" does not stabilize and can be applied infinitely
[ https://issues.apache.org/jira/browse/SPARK-25690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maryann Xue updated SPARK-25690: Description: This was fixed in SPARK-24891 and was then broken by SPARK-25044. The tests added in SPARK-24891 were not good enough and the expected failures were shadowed by SPARK-24865. For more details, please refer to SPARK-25650. Code changes and tests in [https://github.com/apache/spark/pull/22060/files#diff-f70523b948b7af21abddfa3ab7e1d7d6R72] can help reproduce the issue. was: A few SQL-related tests fail in Scala 2.12, such as UDFSuite's "SPARK-24891 Fix HandleNullInputsForUDF rule": {code:java} - SPARK-24891 Fix HandleNullInputsForUDF rule *** FAILED *** Results do not match for query: ... == Results == == Results == !== Correct Answer - 3 == == Spark Answer - 3 == !struct<> struct ![0,10,null] [0,10,0] ![1,12,null] [1,12,1] ![2,14,null] [2,14,2] (QueryTest.scala:163){code} You can kind of get what's going on reading the test: {code:java} test("SPARK-24891 Fix HandleNullInputsForUDF rule") { // assume(!ClosureCleanerSuite2.supportsLMFs) // This test won't test what it intends to in 2.12, as lambda metafactory closures // have arg types that are not primitive, but Object val udf1 = udf({(x: Int, y: Int) => x + y}) val df = spark.range(0, 3).toDF("a") .withColumn("b", udf1($"a", udf1($"a", lit(10 .withColumn("c", udf1($"a", lit(null))) val plan = spark.sessionState.executePlan(df.logicalPlan).analyzed comparePlans(df.logicalPlan, plan) checkAnswer( df, Seq( Row(0, 10, null), Row(1, 12, null), Row(2, 14, null))) }{code} It seems that the closure that is fed in as a UDF changes behavior, in a way that primitive-type arguments are handled differently. For example an Int argument, when fed 'null', acts like 0. I'm sure it's a difference in the LMF closure and how its types are understood, but not exactly sure of the cause yet. 
> Analyzer rule "HandleNullInputsForUDF" does not stabilize and can be applied > infinitely > --- > > Key: SPARK-25690 > URL: https://issues.apache.org/jira/browse/SPARK-25690 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 2.4.0 >Reporter: Maryann Xue >Assignee: Sean Owen >Priority: Major > Fix For: 2.4.0 > > > This was fixed in SPARK-24891 and was then broken by SPARK-25044. > The tests added in SPARK-24891 were not good enough and the expected failures > were shadowed by SPARK-24865. For more details, please refer to SPARK-25650. > Code changes and tests in > [https://github.com/apache/spark/pull/22060/files#diff-f70523b948b7af21abddfa3ab7e1d7d6R72] > can help reproduce the issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25691) Analyzer rule "AliasViewChild" does not stabilize
Maryann Xue created SPARK-25691: --- Summary: Analyzer rule "AliasViewChild" does not stabilize Key: SPARK-25691 URL: https://issues.apache.org/jira/browse/SPARK-25691 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.0 Reporter: Maryann Xue To reproduce the issue: https://github.com/apache/spark/pull/22060/files#diff-f70523b948b7af21abddfa3ab7e1d7d6R73. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25650) Make analyzer rules used in once-policy idempotent
[ https://issues.apache.org/jira/browse/SPARK-25650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maryann Xue updated SPARK-25650: Description: Rules like {{HandleNullInputsForUDF}} (https://issues.apache.org/jira/browse/SPARK-24891) do not stabilize (can apply new changes to a plan indefinitely) and can cause problems like SQL cache mismatching. Ideally, all rules whether in a once-policy batch or a fixed-point-policy batch should stabilize after the number of runs specified. Once-policy should be considered a performance improvement, an assumption that the rule can stabilize after just one run rather than an assumption that the rule won't be applied more than once. Those once-policy rules should be able to run fine with fixed-point policy rule as well. Currently we already have a check for fixed-point that throws an exception if the maximum number of runs is reached and the plan is still changing. Here, in this PR, a similar check is added for once-policy that throws an exception if the plan changes between the first run and the second run of a once-policy rule. To reproduce this issue, go to [https://github.com/apache/spark/pull/22060], apply the changes and remove the specific rule from the whitelist https://github.com/apache/spark/pull/22060/files#diff-f70523b948b7af21abddfa3ab7e1d7d6R71. was: Rules like {{HandleNullInputsForUDF}} (https://issues.apache.org/jira/browse/SPARK-24891) do not stabilize (can apply new changes to a plan indefinitely) and can cause problems like SQL cache mismatching. Ideally, all rules whether in a once-policy batch or a fixed-point-policy batch should stabilize after the number of runs specified. Once-policy should be considered a performance improvement, an assumption that the rule can stabilize after just one run rather than an assumption that the rule won't be applied more than once. Those once-policy rules should be able to run fine with fixed-point policy rule as well. 
Currently we already have a check for fixed-point that throws an exception if the maximum number of runs is reached and the plan is still changing. Here, in this PR, a similar check is added for once-policy that throws an exception if the plan changes between the first run and the second run of a once-policy rule. To reproduce this issue, go to [https://github.com/apache/spark/pull/22060] and apply the changes. > Make analyzer rules used in once-policy idempotent > -- > > Key: SPARK-25650 > URL: https://issues.apache.org/jira/browse/SPARK-25650 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 2.3.2 >Reporter: Maryann Xue >Priority: Major > > Rules like {{HandleNullInputsForUDF}} > (https://issues.apache.org/jira/browse/SPARK-24891) do not stabilize (can > apply new changes to a plan indefinitely) and can cause problems like SQL > cache mismatching. > Ideally, all rules whether in a once-policy batch or a fixed-point-policy > batch should stabilize after the number of runs specified. Once-policy should > be considered a performance improvement, an assumption that the rule can > stabilize after just one run rather than an assumption that the rule won't be > applied more than once. Those once-policy rules should be able to run fine > with fixed-point policy rule as well. > Currently we already have a check for fixed-point that throws an exception if > the maximum number of runs is reached and the plan is still changing. Here, in > this PR, a similar check is added for once-policy that throws an exception if > the plan changes between the first run and the second run of a once-policy > rule. > To reproduce this issue, go to [https://github.com/apache/spark/pull/22060], > apply the changes and remove the specific rule from the whitelist > https://github.com/apache/spark/pull/22060/files#diff-f70523b948b7af21abddfa3ab7e1d7d6R71. 
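The check proposed for SPARK-25650 above — run a once-policy rule a second time and fail fast if the plan still changes — can be sketched outside Spark. Everything below is illustrative: `applyOnce` and the string-based "plans" are stand-ins, not Spark's Analyzer API.

```java
import java.util.function.UnaryOperator;

public class OncePolicyCheck {
    /**
     * Applies a rule that is supposed to be idempotent and throws if a second
     * application still changes the plan, mirroring the max-iterations check
     * that already exists for fixed-point batches.
     */
    public static String applyOnce(UnaryOperator<String> rule, String plan) {
        String once = rule.apply(plan);
        String twice = rule.apply(once);
        if (!twice.equals(once)) {
            throw new IllegalStateException(
                "Once-policy rule is not idempotent: plan keeps changing");
        }
        return once;
    }

    public static void main(String[] args) {
        // Idempotent rule: adds a null-check wrapper only if not already wrapped.
        UnaryOperator<String> good = p -> p.startsWith("IfNull(") ? p : "IfNull(" + p + ")";
        System.out.println(applyOnce(good, "udf(a)"));  // prints "IfNull(udf(a))"

        // Non-idempotent rule: wraps unconditionally, analogous to the broken
        // HandleNullInputsForUDF behavior these tickets describe.
        UnaryOperator<String> bad = p -> "IfNull(" + p + ")";
        try {
            applyOnce(bad, "udf(a)");
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

The design point is the one the description makes: once-policy is a performance assumption, so an idempotence check turns a silent wrong-plan bug into a loud failure.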
[jira] [Updated] (SPARK-23390) Flaky test: FileBasedDataSourceSuite
[ https://issues.apache.org/jira/browse/SPARK-23390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23390: -- Description: *RECENT HISTORY* [http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.FileBasedDataSourceSuite_name=%28It+is+not+a+test+it+is+a+sbt.testing.SuiteSelector%29] We're seeing multiple failures in {{FileBasedDataSourceSuite}} in {{spark-branch-2.3-test-sbt-hadoop-2.7}}: {code:java} org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to eventually never returned normally. Attempted 15 times over 10.01215805999 seconds. Last failure message: There are 1 possibly leaked file streams.. {code} Here's the full history: [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/] >From a very quick look, these failures seem to be correlated with >[https://github.com/apache/spark/pull/20479] (cc [~dongjoon]) as evident from >the following stack trace (full logs >[here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]): {code:java} [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds) 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled) 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem connection created at: java.lang.Throwable at org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36) at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769) at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173) at org.apache.orc.impl.RecordReaderImpl.(RecordReaderImpl.java:254) at 
org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633) at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138) {code} Also, while this might just be a false correlation, the frequency of these test failures has increased considerably in [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/] after [https://github.com/apache/spark/pull/20562] (cc [~feng...@databricks.com]) was merged. The following is Parquet leakage. {code:java} Caused by: sbt.ForkMain$ForkError: java.lang.Throwable: null at org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36) at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769) at org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:538) at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:149) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:400) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:125) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106) {code} - [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/322/] (May 3rd) - 
[https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/331/] (May 9th) - [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90536] (May 11th) - [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/342/] (May 16th) - [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/347/] (May 19th) - [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/367/] (June 2nd) was: We're seeing multiple failures in {{FileBasedDataSourceSuite}} in {{spark-branch-2.3-test-sbt-hadoop-2.7}}: {code:java}
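For context on the failure mode quoted above: the "possibly leaked file streams" message comes from Spark's test-only DebugFilesystem, which records a Throwable at every open() so that a later leak report can print the call site of the offending open. A rough, simplified sketch of that bookkeeping follows; this is not Spark's actual class, just the idea.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified sketch of stream-leak bookkeeping: pair each open() with a
// captured Throwable so the leak report can show where the stream was opened.
public class LeakTracker {
    private static final Map<Object, Throwable> openStreams = new ConcurrentHashMap<>();

    public static void onOpen(Object stream) {
        // new Throwable() captures the current stack, i.e. the open() call site.
        openStreams.put(stream, new Throwable());
    }

    public static void onClose(Object stream) {
        openStreams.remove(stream);
    }

    public static int possiblyLeaked() {
        return openStreams.size();
    }

    public static void main(String[] args) {
        Object s1 = new Object(), s2 = new Object();
        onOpen(s1);
        onOpen(s2);
        onClose(s1);
        // s2 was never closed, so a suite-level eventually-check would fail
        // with a message like "There are 1 possibly leaked file streams."
        System.out.println(possiblyLeaked() + " possibly leaked file streams");
    }
}
```

This explains why each flaky failure in the quoted logs is accompanied by a "Leaked filesystem connection created at:" stack trace: it is the Throwable captured at open time, not the stack of the failing assertion.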
[jira] [Updated] (SPARK-23390) Flaky test: FileBasedDataSourceSuite
[ https://issues.apache.org/jira/browse/SPARK-23390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23390: -- Affects Version/s: 2.4.0 > Flaky test: FileBasedDataSourceSuite > > > Key: SPARK-23390 > URL: https://issues.apache.org/jira/browse/SPARK-23390 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.0 >Reporter: Sameer Agarwal >Assignee: Wenchen Fan >Priority: Critical > > We're seeing multiple failures in {{FileBasedDataSourceSuite}} in > {{spark-branch-2.3-test-sbt-hadoop-2.7}}: > {code:java} > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 15 times over > 10.01215805999 seconds. Last failure message: There are 1 possibly leaked > file streams.. > {code} > Here's the full history: > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/] > From a very quick look, these failures seem to be correlated with > [https://github.com/apache/spark/pull/20479] (cc [~dongjoon]) as evident from > the following stack trace (full logs > [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]): > {code:java} > [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds) > 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in > stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled) > 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem > connection created at: > java.lang.Throwable > at > org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36) > at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769) > at > 
org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173) > at > org.apache.orc.impl.RecordReaderImpl.(RecordReaderImpl.java:254) > at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138) > {code} > Also, while this might be just a false correlation but the frequency of these > test failures have increased considerably in > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/] > after [https://github.com/apache/spark/pull/20562] (cc > [~feng...@databricks.com]) was merged. > The following is Parquet leakage. > {code:java} > Caused by: sbt.ForkMain$ForkError: java.lang.Throwable: null > at > org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36) > at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769) > at > org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:538) > at > org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:149) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:400) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:125) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179) > at > 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106) > {code} > - > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/322/] > (May 3rd) > - > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/331/] > (May 9th) > - [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90536] > (May 11th) > - > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/342/] > (May 16th) > - >
[jira] [Updated] (SPARK-25690) Analyzer rule "HandleNullInputsForUDF" does not stabilize and can be applied infinitely
[ https://issues.apache.org/jira/browse/SPARK-25690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maryann Xue updated SPARK-25690: Issue Type: Bug (was: Sub-task) Parent: (was: SPARK-14220) > Analyzer rule "HandleNullInputsForUDF" does not stabilize and can be applied > infinitely > --- > > Key: SPARK-25690 > URL: https://issues.apache.org/jira/browse/SPARK-25690 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.4.0 >Reporter: Maryann Xue >Assignee: Sean Owen >Priority: Major > Fix For: 2.4.0 > > > This was fixed in SPARK-24891 and was then broken by SPARK-25044. > The tests added in SPARK-24891 were not good enough and the expected failures > were shadowed by SPARK-24865. For more details, please refer to SPARK-25650. > Code changes and tests in > [https://github.com/apache/spark/pull/22060/files#diff-f70523b948b7af21abddfa3ab7e1d7d6R72] > can help reproduce the issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25690) Analyzer rule "HandleNullInputsForUDF" does not stabilize and can be applied infinitely
[ https://issues.apache.org/jira/browse/SPARK-25690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maryann Xue updated SPARK-25690: Issue Type: Sub-task (was: Bug) Parent: SPARK-25650 > Analyzer rule "HandleNullInputsForUDF" does not stabilize and can be applied > infinitely > --- > > Key: SPARK-25690 > URL: https://issues.apache.org/jira/browse/SPARK-25690 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 2.4.0 >Reporter: Maryann Xue >Assignee: Sean Owen >Priority: Major > Fix For: 2.4.0 > > > This was fixed in SPARK-24891 and was then broken by SPARK-25044. > The tests added in SPARK-24891 were not good enough and the expected failures > were shadowed by SPARK-24865. For more details, please refer to SPARK-25650. > Code changes and tests in > [https://github.com/apache/spark/pull/22060/files#diff-f70523b948b7af21abddfa3ab7e1d7d6R72] > can help reproduce the issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23390) Flaky test: FileBasedDataSourceSuite
[ https://issues.apache.org/jira/browse/SPARK-23390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16644034#comment-16644034 ] Shixiong Zhu commented on SPARK-23390: -- I think the issue is probably in ORC. Any exception thrown between https://github.com/apache/orc/blob/b21b5ffcc1efcbd4aef337fa6faae4d25262f8f1/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L252 and https://github.com/apache/orc/blob/b21b5ffcc1efcbd4aef337fa6faae4d25262f8f1/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L273 will leak `dataReader`. For example, cancelling a Spark task may cause https://github.com/apache/orc/blob/b21b5ffcc1efcbd4aef337fa6faae4d25262f8f1/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L273 to throw an exception. > Flaky test: FileBasedDataSourceSuite > > > Key: SPARK-23390 > URL: https://issues.apache.org/jira/browse/SPARK-23390 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.0 >Reporter: Sameer Agarwal >Assignee: Wenchen Fan >Priority: Critical > > *RECENT HISTORY* > [http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.FileBasedDataSourceSuite_name=%28It+is+not+a+test+it+is+a+sbt.testing.SuiteSelector%29] > > > We're seeing multiple failures in {{FileBasedDataSourceSuite}} in > {{spark-branch-2.3-test-sbt-hadoop-2.7}}: > {code:java} > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 15 times over > 10.01215805999 seconds. Last failure message: There are 1 possibly leaked > file streams.. 
> {code} > Here's the full history: > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/] > From a very quick look, these failures seem to be correlated with > [https://github.com/apache/spark/pull/20479] (cc [~dongjoon]) as evident from > the following stack trace (full logs > [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]): > {code:java} > [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds) > 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in > stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled) > 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem > connection created at: > java.lang.Throwable > at > org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36) > at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769) > at > org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173) > at > org.apache.orc.impl.RecordReaderImpl.(RecordReaderImpl.java:254) > at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138) > {code} > Also, while this might be just a false correlation but the frequency of these > test failures have increased considerably in > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/] > after [https://github.com/apache/spark/pull/20562] (cc > [~feng...@databricks.com]) was merged. > The following is Parquet leakage. 
> {code:java} > Caused by: sbt.ForkMain$ForkError: java.lang.Throwable: null > at > org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36) > at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769) > at > org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:538) > at > org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:149) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:400) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:125) > at >
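The leak window Shixiong Zhu describes above is a general Java pitfall: a resource opened early in a constructor (or init method) is lost if a later initialization step throws. A minimal sketch of the pattern and the usual fix — close the resource in a catch block before propagating — using made-up names rather than ORC's real API:

```java
// Illustrates the leak window described above: a resource opened early in
// initialization leaks if any later step throws. All names here are made up.
public class ReaderLeak {
    static class DataReader implements AutoCloseable {
        boolean closed = false;
        @Override public void close() { closed = true; }
    }

    static DataReader leakyInit(boolean failLater) {
        DataReader reader = new DataReader();               // resource opened
        if (failLater) {
            throw new RuntimeException("init failed");      // reader leaks here
        }
        return reader;
    }

    static DataReader safeInit(boolean failLater) {
        DataReader reader = new DataReader();
        try {
            if (failLater) {
                throw new RuntimeException("init failed");
            }
            return reader;
        } catch (RuntimeException e) {
            reader.close();  // release before propagating, so nothing leaks
            throw e;
        }
    }

    public static void main(String[] args) {
        try { leakyInit(true); } catch (RuntimeException ignored) { }
        try { safeInit(true); } catch (RuntimeException ignored) { }
        System.out.println("safe variant closes the reader on failure");
    }
}
```

Task cancellation makes this window easy to hit in practice, which would match the observed flakiness: the exception arrives at an arbitrary point inside initialization, exactly where `dataReader` has been opened but not yet handed to anything that will close it.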
[jira] [Assigned] (SPARK-25690) Analyzer rule "HandleNullInputsForUDF" does not stabilize and can be applied infinitely
[ https://issues.apache.org/jira/browse/SPARK-25690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-25690: - Assignee: (was: Sean Owen) Fix Version/s: (was: 2.4.0) > Analyzer rule "HandleNullInputsForUDF" does not stabilize and can be applied > infinitely > --- > > Key: SPARK-25690 > URL: https://issues.apache.org/jira/browse/SPARK-25690 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 2.4.0 >Reporter: Maryann Xue >Priority: Major > > This was fixed in SPARK-24891 and was then broken by SPARK-25044. > The tests added in SPARK-24891 were not good enough and the expected failures > were shadowed by SPARK-24865. For more details, please refer to SPARK-25650. > Code changes and tests in > [https://github.com/apache/spark/pull/22060/files#diff-f70523b948b7af21abddfa3ab7e1d7d6R72] > can help reproduce the issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25690) Analyzer rule "HandleNullInputsForUDF" does not stabilize and can be applied infinitely
[ https://issues.apache.org/jira/browse/SPARK-25690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16644117#comment-16644117 ] Sean Owen commented on SPARK-25690: --- I'm unclear what the particular problem is here. Is there a test case we can add that demonstrates the issue? Does it only affect 2.12? Why are there two separate child JIRAs? This is not fixed in 2.4, according to you, nor should I be assigned. > Analyzer rule "HandleNullInputsForUDF" does not stabilize and can be applied > infinitely > --- > > Key: SPARK-25690 > URL: https://issues.apache.org/jira/browse/SPARK-25690 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 2.4.0 >Reporter: Maryann Xue >Assignee: Sean Owen >Priority: Major > > This was fixed in SPARK-24891 and was then broken by SPARK-25044. > The tests added in SPARK-24891 were not good enough and the expected failures > were shadowed by SPARK-24865. For more details, please refer to SPARK-25650. > Code changes and tests in > [https://github.com/apache/spark/pull/22060/files#diff-f70523b948b7af21abddfa3ab7e1d7d6R72] > can help reproduce the issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org