[jira] [Commented] (SPARK-26998) spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor processes in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-26998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16785277#comment-16785277 ] Jungtaek Lim commented on SPARK-26998: -- [~toopt4] Yeah, I tend to agree that hiding more credentials is better, so I'm supportive of the change. Maybe also update the description of the JIRA issue where your patch originally landed. Btw, are there any existing tests or manual tests to verify that the keystore password and key password are not used? Just curious; I honestly don't know. > spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor > processes in Standalone mode > --- > > Key: SPARK-26998 > URL: https://issues.apache.org/jira/browse/SPARK-26998 > Project: Spark > Issue Type: Bug > Components: Scheduler, Security, Spark Core >Affects Versions: 2.3.3, 2.4.0 >Reporter: t oo >Priority: Major > Labels: SECURITY, Security, secur, security, security-issue > > Run Spark in standalone mode, then start a spark-submit requiring at least 1 > executor. Do a 'ps -ef' on Linux (e.g. in a PuTTY terminal) and you will be able to > see the spark.ssl.keyStorePassword value in plaintext! > > spark.ssl.keyStorePassword and spark.ssl.keyPassword don't need to be passed > to CoarseGrainedExecutorBackend. Only spark.ssl.trustStorePassword is used. > > Can be resolved if the below PR is merged: > [[Github] Pull Request #21514 > (tooptoop4)|https://github.com/apache/spark/pull/21514] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
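The PR referenced above removes the configs from the executor command line entirely, which is the right fix: key-based redaction (the approach behind Spark's `spark.redaction.regex` setting) only protects logs and the UI, not the live `ps -ef` command line. For illustration, a minimal Python sketch of that redaction idea; the pattern and the `*********(redacted)` mask are modeled on Spark's defaults, but this is not Spark's actual implementation:

```python
import re

# Assumed key pattern, modeled on Spark's default spark.redaction.regex
# ("(?i)secret|password|token"): any key matching it gets its value masked.
REDACTION_PATTERN = re.compile(r"(?i)secret|password|token")

def redact(args):
    """Mask the value of any 'key=value' argument whose key looks sensitive,
    so the argument list can be logged without leaking credentials."""
    redacted = []
    for arg in args:
        key, sep, _value = arg.partition("=")
        if sep and REDACTION_PATTERN.search(key):
            redacted.append(key + "=*********(redacted)")
        else:
            redacted.append(arg)
    return redacted
```

Applied to the command line from the report, `-Dspark.ssl.keyStorePassword=...` would be masked while `-Dspark.master=...` passes through unchanged; again, for the `ps -ef` case the only real remedy is to not pass the secret at all.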
[jira] [Created] (SPARK-27069) Spark (2.3.1) LDA transformation memory error (java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123))
TAESUK KIM created SPARK-27069: -- Summary: Spark (2.3.1) LDA transformation memory error (java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)) Key: SPARK-27069 URL: https://issues.apache.org/jira/browse/SPARK-27069 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.3.2 Environment: Below is my environment. DataSet # Documents: about 100,000,000 --> 10,000,000 --> 1,000,000 (all fail) # Words: about 3,553,918 (can't change) Spark environment # executor-memory, driver-memory: 18G --> 32G --> 64G --> 128G (all fail) # executor-cores, driver-cores: 3 # spark.serializer: default and org.apache.spark.serializer.KryoSerializer (both fail) # spark.executor.memoryOverhead: 18G --> 36G (fail) Java version: 1.8.0_191 (Oracle Corporation) Reporter: TAESUK KIM I trained an LDA model (feature dimension: 100, iterations: 100 or 50, distributed version, ml) using Spark 2.3.2 (emr-5.18.0). After that I wanted to transform a new DataSet using that model. But when I transform the new data, I always get a memory error (java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)). I reduced the data size to 0.1x, then 0.01x, but I still always get the error. That hugeCapacity error (overflow) happens when the size of an array exceeds Integer.MAX_VALUE - 8, yet I had already reduced the data to a small size, so I can't find why this error happens. I also wanted to switch the serializer to KryoSerializer, but I found that org.apache.spark.util.ClosureCleaner$.ensureSerializable always calls org.apache.spark.serializer.JavaSerializationStream even though I register Kryo classes. Is there anything I can do?
Below is the code:
{code:java}
val countvModel = CountVectorizerModel.load("s3://~/")
val ldaModel = DistributedLDAModel.load("s3://~/")
val transformeddata = countvModel.transform(inputData).select("productid", "itemid", "ptkString", "features")
var featureldaDF = ldaModel.transform(transformeddata).select("productid", "itemid", "topicDistribution", "ptkString").toDF("productid", "itemid", "features", "ptkString")
featureldaDF = featureldaDF.persist // this is line 328
{code}
Other testing # Java GC options: UseParallelGC, UseG1GC (all fail)
Below is the log:
{code}
19/03/05 20:59:03 ERROR ApplicationMaster: User class threw exception: java.lang.OutOfMemoryError
java.lang.OutOfMemoryError
	at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
	at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
	at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
	at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
	at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
	at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
	at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
	at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
	at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
	at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:342)
	at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:335)
	at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
	at org.apache.spark.SparkContext.clean(SparkContext.scala:2299)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:850)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:849)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:849)
	at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:608)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.columnar.InMemoryRelation.buildBuffers(InMemoryRelation.scala:107)
	at
{code}
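For context on the stack trace above: an OutOfMemoryError thrown from ByteArrayOutputStream.hugeCapacity (line 123 in JDK 8) means the requested capacity went negative, i.e. the 32-bit int capacity overflowed because a single Java-serialized stream crossed roughly 2 GB. Here the stream is produced by ClosureCleaner's serializability check, which (matching the reporter's observation) uses the JVM's Java serializer regardless of spark.serializer, so the closure or plan being serialized is huge independent of the input data size. A sketch mirroring the JDK 8 logic, with Python ints wrapped to Java's 32-bit semantics for illustration:

```python
# JDK 8 constants: arrays a little below Integer.MAX_VALUE are the practical cap.
INT_MAX = 2**31 - 1
MAX_ARRAY_SIZE = INT_MAX - 8  # the "Integer.MAX_VALUE - 8" from the description

def to_int32(n):
    """Wrap an unbounded Python int to Java's signed 32-bit overflow behavior."""
    n &= 0xFFFFFFFF
    return n - 2**32 if n >= 2**31 else n

def huge_capacity(min_capacity):
    """Mirror of java.io.ByteArrayOutputStream.hugeCapacity (JDK 8):
    a negative request means the capacity computation overflowed -> OOM."""
    if min_capacity < 0:
        raise MemoryError("java.lang.OutOfMemoryError")
    return INT_MAX if min_capacity > MAX_ARRAY_SIZE else MAX_ARRAY_SIZE

# A 1.5 GiB buffer doubling its capacity wraps the int request negative,
# which is the failure mode in the report.
doubled = to_int32((3 * 2**29) * 2)
```

So shrinking the input does not help once the serialized object itself is near the 2 GB array limit; the thing to shrink is what gets captured into the serialized closure/plan.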
[jira] [Created] (SPARK-27068) Support failed jobs ui and completed jobs ui use different queue
zhoukang created SPARK-27068: Summary: Support failed jobs ui and completed jobs ui use different queue Key: SPARK-27068 URL: https://issues.apache.org/jira/browse/SPARK-27068 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 2.4.0 Reporter: zhoukang For some long-running applications, we may want to look into the cause of failed jobs. But by then most jobs have completed, and the failed jobs may already have been evicted from the UI. We could use a different retention queue for these two kinds of jobs.
[jira] [Commented] (SPARK-27045) SQL tab in UI shows callsite instead of actual SQL
[ https://issues.apache.org/jira/browse/SPARK-27045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16785077#comment-16785077 ] Dongjoon Hyun commented on SPARK-27045: --- [~ajithshetty]. If this is not a regression in 2.3.2, we had better make this an `Improvement` issue. > SQL tab in UI shows callsite instead of actual SQL > -- > > Key: SPARK-27045 > URL: https://issues.apache.org/jira/browse/SPARK-27045 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.3.2, 2.3.3, 3.0.0 >Reporter: Ajith S >Priority: Major > Attachments: image-2019-03-04-18-24-27-469.png, > image-2019-03-04-18-24-54-053.png > > > When we run SQL in Spark (for example via the Thrift server), the Spark UI SQL > tab should show the SQL text, which is more useful to the end user, instead of a stacktrace. > Currently the description column shows the callsite short form, which is > less useful. > Actual: > !image-2019-03-04-18-24-27-469.png! > > Expected: > !image-2019-03-04-18-24-54-053.png!
[jira] [Assigned] (SPARK-26922) Set socket timeout consistently in Arrow optimization
[ https://issues.apache.org/jira/browse/SPARK-26922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-26922: Assignee: Hyukjin Kwon > Set socket timeout consistently in Arrow optimization > - > > Key: SPARK-26922 > URL: https://issues.apache.org/jira/browse/SPARK-26922 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Trivial > > For instance, see > https://github.com/apache/spark/blob/e8982ca7ad94e98d907babf2d6f1068b7cd064c6/R/pkg/R/context.R#L184 > it should set the timeout from {{SPARKR_BACKEND_CONNECTION_TIMEOUT}}. Or > maybe we need another environment variable. > This might be fixed together when the code around there is > touched.
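The pattern the issue asks for is simple: resolve the socket timeout from the {{SPARKR_BACKEND_CONNECTION_TIMEOUT}} environment variable instead of hard-coding it, with a fallback when the variable is unset. Sketched in Python for illustration (the variable name is from the issue; the 6000-second default is an assumption, not SparkR's documented value):

```python
import os

DEFAULT_TIMEOUT_SECS = 6000  # assumed default, for illustration only

def backend_connection_timeout(env=os.environ):
    """Resolve the backend socket timeout from the environment,
    falling back to the default when unset or malformed."""
    raw = env.get("SPARKR_BACKEND_CONNECTION_TIMEOUT")
    try:
        return int(raw)
    except (TypeError, ValueError):
        return DEFAULT_TIMEOUT_SECS
```

Every place that opens a socket to the backend would then call this resolver, which is what "set the timeout consistently" amounts to.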
[jira] [Resolved] (SPARK-26922) Set socket timeout consistently in Arrow optimization
[ https://issues.apache.org/jira/browse/SPARK-26922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26922. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23971 [https://github.com/apache/spark/pull/23971] > Set socket timeout consistently in Arrow optimization > - > > Key: SPARK-26922 > URL: https://issues.apache.org/jira/browse/SPARK-26922 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Trivial > Fix For: 3.0.0 > > > For instance, see > https://github.com/apache/spark/blob/e8982ca7ad94e98d907babf2d6f1068b7cd064c6/R/pkg/R/context.R#L184 > it should set the timeout from {{SPARKR_BACKEND_CONNECTION_TIMEOUT}}. Or > maybe we need another environment variable. > This might be fixed together when the code around there is > touched.
[jira] [Assigned] (SPARK-26881) Scaling issue with Gramian computation for RowMatrix: too many results sent to driver
[ https://issues.apache.org/jira/browse/SPARK-26881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26881: Assignee: (was: Apache Spark) > Scaling issue with Gramian computation for RowMatrix: too many results sent > to driver > - > > Key: SPARK-26881 > URL: https://issues.apache.org/jira/browse/SPARK-26881 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.2.0 >Reporter: Rafael RENAUDIN-AVINO >Priority: Minor > > This issue hit me when running PCA on a large dataset (~1 billion rows, ~30k > columns). Computing the Gramian of a big RowMatrix reproduces the issue. > > The problem arises in the treeAggregate phase of the Gramian matrix > computation: the results sent to the driver are enormous. > A potential solution could be to replace the hard-coded depth (2) of > the tree aggregation with a heuristic computed from the number of > partitions, the driver max result size, and the memory size of the dense vectors that > are being aggregated, cf. below for more detail: > (nb_partitions)^(1/depth) * dense_vector_size <= driver_max_result_size > I have a potential fix ready (currently testing it at scale), but I'd like to > hear the community's opinion about such a fix to know if it's worth investing > my time into a clean pull request. > > Note that I only faced this issue with Spark 2.2 but I suspect it affects > later versions as well.
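The proposed heuristic can be made concrete by solving the reporter's inequality, (nb_partitions)^(1/depth) * dense_vector_size <= driver_max_result_size, for depth: depth >= log(nb_partitions) / log(driver_max_result_size / dense_vector_size). A minimal sketch under those assumptions (illustrative names, not Spark's API; Spark's current hard-coded depth is 2):

```python
import math

def suggested_tree_depth(num_partitions, vector_size_bytes, max_result_size_bytes):
    """Smallest treeAggregate depth satisfying the reporter's constraint
        num_partitions ** (1 / depth) * vector_size <= max_result_size,
    i.e. depth >= log(num_partitions) / log(max_result_size / vector_size).
    Falls back to the current hard-coded depth of 2 when that is enough."""
    ratio = max_result_size_bytes / vector_size_bytes
    if num_partitions <= 1 or ratio >= num_partitions:
        return 2  # the default depth already satisfies the bound
    if ratio <= 1:
        raise ValueError("one aggregated vector already exceeds maxResultSize")
    return max(2, math.ceil(math.log(num_partitions) / math.log(ratio)))

# e.g. 10,000 partitions of 100 MiB dense vectors under a 1 GiB result limit
depth = suggested_tree_depth(10_000, 100 * 2**20, 2**30)
```

The ValueError branch captures the hard limit of this approach: no depth helps once a single aggregated vector (here, a slice of the Gramian) is itself larger than the driver's max result size.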
[jira] [Assigned] (SPARK-26881) Scaling issue with Gramian computation for RowMatrix: too many results sent to driver
[ https://issues.apache.org/jira/browse/SPARK-26881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26881: Assignee: Apache Spark > Scaling issue with Gramian computation for RowMatrix: too many results sent > to driver > - > > Key: SPARK-26881 > URL: https://issues.apache.org/jira/browse/SPARK-26881 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.2.0 >Reporter: Rafael RENAUDIN-AVINO >Assignee: Apache Spark >Priority: Minor > > This issue hit me when running PCA on a large dataset (~1 billion rows, ~30k > columns). Computing the Gramian of a big RowMatrix reproduces the issue. > > The problem arises in the treeAggregate phase of the Gramian matrix > computation: the results sent to the driver are enormous. > A potential solution could be to replace the hard-coded depth (2) of > the tree aggregation with a heuristic computed from the number of > partitions, the driver max result size, and the memory size of the dense vectors that > are being aggregated, cf. below for more detail: > (nb_partitions)^(1/depth) * dense_vector_size <= driver_max_result_size > I have a potential fix ready (currently testing it at scale), but I'd like to > hear the community's opinion about such a fix to know if it's worth investing > my time into a clean pull request. > > Note that I only faced this issue with Spark 2.2 but I suspect it affects > later versions as well.
[jira] [Commented] (SPARK-26742) Bump Kubernetes Client Version to 4.1.2
[ https://issues.apache.org/jira/browse/SPARK-26742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784970#comment-16784970 ] Stavros Kontopoulos commented on SPARK-26742: - [~jiaxin] I think. > Bump Kubernetes Client Version to 4.1.2 > --- > > Key: SPARK-26742 > URL: https://issues.apache.org/jira/browse/SPARK-26742 > Project: Spark > Issue Type: Dependency upgrade > Components: Kubernetes >Affects Versions: 2.4.0, 3.0.0 >Reporter: Steve Davids >Priority: Major > Labels: easyfix > Fix For: 3.0.0 > > > Spark 2.x is using Kubernetes Client 3.x which is pretty old, the master > branch has 4.0, the client should be upgraded to 4.1.1 to have the broadest > Kubernetes compatibility support for newer clusters: > https://github.com/fabric8io/kubernetes-client#compatibility-matrix
[jira] [Resolved] (SPARK-26727) CREATE OR REPLACE VIEW query fails with TableAlreadyExistsException
[ https://issues.apache.org/jira/browse/SPARK-26727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-26727. - Resolution: Not A Bug > CREATE OR REPLACE VIEW query fails with TableAlreadyExistsException > --- > > Key: SPARK-26727 > URL: https://issues.apache.org/jira/browse/SPARK-26727 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Srinivas Yarra >Priority: Major > > We experienced that sometimes the Hive query "CREATE OR REPLACE VIEW <view name> > AS SELECT <columns> FROM <table>" fails with the following exception: > {code:java} > // code placeholder > org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or > view '<view name>' already exists in database 'default'; at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:314) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:165) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195) at > org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195) at > org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364) at > org.apache.spark.sql.Dataset.(Dataset.scala:195) at > org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:80) at > org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642) 
... 49 elided > {code} > {code} > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res1: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res2: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res3: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res4: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res5: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res6: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res7: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res8: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res9: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res10: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res11: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") > org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or > view 'testsparkreplace' already exists in database 'default'; at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:246) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:236) > at > 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:236) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) > at > org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:236) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:319) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:165) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at >
[jira] [Commented] (SPARK-26727) CREATE OR REPLACE VIEW query fails with TableAlreadyExistsException
[ https://issues.apache.org/jira/browse/SPARK-26727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16784956#comment-16784956 ] Xiao Li commented on SPARK-26727: - I resolved the ticket as "Not a bug". This is kind of a well-known issue. We are trying to implement a new Catalog API and data source API in Spark 3.x. These issues will be gone for catalogs/data sources that can guarantee atomicity. > CREATE OR REPLACE VIEW query fails with TableAlreadyExistsException > --- > > Key: SPARK-26727 > URL: https://issues.apache.org/jira/browse/SPARK-26727 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Srinivas Yarra >Priority: Major > > We experienced that sometimes the Hive query "CREATE OR REPLACE VIEW <view name> > AS SELECT <columns> FROM <table>" fails with the following exception: > {code:java} > // code placeholder > org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or > view '<view name>' already exists in database 'default'; at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:314) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:165) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195) at > org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195) at > org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364) at > org.apache.spark.sql.Dataset.(Dataset.scala:195) at > org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:80) at > org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642) ... 49 elided > {code} > {code} > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res1: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res2: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res3: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res4: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res5: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res6: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res7: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res8: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res9: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res10: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res11: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") > 
org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or > view 'testsparkreplace' already exists in database 'default'; at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:246) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:236) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:236) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) > at > org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:236) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94) > at >
[jira] [Commented] (SPARK-26775) Update Jenkins nodes to support local volumes for K8s integration tests
[ https://issues.apache.org/jira/browse/SPARK-26775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16784953#comment-16784953 ] shane knapp commented on SPARK-26775: - btw, once https://issues.apache.org/jira/browse/SPARK-26742 is taken care of, we can continue w/this. > Update Jenkins nodes to support local volumes for K8s integration tests > --- > > Key: SPARK-26775 > URL: https://issues.apache.org/jira/browse/SPARK-26775 > Project: Spark > Issue Type: Improvement > Components: jenkins, Kubernetes >Affects Versions: 3.0.0 >Reporter: Stavros Kontopoulos >Assignee: shane knapp >Priority: Major > > The current version of Minikube on the test machines does not properly support the > local persistent volume feature required by this PR: > [https://github.com/apache/spark/pull/23514]. > We get this error: > "spec.local: Forbidden: Local volumes are disabled by feature-gate, > metadata.annotations: Required value: Local volume requires node affinity" > This is probably due to this: > [https://github.com/rancher/rancher/issues/13864] which implies that we need > to update to 1.10+ as described in > [https://kubernetes.io/docs/concepts/storage/volumes/#local]. The Fabric8io > client is already updated in the PR mentioned at the beginning.
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784945#comment-16784945 ] Stavros Kontopoulos commented on SPARK-18057: - Sure I will open a jira and take it from there. > Update structured streaming kafka from 0.10.0.1 to 2.0.0 > > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger >Assignee: Ted Yu >Priority: Major > Fix For: 2.4.0 > > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html
[jira] [Commented] (SPARK-26727) CREATE OR REPLACE VIEW query fails with TableAlreadyExistsException
[ https://issues.apache.org/jira/browse/SPARK-26727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16784949#comment-16784949 ] Xiao Li commented on SPARK-26727: - Hi all, this could happen since the whole DDL is not atomic. For example, if the connection is broken after an attempt to create a table in the Hive metastore, we do not know whether the table has been created. Thus, we will still try to recreate the table. > CREATE OR REPLACE VIEW query fails with TableAlreadyExistsException > --- > > Key: SPARK-26727 > URL: https://issues.apache.org/jira/browse/SPARK-26727 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Srinivas Yarra >Priority: Major > > We experienced that sometimes the Hive query "CREATE OR REPLACE VIEW <view name> > AS SELECT <columns> FROM <table>" fails with the following exception: > {code:java} > // code placeholder > org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or > view '<view name>' already exists in database 'default'; at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:314) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:165) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195) at > org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195) at > org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364) at > org.apache.spark.sql.Dataset.(Dataset.scala:195) at > org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:80) at > org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642) ... 49 elided > {code} > {code} > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res1: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res2: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res3: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res4: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res5: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res6: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res7: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res8: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res9: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res10: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res11: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") > 
org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or > view 'testsparkreplace' already exists in database 'default'; at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:246) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:236) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:236) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) > at > org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:236) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94) > at >
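Until a catalog that guarantees atomicity is available, a client-side workaround for the non-atomic drop-then-create sequence is to treat TableAlreadyExistsException as retryable rather than fatal: a "ghost" entry left by a lost acknowledgement gets dropped on the next attempt. A generic sketch of that retry pattern (the catalog class and exception below are stand-ins for illustration, not Spark's API):

```python
class TableAlreadyExistsError(Exception):
    pass

class GhostlyCatalog:
    """Stand-in for a non-atomic metastore: the first drop is 'lost'
    (acknowledged but not applied), so the following create collides."""
    def __init__(self, lost_drops=1):
        self.views = set()
        self._lost_drops = lost_drops

    def drop_view(self, name):
        if self._lost_drops > 0:
            self._lost_drops -= 1
            return  # ack'd but not actually applied
        self.views.discard(name)

    def create_view(self, name):
        if name in self.views:
            raise TableAlreadyExistsError(name)
        self.views.add(name)

def create_or_replace_view(catalog, name, retries=3):
    """Client-side CREATE OR REPLACE against a non-atomic catalog:
    drop, create, and retry on 'already exists' instead of failing."""
    for _ in range(retries):
        catalog.drop_view(name)
        try:
            catalog.create_view(name)
            return
        except TableAlreadyExistsError:
            continue  # a stale/ghost entry survived the drop; try again
    raise RuntimeError("could not replace view: " + name)
```

This only papers over the race; the real fix, as noted above, is an atomic CREATE OR REPLACE in the new catalog API.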
[jira] [Updated] (SPARK-26742) Bump Kubernetes Client Version to 4.1.2
[ https://issues.apache.org/jira/browse/SPARK-26742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shane knapp updated SPARK-26742: Summary: Bump Kubernetes Client Version to 4.1.2 (was: Bump Kubernetes Client Version to 4.1.1) > Bump Kubernetes Client Version to 4.1.2 > --- > > Key: SPARK-26742 > URL: https://issues.apache.org/jira/browse/SPARK-26742 > Project: Spark > Issue Type: Dependency upgrade > Components: Kubernetes >Affects Versions: 2.4.0, 3.0.0 >Reporter: Steve Davids >Priority: Major > Labels: easyfix > Fix For: 3.0.0 > > > Spark 2.x is using Kubernetes Client 3.x which is pretty old, the master > branch has 4.0, the client should be upgraded to 4.1.1 to have the broadest > Kubernetes compatibility support for newer clusters: > https://github.com/fabric8io/kubernetes-client#compatibility-matrix -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27025) Speed up toLocalIterator
[ https://issues.apache.org/jira/browse/SPARK-27025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784892#comment-16784892 ] Sean Owen commented on SPARK-27025: --- You'll want to cache() the thing you call toLocalIterator() on no matter what in this case. If it's not helping, then I think the delay remains the transferring of data to the driver, as it will all be computed and cached before you start. The 2-at-a-time implementation could help that and I'd be curious if it works out. > Speed up toLocalIterator > > > Key: SPARK-27025 > URL: https://issues.apache.org/jira/browse/SPARK-27025 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.3.3 >Reporter: Erik van Oosten >Priority: Major > > Method {{toLocalIterator}} fetches the partitions to the driver one by one. > However, as far as I can see, any required computation for the > yet-to-be-fetched-partitions is not kicked off until it is fetched. > Effectively only one partition is being computed at the same time. > Desired behavior: immediately start calculation of all partitions while > retaining the download-a-partition at a time behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
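The "compute the next partition while the current one is transferred" idea discussed above can be sketched without Spark at all. Below is a plain-Python illustration of the 2-at-a-time scheme, where each partition is modeled as a thunk and a background thread keeps one partition materialized ahead of the consumer; the function name and queue size are illustrative assumptions, not Spark's actual toLocalIterator implementation:

```python
import queue
import threading

def prefetching_iterator(partition_thunks, prefetch=1):
    """Yield rows partition by partition while a background thread
    materializes up to `prefetch` partitions ahead of the consumer."""
    buf = queue.Queue(maxsize=prefetch)
    sentinel = object()

    def producer():
        for compute in partition_thunks:
            buf.put(compute())  # blocks once `prefetch` partitions are ready
        buf.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        part = buf.get()
        if part is sentinel:
            return
        yield from part

# Each "partition" is a thunk standing in for a partition computation.
parts = [lambda i=i: [i * 10 + j for j in range(3)] for i in range(3)]
rows = list(prefetching_iterator(parts))  # [0, 1, 2, 10, 11, 12, 20, 21, 22]
```

With `prefetch=1`, partition N+1 is being computed while partition N is consumed, which is exactly the overlap the comment suggests could hide the driver-transfer latency.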
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784914#comment-16784914 ] Sean Owen commented on SPARK-18057: --- [~skonto] go for it. I lost the context on this one but if we need to further update the Kafka client or clarify docs, that's good for Spark 3. > Update structured streaming kafka from 0.10.0.1 to 2.0.0 > > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger >Assignee: Ted Yu >Priority: Major > Fix For: 2.4.0 > > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26742) Bump Kubernetes Client Version to 4.1.1
[ https://issues.apache.org/jira/browse/SPARK-26742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784911#comment-16784911 ] shane knapp commented on SPARK-26742: - any idea who might be doing the PR to bump the client to 4.1.2? > Bump Kubernetes Client Version to 4.1.1 > --- > > Key: SPARK-26742 > URL: https://issues.apache.org/jira/browse/SPARK-26742 > Project: Spark > Issue Type: Dependency upgrade > Components: Kubernetes >Affects Versions: 2.4.0, 3.0.0 >Reporter: Steve Davids >Priority: Major > Labels: easyfix > Fix For: 3.0.0 > > > Spark 2.x is using Kubernetes Client 3.x which is pretty old, the master > branch has 4.0, the client should be upgraded to 4.1.1 to have the broadest > Kubernetes compatibility support for newer clusters: > https://github.com/fabric8io/kubernetes-client#compatibility-matrix -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27015) spark-submit does not properly escape arguments sent to Mesos dispatcher
[ https://issues.apache.org/jira/browse/SPARK-27015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-27015. Resolution: Fixed Assignee: Martin Loncaric Fix Version/s: (was: 2.5.0) > spark-submit does not properly escape arguments sent to Mesos dispatcher > > > Key: SPARK-27015 > URL: https://issues.apache.org/jira/browse/SPARK-27015 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.3.3, 2.4.0 >Reporter: Martin Loncaric >Assignee: Martin Loncaric >Priority: Major > Fix For: 3.0.0 > > > Arguments sent to the dispatcher must be escaped; for instance, > {noformat}spark-submit --master mesos://url:port my.jar --arg1 "a > b$c"{noformat} > fails, and instead must be submitted as > {noformat}spark-submit --master mesos://url:port my.jar --arg1 "a\\ > b\\$c"{noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
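The underlying fix for this class of bug is to shell-quote each user argument before embedding it in the command line the dispatcher builds. A minimal sketch of the idea in Python using the standard library's `shlex.quote` (purely illustrative: the command template and helper name below are assumptions, not the code path Spark's Mesos scheduler actually uses):

```python
import shlex

def build_submit_command(jar, args):
    """Assemble a spark-submit-style command line, quoting each user
    argument so whitespace and shell metacharacters survive intact."""
    quoted = " ".join(shlex.quote(a) for a in args)
    return f"spark-submit --master mesos://url:port {shlex.quote(jar)} {quoted}"

# Without quoting, "a b$c" would be split on the space and "$c" expanded
# by the shell on the dispatcher side.
cmd = build_submit_command("my.jar", ["--arg1", "a b$c"])
# cmd == "spark-submit --master mesos://url:port my.jar --arg1 'a b$c'"
```

Quoting at assembly time removes the need for users to hand-escape arguments the way the description shows.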
[jira] [Comment Edited] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784900#comment-16784900 ] Stavros Kontopoulos edited comment on SPARK-18057 at 3/5/19 9:05 PM: - [~srowen] It seems the upgrade solves this issue: http://apache-spark-developers-list.1001551.n3.nabble.com/Question-about-upgrading-Kafka-client-version-td21140.html. If so, shouldn't we update the docs about the heartbeat timeout? It seems confusing right now why structured streaming does not require setting several parameters compared to the DStreams API. was (Author: skonto): [~srowen] It seems the upgrade solves this issue: http://apache-spark-developers-list.1001551.n3.nabble.com/Question-about-upgrading-Kafka-client-version-td21140.html. If so shouldnt we update the docs about the heartbeat timeout. > Update structured streaming kafka from 0.10.0.1 to 2.0.0 > > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger >Assignee: Ted Yu >Priority: Major > Fix For: 2.4.0 > > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784900#comment-16784900 ] Stavros Kontopoulos commented on SPARK-18057: - [~srowen] It seems the upgrade solves this issue: http://apache-spark-developers-list.1001551.n3.nabble.com/Question-about-upgrading-Kafka-client-version-td21140.html. If so, shouldn't we update the docs about the heartbeat issue? > Update structured streaming kafka from 0.10.0.1 to 2.0.0 > > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger >Assignee: Ted Yu >Priority: Major > Fix For: 2.4.0 > > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27021) Leaking Netty event loop group for shuffle chunk fetch requests
[ https://issues.apache.org/jira/browse/SPARK-27021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-27021. Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23930 [https://github.com/apache/spark/pull/23930] > Leaking Netty event loop group for shuffle chunk fetch requests > --- > > Key: SPARK-27021 > URL: https://issues.apache.org/jira/browse/SPARK-27021 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 2.4.1, 3.0.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > Fix For: 3.0.0 > > > The extra event loop group created for handling shuffle chunk fetch requests > is never closed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27021) Leaking Netty event loop group for shuffle chunk fetch requests
[ https://issues.apache.org/jira/browse/SPARK-27021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-27021: -- Assignee: Attila Zsolt Piros > Leaking Netty event loop group for shuffle chunk fetch requests > --- > > Key: SPARK-27021 > URL: https://issues.apache.org/jira/browse/SPARK-27021 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 2.4.1, 3.0.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > > The extra event loop group created for handling shuffle chunk fetch requests > is never closed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26932) Add a warning for Hive 2.1.1 ORC reader issue
[ https://issues.apache.org/jira/browse/SPARK-26932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784858#comment-16784858 ] Dongjoon Hyun commented on SPARK-26932: --- Thank you, [~haiboself]. I added you to the Apache Spark contributor group. > Add a warning for Hive 2.1.1 ORC reader issue > - > > Key: SPARK-26932 > URL: https://issues.apache.org/jira/browse/SPARK-26932 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0 >Reporter: Bo Hai >Assignee: Bo Hai >Priority: Minor > Fix For: 2.4.2, 3.0.0 > > > As of Spark 2.3 and Hive 2.3, both support using apache/orc as the ORC writer > and reader. Older versions of Hive implemented their own ORC reader, which isn't > forward-compatible. > So Hive 2.2 and older cannot read ORC tables created by Spark 2.3 and newer, > which use apache/orc instead of Hive's ORC reader. > I think we should add this information to the Spark 2.4 ORC documentation page > : https://spark.apache.org/docs/2.4.0/sql-data-sources-orc.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26944) Python unit-tests.log not available in artifacts for a build in Jenkins
[ https://issues.apache.org/jira/browse/SPARK-26944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784853#comment-16784853 ] Alessandro Bellina commented on SPARK-26944: [~shaneknapp] nice!! thank you > Python unit-tests.log not available in artifacts for a build in Jenkins > --- > > Key: SPARK-26944 > URL: https://issues.apache.org/jira/browse/SPARK-26944 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Alessandro Bellina >Assignee: shane knapp >Priority: Minor > Attachments: Screen Shot 2019-03-05 at 12.08.43 PM.png > > > I had a pr where the python unit tests failed. The tests point at the > `/home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log` file, > but I can't get to that from jenkins UI it seems (are all prs writing to the > same file?). > {code:java} > > Running PySpark tests > > Running PySpark tests. Output is in > /home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log{code} > For reference, please see this build: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102518/console > This Jira is to make it available under the artifacts for each build. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26944) Python unit-tests.log not available in artifacts for a build in Jenkins
[ https://issues.apache.org/jira/browse/SPARK-26944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784852#comment-16784852 ] shane knapp commented on SPARK-26944: - added a glob to store these (see attached image). !Screen Shot 2019-03-05 at 12.08.43 PM.png! > Python unit-tests.log not available in artifacts for a build in Jenkins > --- > > Key: SPARK-26944 > URL: https://issues.apache.org/jira/browse/SPARK-26944 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Alessandro Bellina >Assignee: shane knapp >Priority: Minor > Attachments: Screen Shot 2019-03-05 at 12.08.43 PM.png > > > I had a pr where the python unit tests failed. The tests point at the > `/home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log` file, > but I can't get to that from jenkins UI it seems (are all prs writing to the > same file?). > {code:java} > > Running PySpark tests > > Running PySpark tests. Output is in > /home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log{code} > For reference, please see this build: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102518/console > This Jira is to make it available under the artifacts for each build. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26944) Python unit-tests.log not available in artifacts for a build in Jenkins
[ https://issues.apache.org/jira/browse/SPARK-26944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784856#comment-16784856 ] shane knapp commented on SPARK-26944: - ill confirm that this works after the current PRB builds finish before closing this. > Python unit-tests.log not available in artifacts for a build in Jenkins > --- > > Key: SPARK-26944 > URL: https://issues.apache.org/jira/browse/SPARK-26944 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Alessandro Bellina >Assignee: shane knapp >Priority: Minor > Attachments: Screen Shot 2019-03-05 at 12.08.43 PM.png > > > I had a pr where the python unit tests failed. The tests point at the > `/home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log` file, > but I can't get to that from jenkins UI it seems (are all prs writing to the > same file?). > {code:java} > > Running PySpark tests > > Running PySpark tests. Output is in > /home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log{code} > For reference, please see this build: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102518/console > This Jira is to make it available under the artifacts for each build. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27025) Speed up toLocalIterator
[ https://issues.apache.org/jira/browse/SPARK-27025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik van Oosten resolved SPARK-27025. - Resolution: Incomplete > Speed up toLocalIterator > > > Key: SPARK-27025 > URL: https://issues.apache.org/jira/browse/SPARK-27025 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.3.3 >Reporter: Erik van Oosten >Priority: Major > > Method {{toLocalIterator}} fetches the partitions to the driver one by one. > However, as far as I can see, any required computation for the > yet-to-be-fetched-partitions is not kicked off until it is fetched. > Effectively only one partition is being computed at the same time. > Desired behavior: immediately start calculation of all partitions while > retaining the download-a-partition at a time behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26932) Add a warning for Hive 2.1.1 ORC reader issue
[ https://issues.apache.org/jira/browse/SPARK-26932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-26932. --- Resolution: Fixed Assignee: Bo Hai Fix Version/s: 3.0.0 2.4.2 This is resolved via https://github.com/apache/spark/commit/c27caead43423d1f994f42502496d57ea8389dc0 . > Add a warning for Hive 2.1.1 ORC reader issue > - > > Key: SPARK-26932 > URL: https://issues.apache.org/jira/browse/SPARK-26932 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0 >Reporter: Bo Hai >Assignee: Bo Hai >Priority: Minor > Fix For: 2.4.2, 3.0.0 > > > As of Spark 2.3 and Hive 2.3, both support using apache/orc as the ORC writer > and reader. Older versions of Hive implemented their own ORC reader, which isn't > forward-compatible. > So Hive 2.2 and older cannot read ORC tables created by Spark 2.3 and newer, > which use apache/orc instead of Hive's ORC reader. > I think we should add this information to the Spark 2.4 ORC documentation page > : https://spark.apache.org/docs/2.4.0/sql-data-sources-orc.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27025) Speed up toLocalIterator
[ https://issues.apache.org/jira/browse/SPARK-27025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784854#comment-16784854 ] Erik van Oosten commented on SPARK-27025: - If there is no obvious way to improve Spark, then it's probably better to close this issue until someone finds a better angle. BTW, the cache/count/iterate/unpersist cycle did not make it faster for my use case. I will try the 2-partition implementation of toLocalIterator. [~srowen], [~hyukjin.kwon], thanks for your input! > Speed up toLocalIterator > > > Key: SPARK-27025 > URL: https://issues.apache.org/jira/browse/SPARK-27025 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.3.3 >Reporter: Erik van Oosten >Priority: Major > > Method {{toLocalIterator}} fetches the partitions to the driver one by one. > However, as far as I can see, any required computation for the > yet-to-be-fetched-partitions is not kicked off until it is fetched. > Effectively only one partition is being computed at the same time. > Desired behavior: immediately start calculation of all partitions while > retaining the download-a-partition at a time behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26932) Add a warning for Hive 2.1.1 ORC reader issue
[ https://issues.apache.org/jira/browse/SPARK-26932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26932: -- Summary: Add a warning for Hive 2.1.1 ORC reader issue (was: Orc compatibility between hive and spark) > Add a warning for Hive 2.1.1 ORC reader issue > - > > Key: SPARK-26932 > URL: https://issues.apache.org/jira/browse/SPARK-26932 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0 >Reporter: Bo Hai >Priority: Minor > > As of Spark 2.3 and Hive 2.3, both support using apache/orc as the ORC writer > and reader. Older versions of Hive implemented their own ORC reader, which isn't > forward-compatible. > So Hive 2.2 and older cannot read ORC tables created by Spark 2.3 and newer, > which use apache/orc instead of Hive's ORC reader. > I think we should add this information to the Spark 2.4 ORC documentation page > : https://spark.apache.org/docs/2.4.0/sql-data-sources-orc.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26944) Python unit-tests.log not available in artifacts for a build in Jenkins
[ https://issues.apache.org/jira/browse/SPARK-26944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shane knapp updated SPARK-26944: Attachment: Screen Shot 2019-03-05 at 12.08.43 PM.png > Python unit-tests.log not available in artifacts for a build in Jenkins > --- > > Key: SPARK-26944 > URL: https://issues.apache.org/jira/browse/SPARK-26944 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Alessandro Bellina >Assignee: shane knapp >Priority: Minor > Attachments: Screen Shot 2019-03-05 at 12.08.43 PM.png > > > I had a pr where the python unit tests failed. The tests point at the > `/home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log` file, > but I can't get to that from jenkins UI it seems (are all prs writing to the > same file?). > {code:java} > > Running PySpark tests > > Running PySpark tests. Output is in > /home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log{code} > For reference, please see this build: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102518/console > This Jira is to make it available under the artifacts for each build. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26944) Python unit-tests.log not available in artifacts for a build in Jenkins
[ https://issues.apache.org/jira/browse/SPARK-26944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shane knapp reassigned SPARK-26944: --- Assignee: shane knapp > Python unit-tests.log not available in artifacts for a build in Jenkins > --- > > Key: SPARK-26944 > URL: https://issues.apache.org/jira/browse/SPARK-26944 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Alessandro Bellina >Assignee: shane knapp >Priority: Minor > > I had a pr where the python unit tests failed. The tests point at the > `/home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log` file, > but I can't get to that from jenkins UI it seems (are all prs writing to the > same file?). > {code:java} > > Running PySpark tests > > Running PySpark tests. Output is in > /home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log{code} > For reference, please see this build: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102518/console > This Jira is to make it available under the artifacts for each build. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23961) pyspark toLocalIterator throws an exception
[ https://issues.apache.org/jira/browse/SPARK-23961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784843#comment-16784843 ] Bryan Cutler commented on SPARK-23961: -- I could also reproduce with a nearly identical error using the following {code} import time from pyspark.sql import SparkSession from pyspark.sql.functions import rand, udf from pyspark.sql.types import * spark = SparkSession\ .builder\ .appName("toLocalIterator_Test")\ .getOrCreate() df = spark.range(1 << 16).select(rand()) it = df.toLocalIterator() print(next(it)) it = None time.sleep(5) spark.stop() {code} I think there are a couple issues with the way this is currently working. When toLocalIterator is called in Python, the Scala side also creates a local iterator which immediately starts a loop to consume the entire iterator and write it all to Python without any synchronization with the Python iterator. Blocking the write operation only happens when the socket receive buffer is full. Small examples work fine if the data all fits in the read buffer, but the above code fails because the writing becomes blocked, then the Python iterator stops reading and closes the connection, which the Scala side sees as an error. I can work on a fix for this. > pyspark toLocalIterator throws an exception > --- > > Key: SPARK-23961 > URL: https://issues.apache.org/jira/browse/SPARK-23961 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0 >Reporter: Michel Lemay >Priority: Minor > Labels: DataFrame, pyspark > > Given a dataframe and use toLocalIterator. 
If we do not consume all records, > it will throw: > {quote}ERROR PythonRDD: Error while sending iterator > java.net.SocketException: Connection reset by peer: socket write error > at java.net.SocketOutputStream.socketWrite0(Native Method) > at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111) > at java.net.SocketOutputStream.write(SocketOutputStream.java:155) > at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122) > at java.io.DataOutputStream.write(DataOutputStream.java:107) > at java.io.FilterOutputStream.write(FilterOutputStream.java:97) > at > org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:497) > at > org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509) > at > org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509) > at > org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:705) > at > org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705) > at > org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337) > at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:706) > {quote} > > To reproduce, here is a simple pyspark shell script that shows the error: > {quote}import itertools > df = spark.read.parquet("large parquet folder").cache() > print(df.count()) > b = df.toLocalIterator() > print(len(list(itertools.islice(b, 20)))) > b = None # Make the iterator go out of scope. Throws here. > {quote} > > Observations: > * Consuming all records does not throw.
Taking only a subset of the > partitions creates the error. > * In another experiment, doing the same on a regular RDD works if we > cache/materialize it. If we do not cache the RDD, it throws similarly. > * It works in the Scala shell > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26947) Pyspark KMeans Clustering job fails on large values of k
[ https://issues.apache.org/jira/browse/SPARK-26947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Parth Gandhi resolved SPARK-26947. -- Resolution: Invalid > Pyspark KMeans Clustering job fails on large values of k > > > Key: SPARK-26947 > URL: https://issues.apache.org/jira/browse/SPARK-26947 > Project: Spark > Issue Type: Bug > Components: ML, MLlib, PySpark >Affects Versions: 2.4.0 >Reporter: Parth Gandhi >Priority: Minor > Attachments: clustering_app.py > > > We recently had a case where a user's pyspark job running KMeans clustering > was failing for large values of k. I was able to reproduce the same issue > with dummy dataset. I have attached the code as well as the data in the JIRA. > The stack trace is printed below from Java: > > {code:java} > Exception in thread "Thread-10" java.lang.OutOfMemoryError: Java heap space > at java.util.Arrays.copyOf(Arrays.java:3332) > at > java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124) > at > java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:649) > at java.lang.StringBuilder.append(StringBuilder.java:202) > at py4j.Protocol.getOutputCommand(Protocol.java:328) > at py4j.commands.CallCommand.execute(CallCommand.java:81) > at py4j.GatewayConnection.run(GatewayConnection.java:238) > at java.lang.Thread.run(Thread.java:748) > {code} > Python: > {code:java} > Traceback (most recent call last): > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1159, in send_command > raise Py4JNetworkError("Answer from Java side is empty") > py4j.protocol.Py4JNetworkError: Answer from Java side is empty > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 985, in send_command > response = connection.send_command(command) > File > 
"/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1164, in send_command > "Error while receiving", e, proto.ERROR_ON_RECEIVE) > py4j.protocol.Py4JNetworkError: Error while receiving > Traceback (most recent call last): > File "clustering_app.py", line 154, in > main(args) > File "clustering_app.py", line 145, in main > run_clustering(sc, args.input_path, args.output_path, > args.num_clusters_list) > File "clustering_app.py", line 136, in run_clustering > clustersTable, cluster_Centers = clustering(sc, documents, output_path, > k, max_iter) > File "clustering_app.py", line 68, in clustering > cluster_Centers = km_model.clusterCenters() > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/clustering.py", > line 337, in clusterCenters > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/wrapper.py", > line 55, in _call_java > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/common.py", > line 109, in _java2py > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1257, in __call__ > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/sql/utils.py", > line 63, in deco > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/protocol.py", > line 336, in get_return_value > py4j.protocol.Py4JError: An error occurred while calling > z:org.apache.spark.ml.python.MLSerDe.dumps > {code} > The command with which the application was launched is given below: > {code:java} > $SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster --conf > spark.executor.memory=20g --conf spark.driver.memory=20g --conf > spark.executor.memoryOverhead=4g 
--conf spark.driver.memoryOverhead=4g --conf > spark.kryoserializer.buffer.max=2000m --conf spark.driver.maxResultSize=12g > ~/clustering_app.py --input_path hdfs:///user/username/part-v001x > --output_path hdfs:///user/username --num_clusters_list 1 > {code} > The input dataset is approximately 90 MB in size and the assigned heap memory > to both driver and executor is close to 20 GB. This only happens for large > values of k. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
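The failure mode above is plausible from first principles: `clusterCenters()` ships k dense centers of d doubles back to Python through py4j as one buffer built in a single JVM string builder, so the payload grows as k × d × 8 bytes. A back-of-the-envelope check (the k and d values below are illustrative assumptions, not the reporter's exact figures):

```python
# Rough, illustrative estimate of the memory needed to ship KMeans cluster
# centers to the driver through py4j in one buffer. The figures are
# assumptions for illustration, not taken from the reporter's dataset.
BYTES_PER_DOUBLE = 8

def center_payload_bytes(k, dims):
    """Approximate payload size for k dense cluster centers of `dims` features."""
    return k * dims * BYTES_PER_DOUBLE

# e.g. 50,000 clusters over a 100,000-feature space is already tens of GB,
# far beyond what a single JVM StringBuilder/heap can absorb.
payload = center_payload_bytes(50_000, 100_000)
print(payload / 1024**3)  # size in GiB
```

This is why the job only fails for large k: the dataset itself is small, but the driver-side serialized result is not.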
[jira] [Commented] (SPARK-26947) Pyspark KMeans Clustering job fails on large values of k
[ https://issues.apache.org/jira/browse/SPARK-26947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784831#comment-16784831 ] Parth Gandhi commented on SPARK-26947: -- [~srowen] Yes your suggestion to limit the vocab size helps. Closing this JIRA. Thank you. > Pyspark KMeans Clustering job fails on large values of k > > > Key: SPARK-26947 > URL: https://issues.apache.org/jira/browse/SPARK-26947 > Project: Spark > Issue Type: Bug > Components: ML, MLlib, PySpark >Affects Versions: 2.4.0 >Reporter: Parth Gandhi >Priority: Minor > Attachments: clustering_app.py > > > We recently had a case where a user's pyspark job running KMeans clustering > was failing for large values of k. I was able to reproduce the same issue > with dummy dataset. I have attached the code as well as the data in the JIRA. > The stack trace is printed below from Java: > > {code:java} > Exception in thread "Thread-10" java.lang.OutOfMemoryError: Java heap space > at java.util.Arrays.copyOf(Arrays.java:3332) > at > java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124) > at > java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:649) > at java.lang.StringBuilder.append(StringBuilder.java:202) > at py4j.Protocol.getOutputCommand(Protocol.java:328) > at py4j.commands.CallCommand.execute(CallCommand.java:81) > at py4j.GatewayConnection.run(GatewayConnection.java:238) > at java.lang.Thread.run(Thread.java:748) > {code} > Python: > {code:java} > Traceback (most recent call last): > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1159, in send_command > raise Py4JNetworkError("Answer from Java side is empty") > py4j.protocol.Py4JNetworkError: Answer from Java side is empty > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File > 
"/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 985, in send_command > response = connection.send_command(command) > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1164, in send_command > "Error while receiving", e, proto.ERROR_ON_RECEIVE) > py4j.protocol.Py4JNetworkError: Error while receiving > Traceback (most recent call last): > File "clustering_app.py", line 154, in > main(args) > File "clustering_app.py", line 145, in main > run_clustering(sc, args.input_path, args.output_path, > args.num_clusters_list) > File "clustering_app.py", line 136, in run_clustering > clustersTable, cluster_Centers = clustering(sc, documents, output_path, > k, max_iter) > File "clustering_app.py", line 68, in clustering > cluster_Centers = km_model.clusterCenters() > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/clustering.py", > line 337, in clusterCenters > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/wrapper.py", > line 55, in _call_java > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/common.py", > line 109, in _java2py > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1257, in __call__ > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/sql/utils.py", > line 63, in deco > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/protocol.py", > line 336, in get_return_value > py4j.protocol.Py4JError: An error occurred while calling > z:org.apache.spark.ml.python.MLSerDe.dumps > {code} > The command with which the application was launched is given below: 
> {code:java} > $SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster --conf > spark.executor.memory=20g --conf spark.driver.memory=20g --conf > spark.executor.memoryOverhead=4g --conf spark.driver.memoryOverhead=4g --conf > spark.kryoserializer.buffer.max=2000m --conf spark.driver.maxResultSize=12g > ~/clustering_app.py --input_path hdfs:///user/username/part-v001x > --output_path hdfs:///user/username --num_clusters_list 1 > {code} > The input dataset is approximately 90 MB in size and the assigned heap memory > to both driver and executor is close to 20 GB. This only happens for large > values of k.
[jira] [Updated] (SPARK-27043) Add ORC nested schema pruning benchmarks
[ https://issues.apache.org/jira/browse/SPARK-27043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27043: -- Summary: Add ORC nested schema pruning benchmarks (was: Nested schema pruning benchmark for ORC) > Add ORC nested schema pruning benchmarks > > > Key: SPARK-27043 > URL: https://issues.apache.org/jira/browse/SPARK-27043 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 3.0.0 > > > We have a benchmark for nested schema pruning, but only for Parquet. This adds a > similar benchmark for ORC, to be used with ORC's nested schema pruning.
[jira] [Resolved] (SPARK-27043) Nested schema pruning benchmark for ORC
[ https://issues.apache.org/jira/browse/SPARK-27043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27043. --- Resolution: Fixed Assignee: Liang-Chi Hsieh Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/23955 > Nested schema pruning benchmark for ORC > --- > > Key: SPARK-27043 > URL: https://issues.apache.org/jira/browse/SPARK-27043 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 3.0.0 > > > We have a benchmark for nested schema pruning, but only for Parquet. This adds a > similar benchmark for ORC, to be used with ORC's nested schema pruning.
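For readers unfamiliar with the feature being benchmarked: nested schema pruning means reading only the leaves of a nested type that the query actually touches. A toy model of the idea (an illustration only, not the Catalyst or ORC reader implementation):

```python
# Illustrative-only sketch of what nested schema pruning does: given a
# nested schema and the dotted field paths a query touches, keep only
# those leaves. Toy model, not Spark's implementation.
def prune(schema, required):
    """schema: {field: subschema-or-None}; required: iterable of dotted paths."""
    out = {}
    for path in required:
        head, _, rest = path.partition(".")
        if head not in schema:
            continue
        if rest:
            child = prune(schema[head] or {}, [rest])
            out.setdefault(head, {}).update(child)
        else:
            out[head] = schema[head]
    return out

schema = {"name": None, "address": {"city": None, "zip": None}}
print(prune(schema, ["address.city"]))  # {'address': {'city': None}}
```

The benchmark added here measures how much I/O this saves for ORC, as the existing benchmark already does for Parquet.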
[jira] [Commented] (SPARK-26998) spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor processes in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-26998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784791#comment-16784791 ] t oo commented on SPARK-26998: -- [~gsomogyi] please take it forward. [~kabhwan] truststore password being shown is not much of a problem since truststore is often distributed to users anyway. But keystore password still being shown is the big no-no. > spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor > processes in Standalone mode > --- > > Key: SPARK-26998 > URL: https://issues.apache.org/jira/browse/SPARK-26998 > Project: Spark > Issue Type: Bug > Components: Scheduler, Security, Spark Core >Affects Versions: 2.3.3, 2.4.0 >Reporter: t oo >Priority: Major > Labels: SECURITY, Security, secur, security, security-issue > > Run spark standalone mode, then start a spark-submit requiring at least 1 > executor. Do a 'ps -ef' on linux (ie putty terminal) and you will be able to > see spark.ssl.keyStorePassword value in plaintext! > > spark.ssl.keyStorePassword and spark.ssl.keyPassword don't need to be passed > to CoarseGrainedExecutorBackend. Only spark.ssl.trustStorePassword is used. > > Can be resolved if below PR is merged: > [[Github] Pull Request #21514 > (tooptoop4)|https://github.com/apache/spark/pull/21514] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
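The leak described above needs no cluster to demonstrate: process arguments are world-readable via `ps`/`/proc`, so any password passed on the executor command line is exposed. A minimal simulation, with a made-up command line and placeholder passwords:

```python
# Simulation (made-up command line, placeholder secrets) of what any local
# user recovers from 'ps -ef' output when SSL passwords are passed as JVM
# options to the executor process in standalone mode.
import re

cmdline = ("java -cp ... org.apache.spark.executor.CoarseGrainedExecutorBackend "
           "-Dspark.ssl.keyStorePassword=hunter2 "
           "-Dspark.ssl.trustStorePassword=changeit")

# Equivalent of: ps -ef | grep -o 'spark.ssl.*[Pp]assword=...'
leaked = re.findall(r"spark\.ssl\.\w*[Pp]assword=\S+", cmdline)
print(leaked)
```

As the comment notes, the truststore password leaking is less severe (the truststore is often distributed anyway), but the keystore password must not appear here at all.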
[jira] [Commented] (SPARK-13091) Rewrite/Propagate constraints for Aliases
[ https://issues.apache.org/jira/browse/SPARK-13091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784776#comment-16784776 ] Ajith S commented on SPARK-13091: - Can this document be made accessible? [https://docs.google.com/document/d/1WQRgDurUBV9Y6CWOBS75PQIqJwT-6WftVa18xzm7nCo/edit#heading=h.6hjcndo36qze] > Rewrite/Propagate constraints for Aliases > - > > Key: SPARK-13091 > URL: https://issues.apache.org/jira/browse/SPARK-13091 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Sameer Agarwal >Assignee: Sameer Agarwal >Priority: Major > Fix For: 2.0.0 > > > We'd want to duplicate constraints when there is an alias (i.e. for "SELECT > a, a AS b", any constraints on a now apply to b) > This is a follow-up task based on [~marmbrus]'s suggestion in > https://docs.google.com/document/d/1WQRgDurUBV9Y6CWOBS75PQIqJwT-6WftVa18xzm7nCo/edit#heading=h.6hjcndo36qze
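The description's `SELECT a, a AS b` example can be modeled in a few lines: any predicate known to hold on `a` also holds on `b`. The function below is a hypothetical illustration of constraint propagation, not Catalyst's actual API:

```python
# Toy illustration (invented names, not Catalyst's API) of propagating
# constraints across aliases: for "SELECT a, a AS b", every constraint on
# the source column is duplicated for the alias.
def propagate_constraints(constraints, aliases):
    """constraints: set of (column, predicate) pairs; aliases: {alias: source}."""
    out = set(constraints)
    for alias, source in aliases.items():
        for col, pred in constraints:
            if col == source:
                out.add((alias, pred))
    return out

result = propagate_constraints({("a", "IS NOT NULL")}, {"b": "a"})
print(sorted(result))  # constraints now cover both 'a' and 'b'
```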
[jira] [Resolved] (SPARK-26928) Add driver CPU Time to the metrics system
[ https://issues.apache.org/jira/browse/SPARK-26928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-26928. Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23838 [https://github.com/apache/spark/pull/23838] > Add driver CPU Time to the metrics system > - > > Key: SPARK-26928 > URL: https://issues.apache.org/jira/browse/SPARK-26928 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Luca Canali >Assignee: Luca Canali >Priority: Minor > Fix For: 3.0.0 > > > This proposes to add instrumentation for the driver's JVM CPU time via the > Spark Dropwizard/Codahale metrics system. It follows directly from previous > work SPARK-25228 and shares similar motivations: it is intended as an > improvement to be used for Spark performance dashboards and monitoring > tools/instrumentation. > Additionally this proposes a new configuration parameter > `spark.metrics.cpu.time.driver.enabled` (default: false) that can be used to > turn on the new feature.
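The idea behind the new metric is a gauge that reports cumulative process CPU time. Spark's implementation is a Dropwizard gauge reading the driver JVM's CPU time; the following is only an analogous sketch in Python using `os.times()`:

```python
# Minimal sketch of the idea behind the new driver metric: a gauge that
# reports cumulative process CPU time on each poll. Spark's actual code is
# a Dropwizard/Codahale gauge in the JVM; this is an analogous toy only.
import os

class CpuTimeGauge:
    def value(self):
        t = os.times()
        # user + system CPU time consumed by this process, in seconds
        return t.user + t.system

gauge = CpuTimeGauge()
print(gauge.value() >= 0.0)
```

A monitoring sink then polls `value()` periodically, which is what makes the metric useful for performance dashboards.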
[jira] [Assigned] (SPARK-26928) Add driver CPU Time to the metrics system
[ https://issues.apache.org/jira/browse/SPARK-26928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-26928: -- Assignee: Luca Canali > Add driver CPU Time to the metrics system > - > > Key: SPARK-26928 > URL: https://issues.apache.org/jira/browse/SPARK-26928 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Luca Canali >Assignee: Luca Canali >Priority: Minor > > This proposes to add instrumentation for the driver's JVM CPU time via the > Spark Dropwizard/Codahale metrics system. It follows directly from previous > work SPARK-25228 and shares similar motivations: it is intended as an > improvement to be used for Spark performance dashboards and monitoring > tools/instrumentation. > Additionally this proposes a new configuration parameter > `spark.metrics.cpu.time.driver.enabled` (default: false) that can be used to > turn on the new feature.
[jira] [Resolved] (SPARK-27012) Storage tab shows rdd details even after executor ended
[ https://issues.apache.org/jira/browse/SPARK-27012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-27012. Resolution: Fixed Assignee: Ajith S Fix Version/s: 3.0.0 > Storage tab shows rdd details even after executor ended > --- > > Key: SPARK-27012 > URL: https://issues.apache.org/jira/browse/SPARK-27012 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.3.3, 3.0.0 >Reporter: Ajith S >Assignee: Ajith S >Priority: Major > Fix For: 3.0.0 > > > After we cache a table, we can see its details in the Storage tab of the Spark UI. If > the executor has shut down (graceful shutdown / dynamic allocation scenario), the UI > still shows the RDD as cached, and clicking the link throws an error. > This is because, on the executor-removed event, we fail to adjust the RDD partition > details.
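The fix amounts to bookkeeping: when an executor is removed, its block replicas must be dropped from the RDD's partition-to-executor map, otherwise the Storage tab keeps reporting stale cache state. A toy model of that adjustment (not Spark's actual `AppStatusListener` code):

```python
# Toy model of the fix: on an executor-removed event, drop that executor's
# replicas from the RDD's partition bookkeeping; partitions with no
# surviving replica disappear, so the UI no longer shows them as cached.
def remove_executor(partition_locations, executor_id):
    """partition_locations: {partition: set(executor_ids)}; returns surviving map."""
    survivors = {}
    for part, execs in partition_locations.items():
        remaining = execs - {executor_id}
        if remaining:
            survivors[part] = remaining
    return survivors

locs = {0: {"exec-1"}, 1: {"exec-1", "exec-2"}}
print(remove_executor(locs, "exec-1"))  # {1: {'exec-2'}}
```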
[jira] [Resolved] (SPARK-27059) spark-submit on kubernetes cluster does not recognise k8s --master property
[ https://issues.apache.org/jira/browse/SPARK-27059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-27059. Resolution: Invalid > spark-submit on kubernetes cluster does not recognise k8s --master property > --- > > Key: SPARK-27059 > URL: https://issues.apache.org/jira/browse/SPARK-27059 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.3, 2.4.0 >Reporter: Andreas Adamides >Priority: Blocker > > I have successfully installed a Kubernetes cluster and can verify this by: > {{C:\windows\system32>kubectl cluster-info }} > {{*Kubernetes master is running at https://:* }} > *{{KubeDNS is running at > https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}* > Trying to run the SparkPi with the Spark release I downloaded from > [https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3) > *{{spark-submit --master k8s://https://: --deploy-mode cluster > --name spark-pi --class org.apache.spark.examples.SparkPi --conf > spark.executor.instances=2 --conf > spark.kubernetes.container.image=gettyimages/spark > c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}* > I am getting this error: > *{{Error: Master must either be yarn or start with spark, mesos, local Run > with --help for usage help or --verbose for debug output}}* > I also tried: > *{{spark-submit --help}}* > to see what I can get regarding the *--master* property. This is what I get: > *{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or > local.}}* > > According to the documentation > [[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on > running Spark workloads in Kubernetes, spark-submit does not even seem to > recognise the k8s value for master. 
[ included in possible Spark masters: > [https://spark.apache.org/docs/latest/submitting-applications.html#master-urls] > ] >
[jira] [Commented] (SPARK-27059) spark-submit on kubernetes cluster does not recognise k8s --master property
[ https://issues.apache.org/jira/browse/SPARK-27059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784762#comment-16784762 ] Marcelo Vanzin commented on SPARK-27059: Sounds like a problem with your system. Maybe your PATH has the wrong {{spark-submit}} in it. > spark-submit on kubernetes cluster does not recognise k8s --master property > --- > > Key: SPARK-27059 > URL: https://issues.apache.org/jira/browse/SPARK-27059 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.3, 2.4.0 >Reporter: Andreas Adamides >Priority: Blocker > > I have successfully installed a Kubernetes cluster and can verify this by: > {{C:\windows\system32>kubectl cluster-info }} > {{*Kubernetes master is running at https://:* }} > *{{KubeDNS is running at > https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}* > Trying to run the SparkPi with the Spark release I downloaded from > [https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3) > *{{spark-submit --master k8s://https://: --deploy-mode cluster > --name spark-pi --class org.apache.spark.examples.SparkPi --conf > spark.executor.instances=2 --conf > spark.kubernetes.container.image=gettyimages/spark > c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}* > I am getting this error: > *{{Error: Master must either be yarn or start with spark, mesos, local Run > with --help for usage help or --verbose for debug output}}* > I also tried: > *{{spark-submit --help}}* > to see what I can get regarding the *--master* property. This is what I get: > *{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or > local.}}* > > According to the documentation > [[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on > running Spark workloads in Kubernetes, spark-submit does not even seem to > recognise the k8s value for master. 
[ included in possible Spark masters: > [https://spark.apache.org/docs/latest/submitting-applications.html#master-urls] > ] >
[jira] [Resolved] (SPARK-27067) SPIP: Catalog API for table metadata
[ https://issues.apache.org/jira/browse/SPARK-27067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved SPARK-27067. --- Resolution: Fixed I'm resolving this issue because the vote to adopt the proposal passed. I've added links to the google doc proposal (now view-only) and vote thread, and uploaded a copy of the proposal as a PDF. > SPIP: Catalog API for table metadata > > > Key: SPARK-27067 > URL: https://issues.apache.org/jira/browse/SPARK-27067 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > Labels: SPIP > Attachments: SPIP_ Spark API for Table Metadata.pdf > > > Goal: Define a catalog API to create, alter, load, and drop tables
[jira] [Commented] (SPARK-27059) spark-submit on kubernetes cluster does not recognise k8s --master property
[ https://issues.apache.org/jira/browse/SPARK-27059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784758#comment-16784758 ] Andreas Adamides commented on SPARK-27059: -- Indeed, when in spark 2.4.0 and 2.3.3 running *spark-submit --version* returns "version 2.2.1" (as well as spark-shell) So if not from the official Spark Download Page, where would I download the latest advertised spark version that supports Kubernetes. > spark-submit on kubernetes cluster does not recognise k8s --master property > --- > > Key: SPARK-27059 > URL: https://issues.apache.org/jira/browse/SPARK-27059 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.3, 2.4.0 >Reporter: Andreas Adamides >Priority: Blocker > > I have successfully installed a Kubernetes cluster and can verify this by: > {{C:\windows\system32>kubectl cluster-info }} > {{*Kubernetes master is running at https://:* }} > *{{KubeDNS is running at > https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}* > Trying to run the SparkPi with the Spark release I downloaded from > [https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3) > *{{spark-submit --master k8s://https://: --deploy-mode cluster > --name spark-pi --class org.apache.spark.examples.SparkPi --conf > spark.executor.instances=2 --conf > spark.kubernetes.container.image=gettyimages/spark > c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}* > I am getting this error: > *{{Error: Master must either be yarn or start with spark, mesos, local Run > with --help for usage help or --verbose for debug output}}* > I also tried: > *{{spark-submit --help}}* > to see what I can get regarding the *--master* property. 
This is what I get: > *{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or > local.}}* > > According to the documentation > [[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on > running Spark workloads in Kubernetes, spark-submit does not even seem to > recognise the k8s value for master. [ included in possible Spark masters: > [https://spark.apache.org/docs/latest/submitting-applications.html#master-urls] > ] >
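The comment above pins down the cause: a stale pre-2.3 `spark-submit` earlier on the PATH rejects `k8s://` masters, which only exist from Spark 2.3 onward. A toy validator mirroring the reported error (an invented helper, not Spark's actual parser):

```python
# Toy check (not Spark's real master-URL parser) mirroring the error: a
# pre-2.3 spark-submit recognizes only spark/mesos/yarn/local schemes, so a
# stale binary on PATH rejects "k8s://..." exactly as the reporter saw.
RECOGNIZED = ("spark://", "mesos://", "yarn", "local", "k8s://")

def master_recognized(url, supports_k8s=True):
    schemes = RECOGNIZED if supports_k8s else RECOGNIZED[:-1]
    return any(url == s or url.startswith(s) for s in schemes)

print(master_recognized("k8s://https://host:6443", supports_k8s=False))  # False: old binary
print(master_recognized("k8s://https://host:6443", supports_k8s=True))   # True: 2.3+
```

Running `spark-submit --version` (as done above, which printed 2.2.1) is the quickest way to confirm which binary the shell is actually resolving.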
[jira] [Updated] (SPARK-27067) SPIP: Catalog API for table metadata
[ https://issues.apache.org/jira/browse/SPARK-27067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-27067: -- Attachment: SPIP_ Spark API for Table Metadata.pdf > SPIP: Catalog API for table metadata > > > Key: SPARK-27067 > URL: https://issues.apache.org/jira/browse/SPARK-27067 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > Labels: SPIP > Attachments: SPIP_ Spark API for Table Metadata.pdf > >
[jira] [Updated] (SPARK-27066) SPIP: Identifiers for multi-catalog support
[ https://issues.apache.org/jira/browse/SPARK-27066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-27066: -- Description: Goals: * Propose semantics for identifiers and a listing API to support multiple catalogs ** Support any namespace scheme used by an external catalog ** Avoid traversing namespaces via multiple listing calls from Spark * Outline migration from the current behavior to Spark with multiple catalogs > SPIP: Identifiers for multi-catalog support > --- > > Key: SPARK-27066 > URL: https://issues.apache.org/jira/browse/SPARK-27066 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > Labels: SPIP > Attachments: SPIP_ Identifiers for multi-catalog Spark.pdf > > > Goals: > * Propose semantics for identifiers and a listing API to support multiple > catalogs > ** Support any namespace scheme used by an external catalog > ** Avoid traversing namespaces via multiple listing calls from Spark > * Outline migration from the current behavior to Spark with multiple catalogs -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
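The listed goals can be made concrete with a small sketch of multi-part identifier resolution: the leading part may name a catalog, and the remainder is a namespace (of any depth the external catalog uses) plus a table name. The names below are invented for illustration, not the SPIP's actual rules:

```python
# Hypothetical sketch of resolving a multi-part identifier in a
# multi-catalog world: the first part may name a registered catalog, the
# rest is namespace + table. Invented for illustration only.
def parse_identifier(parts, catalogs):
    """parts: list like ["prod", "db", "tbl"]; catalogs: known catalog names."""
    if len(parts) > 1 and parts[0] in catalogs:
        catalog, rest = parts[0], parts[1:]
    else:
        catalog, rest = "default", parts
    return catalog, tuple(rest[:-1]), rest[-1]

print(parse_identifier(["prod", "db", "tbl"], {"prod"}))  # ('prod', ('db',), 'tbl')
print(parse_identifier(["db", "tbl"], {"prod"}))          # ('default', ('db',), 'tbl')
```

Note how the second call illustrates the migration concern: identifiers without a catalog part must keep resolving against a default catalog so existing queries do not break.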
[jira] [Created] (SPARK-27066) SPIP: Identifiers for multi-catalog support
Ryan Blue created SPARK-27066: - Summary: SPIP: Identifiers for multi-catalog support Key: SPARK-27066 URL: https://issues.apache.org/jira/browse/SPARK-27066 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Ryan Blue
[jira] [Updated] (SPARK-27067) SPIP: Catalog API for table metadata
[ https://issues.apache.org/jira/browse/SPARK-27067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-27067: -- Description: Goal: Define a catalog API to create, alter, load, and drop tables > SPIP: Catalog API for table metadata > > > Key: SPARK-27067 > URL: https://issues.apache.org/jira/browse/SPARK-27067 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > Labels: SPIP > Attachments: SPIP_ Spark API for Table Metadata.pdf > > > Goal: Define a catalog API to create, alter, load, and drop tables
[jira] [Created] (SPARK-27067) SPIP: Catalog API for table metadata
Ryan Blue created SPARK-27067: - Summary: SPIP: Catalog API for table metadata Key: SPARK-27067 URL: https://issues.apache.org/jira/browse/SPARK-27067 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Ryan Blue
[jira] [Resolved] (SPARK-27066) SPIP: Identifiers for multi-catalog support
[ https://issues.apache.org/jira/browse/SPARK-27066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved SPARK-27066. --- Resolution: Fixed I'm resolving this issue because the vote to adopt the proposal passed. I've added links to the google doc proposal (now view-only) and vote thread, and uploaded a copy of the proposal as a PDF. > SPIP: Identifiers for multi-catalog support > --- > > Key: SPARK-27066 > URL: https://issues.apache.org/jira/browse/SPARK-27066 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > Labels: SPIP > Attachments: SPIP_ Identifiers for multi-catalog Spark.pdf > >
[jira] [Commented] (SPARK-23521) SPIP: Standardize SQL logical plans with DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-23521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784736#comment-16784736 ] Ryan Blue commented on SPARK-23521: --- I've turned off commenting on the google doc to preserve its state, with the existing comments. I'm also adding a PDF of the final proposal to this issue. > SPIP: Standardize SQL logical plans with DataSourceV2 > - > > Key: SPARK-23521 > URL: https://issues.apache.org/jira/browse/SPARK-23521 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > Labels: SPIP > Attachments: SPIP_ Standardize logical plans.pdf > > > Executive Summary: This SPIP is based on [discussion about the DataSourceV2 > implementation|https://lists.apache.org/thread.html/55676ec1f5039d3deaf347d391cf82fe8574b8fa4eeab70110ed5b2b@%3Cdev.spark.apache.org%3E] > on the dev list. The proposal is to standardize the logical plans used for > write operations to make the planner more maintainable and to make Spark's > write behavior predictable and reliable. It proposes the following principles: > # Use well-defined logical plan nodes for all high-level operations: insert, > create, CTAS, overwrite table, etc. > # Use planner rules that match on these high-level nodes, so that it isn’t > necessary to create rules to match each eventual code path individually. > # Clearly define Spark’s behavior for these logical plan nodes. Physical > nodes should implement that behavior so that all code paths eventually make > the same guarantees. > # Specialize implementation when creating a physical plan, not logical > plans. This will avoid behavior drift and ensure planner code is shared > across physical implementations. > The SPIP doc presents a small but complete set of those high-level logical > operations, most of which are already defined in SQL or implemented by some > write path in Spark. 
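The SPIP's first two principles (dedicated logical nodes for each high-level operation, and planner rules that match on those node types rather than on individual code paths) can be sketched as follows; the node and rule names are invented, not Catalyst classes:

```python
# Toy model (invented node names, not Catalyst classes) of the SPIP's first
# two principles: each high-level operation gets its own logical node, and
# one planner rule matches that node type, so every code path producing the
# node gets identical, predictable behavior.
from dataclasses import dataclass

@dataclass
class CreateTableAsSelect:
    table: str
    query: str

@dataclass
class AppendData:
    table: str
    query: str

def plan(node):
    # One rule per well-defined logical node.
    if isinstance(node, CreateTableAsSelect):
        return f"CreateTableExec({node.table}) + AppendDataExec"
    if isinstance(node, AppendData):
        return "AppendDataExec"
    raise NotImplementedError(type(node).__name__)

print(plan(CreateTableAsSelect("t", "SELECT 1")))
```

Specialization then happens when choosing the physical node, not by introducing new logical plans per data source, which is how behavior drift is avoided.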
[jira] [Updated] (SPARK-27066) SPIP: Identifiers for multi-catalog support
[ https://issues.apache.org/jira/browse/SPARK-27066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-27066: -- Attachment: SPIP_ Identifiers for multi-catalog Spark.pdf > SPIP: Identifiers for multi-catalog support > --- > > Key: SPARK-27066 > URL: https://issues.apache.org/jira/browse/SPARK-27066 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > Labels: SPIP > Attachments: SPIP_ Identifiers for multi-catalog Spark.pdf > >
[jira] [Updated] (SPARK-23521) SPIP: Standardize SQL logical plans with DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-23521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-23521: -- Attachment: SPIP_ Standardize logical plans.pdf > SPIP: Standardize SQL logical plans with DataSourceV2 > - > > Key: SPARK-23521 > URL: https://issues.apache.org/jira/browse/SPARK-23521 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > Labels: SPIP > Attachments: SPIP_ Standardize logical plans.pdf > > > Executive Summary: This SPIP is based on [discussion about the DataSourceV2 > implementation|https://lists.apache.org/thread.html/55676ec1f5039d3deaf347d391cf82fe8574b8fa4eeab70110ed5b2b@%3Cdev.spark.apache.org%3E] > on the dev list. The proposal is to standardize the logical plans used for > write operations to make the planner more maintainable and to make Spark's > write behavior predictable and reliable. It proposes the following principles: > # Use well-defined logical plan nodes for all high-level operations: insert, > create, CTAS, overwrite table, etc. > # Use planner rules that match on these high-level nodes, so that it isn’t > necessary to create rules to match each eventual code path individually. > # Clearly define Spark’s behavior for these logical plan nodes. Physical > nodes should implement that behavior so that all code paths eventually make > the same guarantees. > # Specialize implementation when creating a physical plan, not logical > plans. This will avoid behavior drift and ensure planner code is shared > across physical implementations. > The SPIP doc presents a small but complete set of those high-level logical > operations, most of which are already defined in SQL or implemented by some > write path in Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26602) Subsequent queries are failing after querying the UDF which is loaded with wrong hdfs path
[ https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chakravarthi updated SPARK-26602: - Summary: Subsequent queries are failing after querying the UDF which is loaded with wrong hdfs path (was: Insert into table fails after querying the UDF which is loaded with wrong hdfs path) > Subsequent queries are failing after querying the UDF which is loaded with > wrong hdfs path > -- > > Key: SPARK-26602 > URL: https://issues.apache.org/jira/browse/SPARK-26602 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Haripriya >Priority: Major > Attachments: beforeFixUdf.txt > > > In SQL, > 1. Query an existing UDF (say myFunc1) > 2. Create and select a UDF registered with an incorrect path (say myFunc2) > 3. Now query the existing UDF again in the same session - this will throw an exception > stating that it couldn't read the resource at myFunc2's path > 4. Even basic operations like insert and select will fail with the same > error > Result: > java.lang.RuntimeException: Failed to read external resource > hdfs:///tmp/hari_notexists1/two_udfs.jar > at > org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288) > at > org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149) > at > org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275) > at > 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696) > at > org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841) > at > org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
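The failure mode described above — one UDF registered with a bad path poisoning every later statement in the session — can be modeled without Hive: the bad resource path stays in session state, and each subsequent command re-resolves all session resources and hits the same download failure. A hypothetical sketch (the Session class and its methods are invented for illustration, not Spark's code):

```python
# Hypothetical model of the reported behavior (not Spark/Hive code): once a
# bad resource path enters session state, every later command re-resolves all
# session resources and fails on the bad one, even unrelated INSERT/SELECTs.
class Session:
    def __init__(self):
        self.resources = []

    def add_resource(self, path):
        # The path is recorded before the download is attempted, so a bad
        # path remains in session state even after the add fails.
        self.resources.append(path)
        self._resolve_all()

    def run(self, sql):
        self._resolve_all()          # resources are re-resolved per command
        return f"OK: {sql}"

    def _resolve_all(self):
        for path in self.resources:
            if "notexists" in path:  # stand-in for a failed HDFS download
                raise RuntimeError(f"Failed to read external resource {path}")

s = Session()
print(s.run("SELECT myFunc1(col) FROM t"))   # works before the bad ADD JAR
try:
    s.add_resource("hdfs:///tmp/hari_notexists1/two_udfs.jar")
except RuntimeError:
    pass
try:
    s.run("INSERT INTO t VALUES (1)")        # now even inserts fail
except RuntimeError as e:
    print(e)
```

A fix along these lines would either avoid recording the resource until it resolves, or drop it from session state once resolution fails.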
[jira] [Comment Edited] (SPARK-27063) Spark on K8S Integration Tests timeouts are too short for some test clusters
[ https://issues.apache.org/jira/browse/SPARK-27063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784714#comment-16784714 ] Stavros Kontopoulos edited comment on SPARK-27063 at 3/5/19 5:53 PM: - Yes, another thing I noticed is that pulling the images may take time, so tests will expire (if you don't use the local daemon to build stuff for whatever reason). Also, in this [PR|https://github.com/apache/spark/pull/23514] I set the patience differently, because some tests may run too fast, for better or worse. was (Author: skonto): Yes some other thing that I noticed is when the images are pulled this may take time and tests will expire. Also in this [PR|https://github.com/apache/spark/pull/23514] I set patience differently because some tests may run too fast for good or bad. > Spark on K8S Integration Tests timeouts are too short for some test clusters > > > Key: SPARK-27063 > URL: https://issues.apache.org/jira/browse/SPARK-27063 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Rob Vesse >Priority: Minor > > As noted during development for SPARK-26729 there are a couple of integration > test timeouts that are too short when running on slower clusters, e.g. > developers' laptops, small CI clusters, etc. > [~skonto] confirmed that he has also experienced this behaviour in the > discussion on PR [PR > 23846|https://github.com/apache/spark/pull/23846#discussion_r262564938] > We should raise the defaults of these timeouts as an initial step and, longer term, > consider making the timeouts themselves configurable -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26727) CREATE OR REPLACE VIEW query fails with TableAlreadyExistsException
[ https://issues.apache.org/jira/browse/SPARK-26727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784666#comment-16784666 ] Ajith S commented on SPARK-26727: - [~rigolaszlo] I see from the stack trace that ThriftHiveMetastore$Client is used, which is a synchronous client for the metastore. Can you explain how you found that the drop command is async? > CREATE OR REPLACE VIEW query fails with TableAlreadyExistsException > --- > > Key: SPARK-26727 > URL: https://issues.apache.org/jira/browse/SPARK-26727 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Srinivas Yarra >Priority: Major > > We experienced that sometimes the Hive query "CREATE OR REPLACE VIEW <view name> AS SELECT <columns> FROM <table>" fails with the following exception: > {code:java} > // code placeholder > org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or > view '' already exists in database 'default'; at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:314) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:165) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195) at > org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195) at > org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at 
org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364) at > org.apache.spark.sql.Dataset.(Dataset.scala:195) at > org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:80) at > org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642) ... 49 elided > {code} > {code} > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res1: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res2: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res3: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res4: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res5: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res6: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res7: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res8: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res9: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res10: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res11: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") > org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or > view 'testsparkreplace' already exists in database 'default'; at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:246) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:236) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:236) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) > at > org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:236) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:319) > at >
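For context on the sync/async question: one generic way a CREATE OR REPLACE statement can intermittently fail with "already exists" is if the replace path is a non-atomic check, drop, then create, and the create runs before the drop's effect is visible. The sketch below models that race generically in Python; it is not Spark's actual implementation and does not assert this is the cause here:

```python
# Generic check-then-act race model (not Spark's actual code). If "replace"
# is implemented as separate drop and create steps, a drop whose effect is
# not yet visible lets the create collide with the old definition.
class Catalog:
    def __init__(self):
        self.tables = set()

    def drop(self, name, applied=True):
        if applied:            # applied=False models a drop not yet visible
            self.tables.discard(name)

    def create(self, name):
        if name in self.tables:
            raise RuntimeError(f"Table or view '{name}' already exists")
        self.tables.add(name)

def create_or_replace(cat, name, drop_visible=True):
    if name in cat.tables:
        cat.drop(name, applied=drop_visible)
    cat.create(name)

cat = Catalog()
create_or_replace(cat, "testsparkreplace")                      # creates
create_or_replace(cat, "testsparkreplace")                      # replaces
try:
    create_or_replace(cat, "testsparkreplace", drop_visible=False)
except RuntimeError as e:
    print(e)   # Table or view 'testsparkreplace' already exists
```

With a synchronous metastore client, as the comment above notes, this window should not exist, which is why the reporter's repro (usually succeeding, occasionally failing) is the interesting data point.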
[jira] [Commented] (SPARK-27059) spark-submit on kubernetes cluster does not recognise k8s --master property
[ https://issues.apache.org/jira/browse/SPARK-27059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784674#comment-16784674 ] Marcelo Vanzin commented on SPARK-27059: You're most probably using a version of Spark that does not support k8s. Try {{spark-submit --version}}. If it's 2.3 or later, check whether "spark-kubernetes*.jar" exists in the {{$SPARK_HOME/jars}} directory. > spark-submit on kubernetes cluster does not recognise k8s --master property > --- > > Key: SPARK-27059 > URL: https://issues.apache.org/jira/browse/SPARK-27059 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.3, 2.4.0 >Reporter: Andreas Adamides >Priority: Blocker > > I have successfully installed a Kubernetes cluster and can verify this by: > {{C:\windows\system32>kubectl cluster-info }} > {{*Kubernetes master is running at https://:* }} > *{{KubeDNS is running at > https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}* > Trying to run the SparkPi with the Spark release I downloaded from > [https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3) > *{{spark-submit --master k8s://https://: --deploy-mode cluster > --name spark-pi --class org.apache.spark.examples.SparkPi --conf > spark.executor.instances=2 --conf > spark.kubernetes.container.image=gettyimages/spark > c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}* > I am getting this error: > *{{Error: Master must either be yarn or start with spark, mesos, local Run > with --help for usage help or --verbose for debug output}}* > I also tried: > *{{spark-submit --help}}* > to see what I can get regarding the *--master* property. 
This is what I get: > *{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or > local.}}* > > According to the documentation > [[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on > running Spark workloads in Kubernetes, spark-submit does not even seem to > recognise the k8s value for master. [ included in possible Spark masters: > [https://spark.apache.org/docs/latest/submitting-applications.html#master-urls] > ] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
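For context, the quoted error comes from spark-submit's validation of the --master URL; a Spark build without the Kubernetes module simply has no k8s:// case in that check, which is why Marcelo's suggestion is to verify the version and the presence of the spark-kubernetes jar. A simplified, illustrative model of such a prefix check (not the real SparkSubmit code):

```python
# Simplified model of spark-submit's --master validation (illustrative only,
# not the real SparkSubmit code). A build without the spark-kubernetes module
# behaves like k8s_supported=False and rejects k8s:// URLs outright with
# "Master must either be yarn or start with spark, mesos, local".
def master_recognised(master, k8s_supported):
    if master == "yarn" or master.startswith(("spark://", "mesos://", "local")):
        return True
    return k8s_supported and master.startswith("k8s://")

url = "k8s://https://192.168.0.10:6443"   # hypothetical API server address
print(master_recognised(url, k8s_supported=True))   # True
print(master_recognised(url, k8s_supported=False))  # False
```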
[jira] [Commented] (SPARK-27065) avoid more than one active task set manager for a stage
[ https://issues.apache.org/jira/browse/SPARK-27065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784654#comment-16784654 ] Apache Spark commented on SPARK-27065: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/23927 > avoid more than one active task set manager for a stage > > > Key: SPARK-27065 > URL: https://issues.apache.org/jira/browse/SPARK-27065 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.3.3, 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27062) CatalogImpl.refreshTable should register query in cache with received tableName
[ https://issues.apache.org/jira/browse/SPARK-27062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] William Wong updated SPARK-27062: - Description: If the _CatalogImpl.refreshTable()_ method is invoked against a cached table, it first uncaches the corresponding query in the shared state cache manager, and then caches it back to refresh the cached copy. However, the table is re-cached with only the 'table name'; the database name is dropped. Therefore, if the cached table is not in the default database, the recreated cache may refer to a different table. For example, the cached table's name shown on the driver's storage page may change after the refresh. Here is the related code on GitHub for reference. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala] {code:java} override def refreshTable(tableName: String): Unit = { val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) val tableMetadata = sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent) val table = sparkSession.table(tableIdent) if (tableMetadata.tableType == CatalogTableType.VIEW) { // Temp or persistent views: refresh (or invalidate) any metadata/data cached // in the plan recursively. table.queryExecution.analyzed.refresh() } else { // Non-temp tables: refresh the metadata cache. sessionCatalog.refreshTable(tableIdent) } // If this table is cached as an InMemoryRelation, drop the original // cached version and make the new version cached lazily. if (isCached(table)) { // Uncache the logicalPlan. sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, blocking = true) // Cache it again. sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) } } {code} CatalogImpl caches the table with the received _tableName_, instead of _tableIdent.table_: {code:java} override def cacheTable(tableName: String): Unit = { sparkSession.sharedState.cacheManager.cacheQuery(sparkSession.table(tableName), Some(tableName)) } {code} Therefore, I would like to propose aligning the behavior. The refreshTable method should reuse the received _tableName_. Here is the proposed change, from {code:java} sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) {code} to {code:java} sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableName)){code} was: If CatalogImpl.refreshTable() method is invoked against a cached table, this method would first uncache corresponding query in the shared state cache manager, and then cache it back to refresh the cache copy. However, the table was recached with only 'table name'. The database name will be missed. Therefore, if cached table is not on the default database, the recreated cache may refer to a different table. For example, we may see the cached table name in driver's storage page will be changed after table refreshing. Here is related code on github for your reference. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala] {code:java} override def refreshTable(tableName: String): Unit = { val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) val tableMetadata = sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent) val table = sparkSession.table(tableIdent) if (tableMetadata.tableType == CatalogTableType.VIEW) { // Temp or persistent views: refresh (or invalidate) any metadata/data cached // in the plan recursively. table.queryExecution.analyzed.refresh() } else { // Non-temp tables: refresh the metadata cache. 
sessionCatalog.refreshTable(tableIdent) } // If this table is cached as an InMemoryRelation, drop the original // cached version and make the new version cached lazily. if (isCached(table)) { // Uncache the logicalPlan. sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, blocking = true) // Cache it again. sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) } } {code} Actually, CatalogImpl cache table with received table name, instead of only the table name. {code:java} override def cacheTable(tableName: String): Unit = { sparkSession.sharedState.cacheManager.cacheQuery(sparkSession.table(tableName), Some(tableName)) } {code} Therefore, I would like to propose aligning the behavior. RefreshTable method should reuse the received tableName. Here is the proposed changes. {code:java} sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) {code} to {code:java} sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableName)) {code} > CatalogImpl.refreshTable should register query in cache with
[jira] [Assigned] (SPARK-27065) avoid more than one active task set manager for a stage
[ https://issues.apache.org/jira/browse/SPARK-27065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27065: Assignee: Wenchen Fan (was: Apache Spark) > avoid more than one active task set manager for a stage > > > Key: SPARK-27065 > URL: https://issues.apache.org/jira/browse/SPARK-27065 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.3.3, 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27065) avoid more than one active task set manager for a stage
[ https://issues.apache.org/jira/browse/SPARK-27065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27065: Assignee: Apache Spark (was: Wenchen Fan) > avoid more than one active task set manager for a stage > > > Key: SPARK-27065 > URL: https://issues.apache.org/jira/browse/SPARK-27065 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.3.3, 2.4.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27065) avoid more than one active task set manager for a stage
Wenchen Fan created SPARK-27065: --- Summary: avoid more than one active task set manager for a stage Key: SPARK-27065 URL: https://issues.apache.org/jira/browse/SPARK-27065 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 2.4.0, 2.3.3 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27064) create StreamingWrite at the beginning of streaming execution
Wenchen Fan created SPARK-27064: --- Summary: create StreamingWrite at the beginning of streaming execution Key: SPARK-27064 URL: https://issues.apache.org/jira/browse/SPARK-27064 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27064) create StreamingWrite at the beginning of streaming execution
[ https://issues.apache.org/jira/browse/SPARK-27064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27064: Assignee: Apache Spark (was: Wenchen Fan) > create StreamingWrite at the beginning of streaming execution > > > Key: SPARK-27064 > URL: https://issues.apache.org/jira/browse/SPARK-27064 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27064) create StreamingWrite at the beginning of streaming execution
[ https://issues.apache.org/jira/browse/SPARK-27064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27064: Assignee: Wenchen Fan (was: Apache Spark) > create StreamingWrite at the beginning of streaming execution > > > Key: SPARK-27064 > URL: https://issues.apache.org/jira/browse/SPARK-27064 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27062) CatalogImpl.refreshTable should register query in cache with received tableName
[ https://issues.apache.org/jira/browse/SPARK-27062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] William Wong updated SPARK-27062: - Description: If CatalogImpl.refreshTable() method is invoked against a cached table, this method would first uncache corresponding query in the shared state cache manager, and then cache it back to refresh the cache copy. However, the table was recached with only 'table name'. The database name will be missed. Therefore, if cached table is not on the default database, the recreated cache may refer to a different table. For example, we may see the cached table name in driver's storage page will be changed after table refreshing. Here is related code on github for your reference. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala] {code:java} override def refreshTable(tableName: String): Unit = { val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) val tableMetadata = sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent) val table = sparkSession.table(tableIdent) if (tableMetadata.tableType == CatalogTableType.VIEW) { // Temp or persistent views: refresh (or invalidate) any metadata/data cached // in the plan recursively. table.queryExecution.analyzed.refresh() } else { // Non-temp tables: refresh the metadata cache. sessionCatalog.refreshTable(tableIdent) } // If this table is cached as an InMemoryRelation, drop the original // cached version and make the new version cached lazily. if (isCached(table)) { // Uncache the logicalPlan. sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, blocking = true) // Cache it again. sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) } } {code} Actually, CatalogImpl cache table with received table name, instead of only the table name. 
{code:java} override def cacheTable(tableName: String): Unit = { sparkSession.sharedState.cacheManager.cacheQuery(sparkSession.table(tableName), Some(tableName)) } {code} Therefore, I would like to propose aligning the behavior. RefreshTable method should reuse the received tableName. Here is the proposed changes. {code:java} sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) {code} to {code:java} sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableName)) {code} was: If CatalogImpl.refreshTable() method is invoked against a cached table, this method would first uncache corresponding query in the shared state cache manager, and then cache it back to refresh the cache copy. However, the table was recached with only 'table name'. The database name will be missed. Therefore, if cached table is not on the default database, the recreated cache may refer to a different table. For example, we may see the cached table name in driver's storage page will be changed after table refreshing. Here is related code on github for your reference. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala] {code:java} override def refreshTable(tableName: String): Unit = { val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) val tableMetadata = sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent) val table = sparkSession.table(tableIdent) if (tableMetadata.tableType == CatalogTableType.VIEW) { // Temp or persistent views: refresh (or invalidate) any metadata/data cached // in the plan recursively. table.queryExecution.analyzed.refresh() } else { // Non-temp tables: refresh the metadata cache. sessionCatalog.refreshTable(tableIdent) } // If this table is cached as an InMemoryRelation, drop the original // cached version and make the new version cached lazily. if (isCached(table)) { // Uncache the logicalPlan. 
sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, blocking = true) // Cache it again. sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) } } {code} In Spark SQL module, the database name is registered together with table name when "CACHE TABLE" command was executed. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/cache.scala] and CatalogImpl register cache with received table name. {code:java} override def cacheTable(tableName: String): Unit = { sparkSession.sharedState.cacheManager.cacheQuery(sparkSession.table(tableName), Some(tableName)) } {code} Therefore, I would like to propose aligning the behavior. RefreshTable method should reuse the received table name instead. {code:java} sparkSession.sharedState.cacheManager.cacheQuery(table,
[jira] [Assigned] (SPARK-27063) Spark on K8S Integration Tests timeouts are too short for some test clusters
[ https://issues.apache.org/jira/browse/SPARK-27063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27063: Assignee: Apache Spark > Spark on K8S Integration Tests timeouts are too short for some test clusters > > > Key: SPARK-27063 > URL: https://issues.apache.org/jira/browse/SPARK-27063 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Rob Vesse >Assignee: Apache Spark >Priority: Minor > > As noted during development for SPARK-26729 there are a couple of integration > test timeouts that are too short when running on slower clusters, e.g. > developers' laptops, small CI clusters, etc. > [~skonto] confirmed that he has also experienced this behaviour in the > discussion on PR [PR > 23846|https://github.com/apache/spark/pull/23846#discussion_r262564938] > We should raise the defaults of these timeouts as an initial step and, longer term, > consider making the timeouts themselves configurable -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27062) CatalogImpl.refreshTable should register query in cache with received tableName
[ https://issues.apache.org/jira/browse/SPARK-27062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] William Wong updated SPARK-27062: - Summary: CatalogImpl.refreshTable should register query in cache with received tableName (was: Refresh Table command register table with table name only)
> CatalogImpl.refreshTable should register query in cache with received
> tableName
> ---
>
> Key: SPARK-27062
> URL: https://issues.apache.org/jira/browse/SPARK-27062
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.2
> Reporter: William Wong
> Priority: Minor
> Labels: easyfix, pull-request-available
> Original Estimate: 2h
> Remaining Estimate: 2h
>
> If the CatalogImpl.refreshTable() method is invoked against a cached table, it
> first uncaches the corresponding query in the shared-state cache manager and
> then caches it again to refresh the cached copy.
> However, the table is recached with only the table name; the database name is
> dropped. Therefore, if the cached table is not in the default database, the
> recreated cache may refer to a different table. For example, the cached table
> name shown on the driver's storage page may change after the table is
> refreshed.
>
> Here is the related code on GitHub for reference:
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala]
>
> {code:java}
> override def refreshTable(tableName: String): Unit = {
>   val tableIdent =
>     sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
>   val tableMetadata =
>     sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent)
>   val table = sparkSession.table(tableIdent)
>   if (tableMetadata.tableType == CatalogTableType.VIEW) {
>     // Temp or persistent views: refresh (or invalidate) any metadata/data cached
>     // in the plan recursively.
>     table.queryExecution.analyzed.refresh()
>   } else {
>     // Non-temp tables: refresh the metadata cache.
>     sessionCatalog.refreshTable(tableIdent)
>   }
>   // If this table is cached as an InMemoryRelation, drop the original
>   // cached version and make the new version cached lazily.
>   if (isCached(table)) {
>     // Uncache the logicalPlan.
>     sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, blocking = true)
>     // Cache it again.
>     sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table))
>   }
> }
> {code}
>
> In the Spark SQL module, the database name is registered together with the
> table name when the "CACHE TABLE" command is executed:
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/cache.scala]
>
> and CatalogImpl registers the cache with the received table name:
> {code:java}
> override def cacheTable(tableName: String): Unit = {
>   sparkSession.sharedState.cacheManager.cacheQuery(sparkSession.table(tableName), Some(tableName))
> }
> {code}
>
> Therefore, I would like to propose aligning the behavior: the refreshTable
> method should reuse the received table name instead, changing
> {code:java}
> sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table))
> {code}
> to
> {code:java}
> sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableName))
> {code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27063) Spark on K8S Integration Tests timeouts are too short for some test clusters
[ https://issues.apache.org/jira/browse/SPARK-27063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27063: Assignee: (was: Apache Spark)
> Spark on K8S Integration Tests timeouts are too short for some test clusters
> ---
>
> Key: SPARK-27063
> URL: https://issues.apache.org/jira/browse/SPARK-27063
> Project: Spark
> Issue Type: Improvement
> Components: Kubernetes
> Affects Versions: 2.4.0
> Reporter: Rob Vesse
> Priority: Minor
>
> As noted during development for SPARK-26729, there are a couple of integration
> test timeouts that are too short when running on slower clusters, e.g.
> developers' laptops, small CI clusters, etc.
> [~skonto] confirmed that he has also experienced this behaviour in the
> discussion on [PR 23846|https://github.com/apache/spark/pull/23846#discussion_r262564938]
> We should raise the defaults for these timeouts as an initial step and, longer
> term, consider making the timeouts themselves configurable.
[jira] [Updated] (SPARK-27062) Refresh Table command register table with table name only
[ https://issues.apache.org/jira/browse/SPARK-27062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] William Wong updated SPARK-27062: - Priority: Minor (was: Major)
> Refresh Table command register table with table name only
> ---
>
> Key: SPARK-27062
> URL: https://issues.apache.org/jira/browse/SPARK-27062
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.2
> Reporter: William Wong
> Priority: Minor
> Labels: easyfix, pull-request-available
> Original Estimate: 2h
> Remaining Estimate: 2h
[jira] [Updated] (SPARK-27062) Refresh Table command register table with table name only
[ https://issues.apache.org/jira/browse/SPARK-27062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] William Wong updated SPARK-27062: - Labels: easyfix pull-request-available (was: easyfix)
> Refresh Table command register table with table name only
> ---
>
> Key: SPARK-27062
> URL: https://issues.apache.org/jira/browse/SPARK-27062
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.2
> Reporter: William Wong
> Priority: Major
> Labels: easyfix, pull-request-available
> Original Estimate: 2h
> Remaining Estimate: 2h
[jira] [Commented] (SPARK-26602) Insert into table fails after querying the UDF which is loaded with wrong hdfs path
[ https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784611#comment-16784611 ] Ajith S commented on SPARK-26602: - # I have a question about this issue in the thrift-server case. If an admin does an add jar with a non-existent jar (perhaps a human error), it causes all ongoing beeline sessions to fail (even a query where the jar is not needed at all), and the only way to recover is a restart of the thrift-server. # As you said, "If a user adds something to the classpath, it matters to the whole classpath. If it's missing, I think it's surprising to ignore that fact" - but unless the user actually refers to the jar, is it OK to fail all of their operations (just like JVM behaviour)? Please correct me if I am wrong. cc [~srowen]
> Insert into table fails after querying the UDF which is loaded with wrong
> hdfs path
> ---
>
> Key: SPARK-26602
> URL: https://issues.apache.org/jira/browse/SPARK-26602
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Haripriya
> Priority: Major
> Attachments: beforeFixUdf.txt
>
> In SQL:
> 1. Query an existing UDF (say myFunc1)
> 2. Create and select a UDF registered with an incorrect path (say myFunc2)
> 3. Now query the existing UDF again in the same session - it will throw an
> exception stating that the resource at myFunc2's path couldn't be read
> 4. Even basic operations like insert and select will fail with the same error
> Result:
> java.lang.RuntimeException: Failed to read external resource
> hdfs:///tmp/hari_notexists1/two_udfs.jar
> at org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288)
> at org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242)
> at org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163)
> at org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149)
> at org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67)
> at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737)
> at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706)
> at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
> at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
> at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
> at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
> at org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706)
> at org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696)
> at org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841)
> at org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112)
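The failure mode discussed in the comment above, where one bad ADD JAR poisons every subsequent operation because the session replays its whole resource list, can be sketched with a small stand-in. This is an illustrative model only; `SessionResources`, `addJar`, and `replay` are hypothetical names and do not reflect Spark's or Hive's actual API.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical model of a session's resource list: a bad jar is recorded
// unconditionally, and every later replay fails on it, even for queries
// that never reference that jar.
final class SessionResources {
  private val resources = ListBuffer.empty[String]

  // Records the resource without checking it exists, mirroring the
  // behaviour described in the issue.
  def addJar(path: String): Unit = resources += path

  // Replays all registered resources; fails on the first missing one,
  // which is why unrelated queries start failing after one bad ADD JAR.
  def replay(exists: String => Boolean): Either[String, Int] =
    resources.find(r => !exists(r)) match {
      case Some(bad) => Left(s"Failed to read external resource $bad")
      case None      => Right(resources.size)
    }
}

object ResourceDemo {
  def main(args: Array[String]): Unit = {
    val session = new SessionResources
    session.addJar("hdfs:///tmp/good.jar")
    session.addJar("hdfs:///tmp/missing.jar") // the human error
    val available = Set("hdfs:///tmp/good.jar")
    // Every replay from now on fails, regardless of what the query needs.
    println(session.replay(available.contains))
  }
}
```

The alternative behaviours debated in the thread map onto this sketch directly: fail fast at `addJar` time, or defer the error to the point where the jar is actually referenced, as the JVM does with missing classes.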
[jira] [Assigned] (SPARK-27062) Refresh Table command register table with table name only
[ https://issues.apache.org/jira/browse/SPARK-27062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27062: Assignee: (was: Apache Spark)
> Refresh Table command register table with table name only
> ---
>
> Key: SPARK-27062
> URL: https://issues.apache.org/jira/browse/SPARK-27062
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.2
> Reporter: William Wong
> Priority: Major
> Labels: easyfix
> Original Estimate: 2h
> Remaining Estimate: 2h
[jira] [Assigned] (SPARK-27062) Refresh Table command register table with table name only
[ https://issues.apache.org/jira/browse/SPARK-27062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27062: Assignee: Apache Spark
> Refresh Table command register table with table name only
> ---
>
> Key: SPARK-27062
> URL: https://issues.apache.org/jira/browse/SPARK-27062
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.2
> Reporter: William Wong
> Assignee: Apache Spark
> Priority: Major
> Labels: easyfix
> Original Estimate: 2h
> Remaining Estimate: 2h
[jira] [Comment Edited] (SPARK-26602) Insert into table fails after querying the UDF which is loaded with wrong hdfs path
[ https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784611#comment-16784611 ] Ajith S edited comment on SPARK-26602 at 3/5/19 4:15 PM: - # I have a question about this issue in the thrift-server case. If an admin does an add jar with a non-existent jar (perhaps a human error), it causes all ongoing beeline sessions to fail (even a query where the jar is not needed at all), and the only way to recover is a restart of the thrift-server. # As you said, "If a user adds something to the classpath, it matters to the whole classpath. If it's missing, I think it's surprising to ignore that fact" - but unless the user actually refers to the jar, is it OK to fail all of their operations (just like JVM behaviour: we get a ClassNotFoundException only when the missing class is actually referenced; until then the JVM runs happily)? Please correct me if I am wrong. cc [~srowen]
was (Author: ajithshetty): # I have a question about this issue in thrift-server case. If admin does a add jar with a non-existing jar (may be a human error), it will cause all the ongoing beeline sessions to fail ( even a query where jar is not needed at all). and only way to recover is restart of thrift-server # As you said, "If a user adds something to the classpath, it matters to the whole classpath. If it's missing, I think it's surprising to ignore that fact" - but unless the user refers to the jar, is it ok to fail all of his operations.? (just like JVM behaviour) Please correct me if i am wrong cc [~srowen]
[jira] [Created] (SPARK-27063) Spark on K8S Integration Tests timeouts are too short for some test clusters
Rob Vesse created SPARK-27063: - Summary: Spark on K8S Integration Tests timeouts are too short for some test clusters Key: SPARK-27063 URL: https://issues.apache.org/jira/browse/SPARK-27063 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 2.4.0 Reporter: Rob Vesse
As noted during development for SPARK-26729, there are a couple of integration test timeouts that are too short when running on slower clusters, e.g. developers' laptops, small CI clusters, etc. [~skonto] confirmed that he has also experienced this behaviour in the discussion on [PR 23846|https://github.com/apache/spark/pull/23846#discussion_r262564938]. We should raise the defaults for these timeouts as an initial step and, longer term, consider making the timeouts themselves configurable.
[jira] [Updated] (SPARK-27062) Refresh Table command register table with table name only
[ https://issues.apache.org/jira/browse/SPARK-27062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] William Wong updated SPARK-27062: - Description: If CatalogImpl.refreshTable() method is invoked against a cached table, this method would first uncache corresponding query in the shared state cache manager, and then cache it back to refresh the cache copy. However, the table was recached with only 'table name'. The database name will be missed. Therefore, if cached table is not on the default database, the recreated cache may refer to a different table. For example, we may see the cached table name in driver's storage page will be changed after table refreshing. Here is related code on github for your reference. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala] {code:java} override def refreshTable(tableName: String): Unit = { val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) val tableMetadata = sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent) val table = sparkSession.table(tableIdent) if (tableMetadata.tableType == CatalogTableType.VIEW) { // Temp or persistent views: refresh (or invalidate) any metadata/data cached // in the plan recursively. table.queryExecution.analyzed.refresh() } else { // Non-temp tables: refresh the metadata cache. sessionCatalog.refreshTable(tableIdent) } // If this table is cached as an InMemoryRelation, drop the original // cached version and make the new version cached lazily. if (isCached(table)) { // Uncache the logicalPlan. sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, blocking = true) // Cache it again. sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) } } {code} In Spark SQL module, the database name is registered together with table name when "CACHE TABLE" command was executed. 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/cache.scala] and CatalogImpl register cache with received table name. {code:java} override def cacheTable(tableName: String): Unit = { sparkSession.sharedState.cacheManager.cacheQuery(sparkSession.table(tableName), Some(tableName)) } {code} Therefore, I would like to propose aligning the behavior. RefreshTable method should reuse the received table name instead. {code:java} sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) {code} to {code:java} sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableName)) {code} was: If CatalogImpl.refreshTable() method is invoked against a cached table, this method would first uncache corresponding query in the shared state cache manager, and then cache it back to refresh the cache copy. However, the table was recached with only 'table name'. The database name will be missed. Therefore, if cached table is not on the default database, the recreated cache may refer to a different table. For example, we may see the cached table name in driver's storage page will be changed after table refreshing. Here is related code on github for your reference. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala] {code:java} override def refreshTable(tableName: String): Unit = { val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) val tableMetadata = sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent) val table = sparkSession.table(tableIdent) if (tableMetadata.tableType == CatalogTableType.VIEW) { // Temp or persistent views: refresh (or invalidate) any metadata/data cached // in the plan recursively. table.queryExecution.analyzed.refresh() } else { // Non-temp tables: refresh the metadata cache. 
sessionCatalog.refreshTable(tableIdent) } // If this table is cached as an InMemoryRelation, drop the original // cached version and make the new version cached lazily. if (isCached(table)) { // Uncache the logicalPlan. sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, blocking = true) // Cache it again. sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) } } {code} In Spark SQL module, the database name is registered together with table name when "CACHE TABLE" command was executed. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/cache.scala] Therefore, I would like to propose aligning the behavior. Full table name should also be used in RefreshTable case. We should change the following line in CatalogImpl.refreshTable from {code:java} sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) {code} to {code:java} sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.quotedString)) {code}
[jira] [Created] (SPARK-27062) Refresh Table command register table with table name only
William Wong created SPARK-27062: - Summary: Refresh Table command register table with table name only Key: SPARK-27062 URL: https://issues.apache.org/jira/browse/SPARK-27062 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.2 Reporter: William Wong
If the CatalogImpl.refreshTable() method is invoked against a cached table, it first uncaches the corresponding query in the shared-state cache manager and then caches it again to refresh the cached copy. However, the table is recached with only the table name; the database name is dropped. Therefore, if the cached table is not in the default database, the recreated cache may refer to a different table. For example, the cached table name shown on the driver's storage page may change after the table is refreshed.
Here is the related code on GitHub for reference: [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala]
{code:java}
override def refreshTable(tableName: String): Unit = {
  val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
  val tableMetadata = sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent)
  val table = sparkSession.table(tableIdent)
  if (tableMetadata.tableType == CatalogTableType.VIEW) {
    // Temp or persistent views: refresh (or invalidate) any metadata/data cached
    // in the plan recursively.
    table.queryExecution.analyzed.refresh()
  } else {
    // Non-temp tables: refresh the metadata cache.
    sessionCatalog.refreshTable(tableIdent)
  }
  // If this table is cached as an InMemoryRelation, drop the original
  // cached version and make the new version cached lazily.
  if (isCached(table)) {
    // Uncache the logicalPlan.
    sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, blocking = true)
    // Cache it again.
    sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table))
  }
}
{code}
In the Spark SQL module, the database name is registered together with the table name when the "CACHE TABLE" command is executed: [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/cache.scala] Therefore, I would like to propose aligning the behavior: the full table name should also be used in the RefreshTable case. We should change the following line in CatalogImpl.refreshTable from
{code:java}
sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table))
{code}
to
{code:java}
sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.quotedString))
{code}
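The identifier behaviour this proposal hinges on can be sketched with a minimal, self-contained stand-in. This is a simplified model written for illustration, not Spark's actual `org.apache.spark.sql.catalyst.TableIdentifier`; the point is only to show why registering the cache entry with `.table` alone loses the database qualifier, while a qualified string keeps it.

```scala
// Simplified stand-in for a parsed table identifier (illustrative only).
final case class TableIdentifier(table: String, database: Option[String]) {
  // Fully qualified, back-quoted form, e.g. `sales`.`orders`.
  def quotedString: String =
    database.map(db => s"`$db`.`$table`").getOrElse(s"`$table`")
}

object RefreshDemo {
  def main(args: Array[String]): Unit = {
    val ident = TableIdentifier("orders", Some("sales"))
    // What refreshTable registers today: the bare table name only,
    // indistinguishable from "orders" in any other database.
    println(ident.table)        // orders
    // What the qualified registration would preserve.
    println(ident.quotedString) // `sales`.`orders`
  }
}
```

Either proposed fix, reusing the received `tableName` string or using `tableIdent.quotedString`, keeps the database qualifier in the registered cache name; only the bare `.table` field drops it.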
[jira] [Comment Edited] (SPARK-27036) Even Broadcast thread is timed out, BroadCast Job is not aborted.
[ https://issues.apache.org/jira/browse/SPARK-27036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782840#comment-16782840 ] Sujith Chacko edited comment on SPARK-27036 at 3/5/19 3:49 PM: --- The problem area seems to be BroadcastExchangeExec in the driver, where a particular job is fired as part of a Future and the collected data is broadcast. The main problem is that the system submits the job and its respective stages/tasks through the DAGScheduler, where the scheduler thread schedules the respective events. In BroadcastExchangeExec, when the future times out, the respective exception is thrown, but the jobs/tasks scheduled by the DAGScheduler as part of the action called in the Future are not cancelled. I think we should cancel the respective job to avoid running it in the background even after the Future timeout exception; this would terminate the job promptly when a TimeoutException happens and also save the additional resources utilized even after the timeout exception is thrown from the driver. I want to give an attempt at handling this issue; any comments or suggestions are welcome. cc [~b...@cloudera.com] [~hvanhovell] [~srowen] was (Author: s71955): It seems to be the problem area is BroadcastExchangeExec in driver where as part of Future a particular job will be fired and collected data will be broadcasted. 
The main problem is system will submit the job and its respective stage/tasks through DAGScheduler, where the scheduler thread will schedule the respective events , In BroadcastExchangeExec when future time out happens respective exception will thrown but the jobs/task which is scheduled by the DAGScheduler as part of the action called in future will not be cancelled, I think we shall cancel the respective job to avoid running the same in background even after Future time out exception, this can help to terminate the job promptly when TimeOutException happens, this will also save the additional resources getting utilized even after timeout exception thrown from driver. I want to give an attempt to handle this issue, Any comments suggestions are welcome. cc [~sro...@scient.com] [~b...@cloudera.com] [~hvanhovell] > Even Broadcast thread is timed out, BroadCast Job is not aborted. > - > > Key: SPARK-27036 > URL: https://issues.apache.org/jira/browse/SPARK-27036 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 >Reporter: Babulal >Priority: Minor > Attachments: image-2019-03-04-00-38-52-401.png, > image-2019-03-04-00-39-12-210.png, image-2019-03-04-00-39-38-779.png > > > While the broadcast table job is executing, if a broadcast timeout > (spark.sql.broadcastTimeout) happens, the broadcast job still continues till > completion, whereas it should abort on the broadcast timeout. > The exception is thrown in the console but the Spark job still continues. > > !image-2019-03-04-00-39-38-779.png! > !image-2019-03-04-00-39-12-210.png! > > wait for some time > !image-2019-03-04-00-38-52-401.png! > !image-2019-03-04-00-34-47-884.png! 
> > How to Reproduce Issue > Option1 using SQL:- > create Table t1(Big Table,1M Records) > val rdd1=spark.sparkContext.parallelize(1 to 100,100).map(x=> > ("name_"+x,x%3,x)) > val df=rdd1.toDF.selectExpr("_1 as name","_2 as age","_3 as sal","_1 as > c1","_1 as c2","_1 as c3","_1 as c4","_1 as c5","_1 as c6","_1 as c7","_1 as > c8","_1 as c9","_1 as c10","_1 as c11","_1 as c12","_1 as c13","_1 as > c14","_1 as c15","_1 as c16","_1 as c17","_1 as c18","_1 as c19","_1 as > c20","_1 as c21","_1 as c22","_1 as c23","_1 as c24","_1 as c25","_1 as > c26","_1 as c27","_1 as c28","_1 as c29","_1 as c30") > df.write.csv("D:/data/par1/t4"); > spark.sql("create table csv_2 using csv options('path'='D:/data/par1/t4')"); > create Table t2(Small Table,100K records) > val rdd1=spark.sparkContext.parallelize(1 to 10,100).map(x=> > ("name_"+x,x%3,x)) > val df=rdd1.toDF.selectExpr("_1 as name","_2 as age","_3 as sal","_1 as > c1","_1 as c2","_1 as c3","_1 as c4","_1 as c5","_1 as c6","_1 as c7","_1 as > c8","_1 as c9","_1 as c10","_1 as c11","_1 as c12","_1 as c13","_1 as > c14","_1 as c15","_1 as c16","_1 as c17","_1 as c18","_1 as c19","_1 as > c20","_1 as c21","_1 as c22","_1 as c23","_1 as c24","_1 as c25","_1 as > c26","_1 as c27","_1 as c28","_1 as c29","_1 as c30") > df.write.csv("D:/data/par1/t4"); > spark.sql("create table csv_2 using csv options('path'='D:/data/par1/t5')"); > spark.sql("set spark.sql.autoBroadcastJoinThreshold=73400320").show(false) > spark.sql("set spark.sql.broadcastTimeout=2").show(false) > Run Below Query > spark.sql("create table s using parquet as select t1.* from csv_2 as > t1,csv_1 as t2 where
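The behaviour described in the comment above, that awaiting a Future with a timeout does not stop the work the Future already started, can be sketched in plain Scala (a simplified illustration, not Spark's actual BroadcastExchangeExec code; the timing constants are arbitrary):

```scala
import java.util.concurrent.atomic.AtomicBoolean

import scala.concurrent.{Await, Future, TimeoutException}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object BroadcastTimeoutDemo {
  // Returns true if the background task finished even though the await
  // on it timed out, mirroring how the broadcast job keeps running after
  // spark.sql.broadcastTimeout expires.
  def timedOutWorkStillCompletes(): Boolean = {
    val finished = new AtomicBoolean(false)
    val work = Future {
      Thread.sleep(500) // stands in for the long-running broadcast job
      finished.set(true)
    }
    try {
      Await.result(work, 100.millis) // analogous to the broadcast timeout
    } catch {
      case _: TimeoutException =>
        // The exception is thrown here, but nothing cancels `work`:
        // the task keeps executing in the background.
    }
    Thread.sleep(600) // give the background task time to finish
    finished.get()
  }

  def main(args: Array[String]): Unit =
    println(s"work completed despite timeout: ${timedOutWorkStillCompletes()}")
}
```

Actually stopping the work would require cancelling the underlying Spark job as well, for example by running it in a job group and calling SparkContext.cancelJobGroup when the timeout fires.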
[jira] [Commented] (SPARK-27060) DDL Commands are accepting Keywords like create, drop as tableName
[ https://issues.apache.org/jira/browse/SPARK-27060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784589#comment-16784589 ] Sujith Chacko commented on SPARK-27060: --- Yes, quite surprising. In Hive they validate all the keywords, but it seems that per our SqlBase.g4 grammar we accept the reserved keywords. Will analyze this further and raise a PR; let me know of any suggestions. Thanks > DDL Commands are accepting Keywords like create, drop as tableName > -- > > Key: SPARK-27060 > URL: https://issues.apache.org/jira/browse/SPARK-27060 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.2, 2.4.0 >Reporter: Sachin Ramachandra Setty >Priority: Minor > > Seems to be a compatibility issue compared to other components such as Hive > and MySQL. > DDL commands succeed even though the tableName is the same as a keyword. > Tested with columnNames as well and the issue exists. > Whereas Hive-Beeline throws a ParseException and does not accept keywords > as tableName or columnName, and MySQL accepts keywords only as columnName. 
> Spark-Behaviour : > Connected to: Spark SQL (version 2.3.2.0101) > CLI_DBMS_APPID > Beeline version 1.2.1.spark_2.3.2.0101 by Apache Hive > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table create(id int); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.255 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table drop(int int); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.257 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table drop; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.236 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table create; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.168 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table tab1(float float); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.111 seconds) > 0: jdbc:hive2://10.18.XXX:23040/default> create table double(double float); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.093 seconds) > Hive-Behaviour : > Connected to: Apache Hive (version 3.1.0) > Driver: Hive JDBC (version 3.1.0) > Transaction isolation: TRANSACTION_REPEATABLE_READ > Beeline version 3.1.0 by Apache Hive > 0: jdbc:hive2://10.18.XXX:21066/> create table create(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:13 > cannot recognize input near 'create' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18.XXX:21066/> create table drop(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:13 > cannot recognize input near 'drop' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18XXX:21066/> create table tab1(float float); > Error: Error while compiling statement: FAILED: ParseException line 1:18 > cannot recognize input near 'float' 'float' ')' in column name or constraint > (state=42000,code=4) > 0: jdbc:hive2://10.18XXX:21066/> drop table create(id int); > Error: Error while compiling 
statement: FAILED: ParseException line 1:11 > cannot recognize input near 'create' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18.XXX:21066/> drop table drop(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:11 > cannot recognize input near 'drop' '(' 'id' in table name > (state=42000,code=4) > mySql : > CREATE TABLE CREATE(ID integer); > Error: near "CREATE": syntax error > CREATE TABLE DROP(ID integer); > Error: near "DROP": syntax error > CREATE TABLE TAB1(FLOAT FLOAT); > Success -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27061) Expose 4040 port on driver service to access logs using service
Chandu Kavar created SPARK-27061: Summary: Expose 4040 port on driver service to access logs using service Key: SPARK-27061 URL: https://issues.apache.org/jira/browse/SPARK-27061 Project: Spark Issue Type: Task Components: Kubernetes Affects Versions: 2.4.0 Reporter: Chandu Kavar Currently, we can access the driver logs using {{kubectl port-forward 4040:4040}} as mentioned in [https://spark.apache.org/docs/latest/running-on-kubernetes.html#accessing-driver-ui] We have users who submit Spark jobs to Kubernetes, but they don't have access to the cluster, so they can't use the kubectl port-forward command. If we can expose port 4040 on the driver service, we can easily relay these logs to the UI using the driver service and an Nginx reverse proxy. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27060) DDL Commands are accepting Keywords like create, drop as tableName
[ https://issues.apache.org/jira/browse/SPARK-27060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784575#comment-16784575 ] Sachin Ramachandra Setty edited comment on SPARK-27060 at 3/5/19 3:40 PM: -- I verified this issue with Spark 2.3.2 and Spark 2.4.0 versions was (Author: sachin1729): I verified this issue with 2.3.2 and 2.4.0 . > DDL Commands are accepting Keywords like create, drop as tableName > -- > > Key: SPARK-27060 > URL: https://issues.apache.org/jira/browse/SPARK-27060 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.2, 2.4.0 >Reporter: Sachin Ramachandra Setty >Priority: Minor > > Seems to be a compatibility issue compared to other components such as hive > and mySql. > DDL commands are successful even though the tableName is same as keyword. > Tested with columnNames as well and issue exists. > Whereas, Hive-Beeline is throwing ParseException and not accepting keywords > as tableName or columnName and mySql is accepting keywords only as columnName. 
> Spark-Behaviour : > Connected to: Spark SQL (version 2.3.2.0101) > CLI_DBMS_APPID > Beeline version 1.2.1.spark_2.3.2.0101 by Apache Hive > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table create(id int); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.255 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table drop(int int); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.257 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table drop; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.236 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table create; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.168 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table tab1(float float); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.111 seconds) > 0: jdbc:hive2://10.18.XXX:23040/default> create table double(double float); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.093 seconds) > Hive-Behaviour : > Connected to: Apache Hive (version 3.1.0) > Driver: Hive JDBC (version 3.1.0) > Transaction isolation: TRANSACTION_REPEATABLE_READ > Beeline version 3.1.0 by Apache Hive > 0: jdbc:hive2://10.18.XXX:21066/> create table create(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:13 > cannot recognize input near 'create' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18.XXX:21066/> create table drop(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:13 > cannot recognize input near 'drop' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18XXX:21066/> create table tab1(float float); > Error: Error while compiling statement: FAILED: ParseException line 1:18 > cannot recognize input near 'float' 'float' ')' in column name or constraint > (state=42000,code=4) > 0: jdbc:hive2://10.18XXX:21066/> drop table create(id int); > Error: Error while compiling 
statement: FAILED: ParseException line 1:11 > cannot recognize input near 'create' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18.XXX:21066/> drop table drop(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:11 > cannot recognize input near 'drop' '(' 'id' in table name > (state=42000,code=4) > mySql : > CREATE TABLE CREATE(ID integer); > Error: near "CREATE": syntax error > CREATE TABLE DROP(ID integer); > Error: near "DROP": syntax error > CREATE TABLE TAB1(FLOAT FLOAT); > Success -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27005) Design sketch: Accelerator-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784555#comment-16784555 ] Thomas Graves edited comment on SPARK-27005 at 3/5/19 3:40 PM: --- so we have both a google design doc and the comment above, can you consolidate into 1 place? the google doc might be easier to comment on. I added comments to the google doc was (Author: tgraves): so we have both a google design doc and the comment above, can you consolidate into 1 place? the google doc might be easier to comment on. > Design sketch: Accelerator-aware scheduling > --- > > Key: SPARK-27005 > URL: https://issues.apache.org/jira/browse/SPARK-27005 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Priority: Major > > This task is to outline a design sketch for the accelerator-aware scheduling > SPIP discussion. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23986) CompileException when using too many avg aggregation after joining
[ https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784482#comment-16784482 ] Pedro Fernandes edited comment on SPARK-23986 at 3/5/19 3:39 PM: - -Guys, is there a workaround for the folks that can't upgrade Spark version? Thanks.- Here's my workaround for, say, 10 aggregation operations: # dataframe1 = aggregations 1 to 5 # dataframe2 = aggregations 6 to 10 # dataframe1.join(dataframe2) was (Author: pedromorfeu): ~Guys, Is there a workaround for the folks that can't upgrade Spark version? Thanks.~ Here's my workaround for, say, 10 aggregation operations: # dataframe1 = aggregations 1 to 5 # dataframe2 = aggregations 6 to 10 # dataframe1.join(dataframe2) > CompileException when using too many avg aggregation after joining > -- > > Key: SPARK-23986 > URL: https://issues.apache.org/jira/browse/SPARK-23986 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Michel Davit >Assignee: Marco Gaido >Priority: Major > Fix For: 2.3.1, 2.4.0 > > Attachments: spark-generated.java > > > Considering the following code: > {code:java} > val df1: DataFrame = sparkSession.sparkContext > .makeRDD(Seq((0, 1, 2, 3, 4, 5, 6))) > .toDF("key", "col1", "col2", "col3", "col4", "col5", "col6") > val df2: DataFrame = sparkSession.sparkContext > .makeRDD(Seq((0, "val1", "val2"))) > .toDF("key", "dummy1", "dummy2") > val agg = df1 > .join(df2, df1("key") === df2("key"), "leftouter") > .groupBy(df1("key")) > .agg( > avg("col2").as("avg2"), > avg("col3").as("avg3"), > avg("col4").as("avg4"), > avg("col1").as("avg1"), > avg("col5").as("avg5"), > avg("col6").as("avg6") > ) > val head = agg.take(1) > {code} > This logs the following exception: > {code:java} > ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 467, Column 28: Redefinition of parameter "agg_expr_11" > {code} > I am not a spark expert but after investigation, I 
realized that the > generated {{doConsume}} method is responsible for the exception. > Indeed, {{avg}} calls > {{org.apache.spark.sql.execution.CodegenSupport.constructDoConsumeFunction}} > several times. > The first time with the 'avg' Expr and a second time for the base aggregation > Expr (count and sum). > The problem comes from the generation of parameters in CodeGenerator: > {code:java} > /** >* Returns a term name that is unique within this instance of a > `CodegenContext`. >*/ > def freshName(name: String): String = synchronized { > val fullName = if (freshNamePrefix == "") { > name > } else { > s"${freshNamePrefix}_$name" > } > if (freshNameIds.contains(fullName)) { > val id = freshNameIds(fullName) > freshNameIds(fullName) = id + 1 > s"$fullName$id" > } else { > freshNameIds += fullName -> 1 > fullName > } > } > {code} > The {{freshNameIds}} already contains {{agg_expr_[1..6]}} from the 1st call. > The second call is made with {{agg_expr_[1..12]}} and generates the > following names: > {{agg_expr_[11|21|31|41|51|61|11|12]}}. We then have a parameter name > conflict in the generated code: {{agg_expr_11}}. > Appending the 'id' in s"$fullName$id" to generate a unique term name is the source > of the conflict. Maybe simply using an underscore can solve this issue: > s"${fullName}_$id" -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
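The collision can be reproduced with a stripped-down copy of the {{freshName}} logic quoted above (prefix handling omitted; a sketch for illustration, not the full CodegenContext):

```scala
import scala.collection.mutable

object FreshNameDemo {
  private val freshNameIds = mutable.HashMap.empty[String, Int]

  // Stripped-down version of the freshName logic quoted above:
  // append the current id when a name has been seen before.
  def freshName(fullName: String): String = synchronized {
    if (freshNameIds.contains(fullName)) {
      val id = freshNameIds(fullName)
      freshNameIds(fullName) = id + 1
      s"$fullName$id"
    } else {
      freshNameIds += fullName -> 1
      fullName
    }
  }

  def main(args: Array[String]): Unit = {
    val a = freshName("agg_expr_1")  // first request: "agg_expr_1"
    val b = freshName("agg_expr_1")  // seen before, id 1 appended: "agg_expr_11"
    val c = freshName("agg_expr_11") // never seen, returned as-is: "agg_expr_11"
    println(s"$a, $b, $c (collision: ${b == c})")
  }
}
```

Switching the appended suffix to s"${fullName}_$id" would make the second request yield agg_expr_1_1, which no longer clashes with a literal request for agg_expr_11.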
[jira] [Commented] (SPARK-27060) DDL Commands are accepting Keywords like create, drop as tableName
[ https://issues.apache.org/jira/browse/SPARK-27060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784575#comment-16784575 ] Sachin Ramachandra Setty commented on SPARK-27060: -- I verified this issue with 2.3.2 and 2.4.0 . > DDL Commands are accepting Keywords like create, drop as tableName > -- > > Key: SPARK-27060 > URL: https://issues.apache.org/jira/browse/SPARK-27060 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.2, 2.4.0 >Reporter: Sachin Ramachandra Setty >Priority: Minor > > Seems to be a compatibility issue compared to other components such as hive > and mySql. > DDL commands are successful even though the tableName is same as keyword. > Tested with columnNames as well and issue exists. > Whereas, Hive-Beeline is throwing ParseException and not accepting keywords > as tableName or columnName and mySql is accepting keywords only as columnName. > Spark-Behaviour : > Connected to: Spark SQL (version 2.3.2.0101) > CLI_DBMS_APPID > Beeline version 1.2.1.spark_2.3.2.0101 by Apache Hive > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table create(id int); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.255 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table drop(int int); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.257 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table drop; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.236 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table create; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.168 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table tab1(float float); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.111 seconds) > 0: jdbc:hive2://10.18.XXX:23040/default> create table double(double float); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.093 seconds) > Hive-Behaviour : > Connected to: Apache Hive 
(version 3.1.0) > Driver: Hive JDBC (version 3.1.0) > Transaction isolation: TRANSACTION_REPEATABLE_READ > Beeline version 3.1.0 by Apache Hive > 0: jdbc:hive2://10.18.XXX:21066/> create table create(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:13 > cannot recognize input near 'create' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18.XXX:21066/> create table drop(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:13 > cannot recognize input near 'drop' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18XXX:21066/> create table tab1(float float); > Error: Error while compiling statement: FAILED: ParseException line 1:18 > cannot recognize input near 'float' 'float' ')' in column name or constraint > (state=42000,code=4) > 0: jdbc:hive2://10.18XXX:21066/> drop table create(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:11 > cannot recognize input near 'create' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18.XXX:21066/> drop table drop(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:11 > cannot recognize input near 'drop' '(' 'id' in table name > (state=42000,code=4) > mySql : > CREATE TABLE CREATE(ID integer); > Error: near "CREATE": syntax error > CREATE TABLE DROP(ID integer); > Error: near "DROP": syntax error > CREATE TABLE TAB1(FLOAT FLOAT); > Success -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23986) CompileException when using too many avg aggregation after joining
[ https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784482#comment-16784482 ] Pedro Fernandes edited comment on SPARK-23986 at 3/5/19 3:38 PM: - -Guys, Is there a workaround for the folks that can't upgrade Spark version? Thanks.- Here's my workaround for, say, 10 aggregation operations: # dataframe1 = aggregations 1 to 5 # dataframe2 = aggregations 6 to 10 # dataframe1.join(dataframe2) was (Author: pedromorfeu): Guys, Is there a workaround for the folks that can't upgrade Spark version? Thanks. > CompileException when using too many avg aggregation after joining > -- > > Key: SPARK-23986 > URL: https://issues.apache.org/jira/browse/SPARK-23986 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Michel Davit >Assignee: Marco Gaido >Priority: Major > Fix For: 2.3.1, 2.4.0 > > Attachments: spark-generated.java > > > Considering the following code: > {code:java} > val df1: DataFrame = sparkSession.sparkContext > .makeRDD(Seq((0, 1, 2, 3, 4, 5, 6))) > .toDF("key", "col1", "col2", "col3", "col4", "col5", "col6") > val df2: DataFrame = sparkSession.sparkContext > .makeRDD(Seq((0, "val1", "val2"))) > .toDF("key", "dummy1", "dummy2") > val agg = df1 > .join(df2, df1("key") === df2("key"), "leftouter") > .groupBy(df1("key")) > .agg( > avg("col2").as("avg2"), > avg("col3").as("avg3"), > avg("col4").as("avg4"), > avg("col1").as("avg1"), > avg("col5").as("avg5"), > avg("col6").as("avg6") > ) > val head = agg.take(1) > {code} > This logs the following exception: > {code:java} > ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 467, Column 28: Redefinition of parameter "agg_expr_11" > {code} > I am not a spark expert but after investigation, I realized that the > generated {{doConsume}} method is responsible of the exception. 
> Indeed, {{avg}} calls > {{org.apache.spark.sql.execution.CodegenSupport.constructDoConsumeFunction}} > several times. > The first time with the 'avg' Expr and a second time for the base aggregation > Expr (count and sum). > The problem comes from the generation of parameters in CodeGenerator: > {code:java} > /** >* Returns a term name that is unique within this instance of a > `CodegenContext`. >*/ > def freshName(name: String): String = synchronized { > val fullName = if (freshNamePrefix == "") { > name > } else { > s"${freshNamePrefix}_$name" > } > if (freshNameIds.contains(fullName)) { > val id = freshNameIds(fullName) > freshNameIds(fullName) = id + 1 > s"$fullName$id" > } else { > freshNameIds += fullName -> 1 > fullName > } > } > {code} > The {{freshNameIds}} already contains {{agg_expr_[1..6]}} from the 1st call. > The second call is made with {{agg_expr_[1..12]}} and generates the > following names: > {{agg_expr_[11|21|31|41|51|61|11|12]}}. We then have a parameter name > conflict in the generated code: {{agg_expr_11}}. > Appending the 'id' in s"$fullName$id" to generate a unique term name is the source > of the conflict. Maybe simply using an underscore can solve this issue: > s"${fullName}_$id" -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23986) CompileException when using too many avg aggregation after joining
[ https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784482#comment-16784482 ] Pedro Fernandes edited comment on SPARK-23986 at 3/5/19 3:38 PM: - ~Guys, Is there a workaround for the folks that can't upgrade Spark version? Thanks.~ Here's my workaround for, say, 10 aggregation operations: # dataframe1 = aggregations 1 to 5 # dataframe2 = aggregations 6 to 10 # dataframe1.join(dataframe2) was (Author: pedromorfeu): -Guys, Is there a workaround for the folks that can't upgrade Spark version? Thanks.- Here's my workaround for, say, 10 aggregation operations: # dataframe1 = aggregations 1 to 5 # dataframe2 = aggregations 6 to 10 # dataframe1.join(dataframe2) > CompileException when using too many avg aggregation after joining > -- > > Key: SPARK-23986 > URL: https://issues.apache.org/jira/browse/SPARK-23986 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Michel Davit >Assignee: Marco Gaido >Priority: Major > Fix For: 2.3.1, 2.4.0 > > Attachments: spark-generated.java > > > Considering the following code: > {code:java} > val df1: DataFrame = sparkSession.sparkContext > .makeRDD(Seq((0, 1, 2, 3, 4, 5, 6))) > .toDF("key", "col1", "col2", "col3", "col4", "col5", "col6") > val df2: DataFrame = sparkSession.sparkContext > .makeRDD(Seq((0, "val1", "val2"))) > .toDF("key", "dummy1", "dummy2") > val agg = df1 > .join(df2, df1("key") === df2("key"), "leftouter") > .groupBy(df1("key")) > .agg( > avg("col2").as("avg2"), > avg("col3").as("avg3"), > avg("col4").as("avg4"), > avg("col1").as("avg1"), > avg("col5").as("avg5"), > avg("col6").as("avg6") > ) > val head = agg.take(1) > {code} > This logs the following exception: > {code:java} > ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 467, Column 28: Redefinition of parameter "agg_expr_11" > {code} > I am not a spark expert but after investigation, I 
realized that the > generated {{doConsume}} method is responsible for the exception. > Indeed, {{avg}} calls > {{org.apache.spark.sql.execution.CodegenSupport.constructDoConsumeFunction}} > several times. > The first time with the 'avg' Expr and a second time for the base aggregation > Expr (count and sum). > The problem comes from the generation of parameters in CodeGenerator: > {code:java} > /** >* Returns a term name that is unique within this instance of a > `CodegenContext`. >*/ > def freshName(name: String): String = synchronized { > val fullName = if (freshNamePrefix == "") { > name > } else { > s"${freshNamePrefix}_$name" > } > if (freshNameIds.contains(fullName)) { > val id = freshNameIds(fullName) > freshNameIds(fullName) = id + 1 > s"$fullName$id" > } else { > freshNameIds += fullName -> 1 > fullName > } > } > {code} > The {{freshNameIds}} already contains {{agg_expr_[1..6]}} from the 1st call. > The second call is made with {{agg_expr_[1..12]}} and generates the > following names: > {{agg_expr_[11|21|31|41|51|61|11|12]}}. We then have a parameter name > conflict in the generated code: {{agg_expr_11}}. > Appending the 'id' in s"$fullName$id" to generate a unique term name is the source > of the conflict. Maybe simply using an underscore can solve this issue: > s"${fullName}_$id" -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27060) DDL Commands are accepting Keywords like create, drop as tableName
[ https://issues.apache.org/jira/browse/SPARK-27060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-27060: -- Target Version/s: (was: 2.4.0) Priority: Minor (was: Major) Fix Version/s: (was: 2.3.2) (was: 2.4.0) Don't set Fix or Target Version. This isn't my area, but I agree it seems surprising if you can create a table called "CREATE". Please post your Spark reproduction and version though. > DDL Commands are accepting Keywords like create, drop as tableName > -- > > Key: SPARK-27060 > URL: https://issues.apache.org/jira/browse/SPARK-27060 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.2, 2.4.0 >Reporter: Sachin Ramachandra Setty >Priority: Minor > > Seems to be a compatibility issue compared to other components such as hive > and mySql. > DDL commands are successful even though the tableName is same as keyword. > Tested with columnNames as well and issue exists. > Whereas, Hive-Beeline is throwing ParseException and not accepting keywords > as tableName or columnName and mySql is accepting keywords only as columnName. 
> Spark-Behaviour :
> Connected to: Spark SQL (version 2.3.2.0101)
> CLI_DBMS_APPID
> Beeline version 1.2.1.spark_2.3.2.0101 by Apache Hive
> 0: jdbc:hive2://10.18.3.XXX:23040/default> create table create(id int);
> +-+--+
> | Result |
> +-+--+
> +-+--+
> No rows selected (0.255 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> create table drop(int int);
> +-+--+
> | Result |
> +-+--+
> +-+--+
> No rows selected (0.257 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table drop;
> +-+--+
> | Result |
> +-+--+
> +-+--+
> No rows selected (0.236 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table create;
> +-+--+
> | Result |
> +-+--+
> +-+--+
> No rows selected (0.168 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> create table tab1(float float);
> +-+--+
> | Result |
> +-+--+
> +-+--+
> No rows selected (0.111 seconds)
> 0: jdbc:hive2://10.18.XXX:23040/default> create table double(double float);
> +-+--+
> | Result |
> +-+--+
> +-+--+
> No rows selected (0.093 seconds)
>
> Hive-Behaviour :
> Connected to: Apache Hive (version 3.1.0)
> Driver: Hive JDBC (version 3.1.0)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> Beeline version 3.1.0 by Apache Hive
> 0: jdbc:hive2://10.18.XXX:21066/> create table create(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:13
> cannot recognize input near 'create' '(' 'id' in table name
> (state=42000,code=4)
> 0: jdbc:hive2://10.18.XXX:21066/> create table drop(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:13
> cannot recognize input near 'drop' '(' 'id' in table name
> (state=42000,code=4)
> 0: jdbc:hive2://10.18XXX:21066/> create table tab1(float float);
> Error: Error while compiling statement: FAILED: ParseException line 1:18
> cannot recognize input near 'float' 'float' ')' in column name or constraint
> (state=42000,code=4)
> 0: jdbc:hive2://10.18XXX:21066/> drop table create(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:11
> cannot recognize input near 'create' '(' 'id' in table name
> (state=42000,code=4)
> 0: jdbc:hive2://10.18.XXX:21066/> drop table drop(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:11
> cannot recognize input near 'drop' '(' 'id' in table name
> (state=42000,code=4)
>
> mySql :
> CREATE TABLE CREATE(ID integer);
> Error: near "CREATE": syntax error
> CREATE TABLE DROP(ID integer);
> Error: near "DROP": syntax error
> CREATE TABLE TAB1(FLOAT FLOAT);
> Success

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
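The "mySql" errors quoted above (`near "CREATE": syntax error`) actually match SQLite's error format, so the keyword-rejection behavior the reporter expects from other engines can be reproduced with Python's built-in sqlite3 module. This is an illustrative sketch under that assumption, not the reporter's actual setup:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Reserved keywords such as CREATE and DROP are rejected as unquoted
# table names by the parser, mirroring the transcript above.
for ddl in ("CREATE TABLE CREATE(ID integer)",
            "CREATE TABLE DROP(ID integer)"):
    try:
        cur.execute(ddl)
        print("accepted:", ddl)
    except sqlite3.OperationalError as e:
        print("rejected:", e)

# A non-reserved word like FLOAT is accepted even as both a column
# name and a type name, matching the "Success" case above.
cur.execute("CREATE TABLE TAB1(FLOAT FLOAT)")
print("accepted: CREATE TABLE TAB1(FLOAT FLOAT)")
```

Spark's ANTLR grammar, by contrast, treats most keywords as non-reserved identifiers, which is why the beeline session against Spark SQL accepts them.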
[jira] [Commented] (SPARK-27060) DDL Commands are accepting Keywords like create, drop as tableName
[ https://issues.apache.org/jira/browse/SPARK-27060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784560#comment-16784560 ] Sachin Ramachandra Setty commented on SPARK-27060:
--
cc [~srowen]

> DDL Commands are accepting Keywords like create, drop as tableName
> ------------------------------------------------------------------
>
> Key: SPARK-27060
> URL: https://issues.apache.org/jira/browse/SPARK-27060
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.2, 2.4.0
> Reporter: Sachin Ramachandra Setty
> Priority: Major
> Fix For: 2.3.2, 2.4.0
[jira] [Commented] (SPARK-27005) Design sketch: Accelerator-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784555#comment-16784555 ] Thomas Graves commented on SPARK-27005:
---
So we have both a Google design doc and the comment above; can you consolidate them into one place? The Google doc might be easier to comment on.

> Design sketch: Accelerator-aware scheduling
> -------------------------------------------
>
> Key: SPARK-27005
> URL: https://issues.apache.org/jira/browse/SPARK-27005
> Project: Spark
> Issue Type: Story
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Xingbo Jiang
> Priority: Major
>
> This task is to outline a design sketch for the accelerator-aware scheduling
> SPIP discussion.
[jira] [Comment Edited] (SPARK-27060) DDL Commands are accepting Keywords like create, drop as tableName
[ https://issues.apache.org/jira/browse/SPARK-27060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784503#comment-16784503 ] Sujith Chacko edited comment on SPARK-27060 at 3/5/19 3:11 PM:
---
This looks like a compatibility issue with other engines. Will try to handle these cases. cc [~sro...@scient.com] [cloud-fan|https://github.com/apache/spark/issues?q=is%3Apr+is%3Aopen+author%3Acloud-fan] [~sro...@scient.com] [~sro...@yahoo.com] let us know for any suggestions. Thanks

was (Author: s71955): This looks like a compatibility issue with other engines. Will try to handle this cases. cc [~sro...@scient.com] [cloud-fan|https://github.com/apache/spark/issues?q=is%3Apr+is%3Aopen+author%3Acloud-fan] let us know for any suggestions. Thanks

> DDL Commands are accepting Keywords like create, drop as tableName
> ------------------------------------------------------------------
>
> Key: SPARK-27060
> URL: https://issues.apache.org/jira/browse/SPARK-27060
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.2, 2.4.0
> Reporter: Sachin Ramachandra Setty
> Priority: Major
> Fix For: 2.3.2, 2.4.0
[jira] [Commented] (SPARK-26602) Insert into table fails after querying the UDF which is loaded with wrong hdfs path
[ https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784528#comment-16784528 ] Sean Owen commented on SPARK-26602:
---
If a user adds something to the classpath, it matters to the whole classpath. If it's missing, I think it's surprising to ignore that fact; something else will fail eventually. I understand you're asking: what if it doesn't affect some other UDFs? But I'm not sure we can know that. I would not make this change.

> Insert into table fails after querying the UDF which is loaded with wrong
> hdfs path
> -------------------------------------------------------------------------
>
> Key: SPARK-26602
> URL: https://issues.apache.org/jira/browse/SPARK-26602
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Haripriya
> Priority: Major
> Attachments: beforeFixUdf.txt
>
> In SQL:
> 1. Query the existing UDF (say myFunc1).
> 2. Create and select the UDF registered with an incorrect path (say myFunc2).
> 3. Now query the existing UDF again in the same session - it will throw an
> exception stating that the resource at myFunc2's path couldn't be read.
> 4. Even basic operations like insert and select will fail with the same
> error.
>
> Result:
> java.lang.RuntimeException: Failed to read external resource
> hdfs:///tmp/hari_notexists1/two_udfs.jar
> at org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288)
> at org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242)
> at org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163)
> at org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149)
> at org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67)
> at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737)
> at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706)
> at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
> at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
> at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
> at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
> at org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706)
> at org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696)
> at org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841)
> at org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112)
[jira] [Commented] (SPARK-26602) Insert into table fails after querying the UDF which is loaded with wrong hdfs path
[ https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784526#comment-16784526 ] Chakravarthi commented on SPARK-26602:
--
[~srowen] Agreed, but it should not make other subsequent queries (at least queries that do not refer to that UDF) fail, right? Any insert or select on an existing table itself is failing.
[~ajithshetty] Yes, it makes all subsequent queries fail, not only the query that refers to that UDF.

> Insert into table fails after querying the UDF which is loaded with wrong
> hdfs path
> -------------------------------------------------------------------------
>
> Key: SPARK-26602
> URL: https://issues.apache.org/jira/browse/SPARK-26602
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Haripriya
> Priority: Major
> Attachments: beforeFixUdf.txt
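The failure mode under discussion — one unresolvable ADD JAR poisoning every later statement in the session, because registered resources are re-resolved before each command — can be sketched with a toy model. All names here are hypothetical illustrations; this is not Spark's or Hive's actual code:

```python
# Simplified model of the SPARK-26602 behavior: the session keeps a list of
# registered resources and re-resolves all of them before every command, so
# one bad path makes even unrelated queries fail.
class Session:
    def __init__(self):
        self.resources = []

    def add_resource(self, path, available):
        # Registration succeeds eagerly; availability is only checked later,
        # when the next command re-resolves the full resource list.
        self.resources.append((path, available))

    def run(self, query):
        for path, available in self.resources:
            if not available:
                raise RuntimeError(f"Failed to read external resource {path}")
        return f"ok: {query}"

s = Session()
s.add_resource("hdfs:///udfs/good.jar", available=True)
print(s.run("SELECT myFunc1(col) FROM t"))       # succeeds

s.add_resource("hdfs:///tmp/missing/udf.jar", available=False)
try:
    # An unrelated query also fails, matching the reported behavior.
    s.run("INSERT INTO t VALUES (1)")
except RuntimeError as e:
    print(e)
```

This illustrates why Sean Owen's objection holds: once the resource is part of the session's classpath state, there is no reliable way to know which later queries it does or does not affect.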