[jira] [Commented] (SPARK-26998) spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor processes in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-26998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16785277#comment-16785277 ] Jungtaek Lim commented on SPARK-26998: -- [~toopt4] Yeah, I tend to agree that hiding more credentials is better, so I'm supportive of the change. Maybe also update the description of the JIRA issue where your patch originally landed. Btw, are there any existing tests or manual tests to verify that the keystore password and key password are not used? Just curious; I honestly don't know. > spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor > processes in Standalone mode > --- > > Key: SPARK-26998 > URL: https://issues.apache.org/jira/browse/SPARK-26998 > Project: Spark > Issue Type: Bug > Components: Scheduler, Security, Spark Core >Affects Versions: 2.3.3, 2.4.0 >Reporter: t oo >Priority: Major > Labels: SECURITY, Security, secur, security, security-issue > > Run Spark in standalone mode, then start a spark-submit requiring at least 1 > executor. Do a 'ps -ef' on Linux (e.g. in a PuTTY terminal) and you will be able to > see the spark.ssl.keyStorePassword value in plaintext! > > spark.ssl.keyStorePassword and spark.ssl.keyPassword don't need to be passed > to CoarseGrainedExecutorBackend. Only spark.ssl.trustStorePassword is used. > > Can be resolved if the below PR is merged: > [[Github] Pull Request #21514 > (tooptoop4)|https://github.com/apache/spark/pull/21514] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
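The PR referenced above removes the configs from the executor command line entirely, which is the right fix: key-based redaction (the approach behind Spark's `spark.redaction.regex` setting) only protects logs and the UI, not the live `ps -ef` command line. For illustration, a minimal Python sketch of that redaction idea; the pattern and the `*********(redacted)` mask are modeled on Spark's defaults, but this is not Spark's actual implementation:

```python
import re

# Assumed key pattern, modeled on Spark's default spark.redaction.regex
# ("(?i)secret|password|token"): any key matching it gets its value masked.
REDACTION_PATTERN = re.compile(r"(?i)secret|password|token")

def redact(args):
    """Mask the value of any 'key=value' argument whose key looks sensitive,
    so the argument list can be logged without leaking credentials."""
    redacted = []
    for arg in args:
        key, sep, _value = arg.partition("=")
        if sep and REDACTION_PATTERN.search(key):
            redacted.append(key + "=*********(redacted)")
        else:
            redacted.append(arg)
    return redacted
```

Applied to the command line from the report, `-Dspark.ssl.keyStorePassword=...` would be masked while `-Dspark.master=...` passes through unchanged; again, for the `ps -ef` case the only real remedy is to not pass the secret at all.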
[jira] [Created] (SPARK-27069) Spark (2.3.1) LDA transformation memory error (java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123))
TAESUK KIM created SPARK-27069: -- Summary: Spark (2.3.1) LDA transformation memory error (java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)) Key: SPARK-27069 URL: https://issues.apache.org/jira/browse/SPARK-27069 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.3.2 Environment: Below is my environment. DataSet # Documents: about 100,000,000 --> 10,000,000 --> 1,000,000 (all fail) # Words: about 3,553,918 (can't change) Spark environment # executor-memory, driver-memory: 18G --> 32G --> 64G --> 128G (all fail) # executor-cores, driver-cores: 3 # spark.serializer: default and org.apache.spark.serializer.KryoSerializer (both fail) # spark.executor.memoryOverhead: 18G --> 36G (fail) Java version: 1.8.0_191 (Oracle Corporation) Reporter: TAESUK KIM I trained an LDA model (feature dimension: 100, iterations: 100 or 50, distributed version, ml) using Spark 2.3.2 (emr-5.18.0). After that I wanted to transform a new DataSet using that model. But when I transform the new data, I always get a memory error (java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)). I reduced the data size to 0.1x, then 0.01x, but I still always get the error. That hugeCapacity error (overflow) happens when the size of an array exceeds Integer.MAX_VALUE - 8, yet I had already reduced the data to a small size, so I can't find why this error happens. I also wanted to switch the serializer to KryoSerializer, but I found that org.apache.spark.util.ClosureCleaner$.ensureSerializable always calls org.apache.spark.serializer.JavaSerializationStream even though I register Kryo classes. Is there anything I can do?
Below is the code:
{code:java}
val countvModel = CountVectorizerModel.load("s3://~/")
val ldaModel = DistributedLDAModel.load("s3://~/")
val transformeddata = countvModel.transform(inputData).select("productid", "itemid", "ptkString", "features")
var featureldaDF = ldaModel.transform(transformeddata).select("productid", "itemid", "topicDistribution", "ptkString").toDF("productid", "itemid", "features", "ptkString")
featureldaDF = featureldaDF.persist // this is line 328
{code}
Other testing # Java GC options: UseParallelGC, UseG1GC (all fail)
Below is the log:
{code}
19/03/05 20:59:03 ERROR ApplicationMaster: User class threw exception: java.lang.OutOfMemoryError
java.lang.OutOfMemoryError
	at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
	at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
	at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
	at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
	at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
	at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
	at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
	at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
	at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
	at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:342)
	at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:335)
	at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
	at org.apache.spark.SparkContext.clean(SparkContext.scala:2299)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:850)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:849)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:849)
	at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:608)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.columnar.InMemoryRelation.buildBuffers(InMemoryRelation.scala:107)
	at
{code}
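For context on the stack trace above: an OutOfMemoryError thrown from ByteArrayOutputStream.hugeCapacity (line 123 in JDK 8) means the requested capacity went negative, i.e. the 32-bit int capacity overflowed because a single Java-serialized stream crossed roughly 2 GB. Here the stream is produced by ClosureCleaner's serializability check, which (matching the reporter's observation) uses the JVM's Java serializer regardless of spark.serializer, so the closure or plan being serialized is huge independent of the input data size. A sketch mirroring the JDK 8 logic, with Python ints wrapped to Java's 32-bit semantics for illustration:

```python
# JDK 8 constants: arrays a little below Integer.MAX_VALUE are the practical cap.
INT_MAX = 2**31 - 1
MAX_ARRAY_SIZE = INT_MAX - 8  # the "Integer.MAX_VALUE - 8" from the description

def to_int32(n):
    """Wrap an unbounded Python int to Java's signed 32-bit overflow behavior."""
    n &= 0xFFFFFFFF
    return n - 2**32 if n >= 2**31 else n

def huge_capacity(min_capacity):
    """Mirror of java.io.ByteArrayOutputStream.hugeCapacity (JDK 8):
    a negative request means the capacity computation overflowed -> OOM."""
    if min_capacity < 0:
        raise MemoryError("java.lang.OutOfMemoryError")
    return INT_MAX if min_capacity > MAX_ARRAY_SIZE else MAX_ARRAY_SIZE

# A 1.5 GiB buffer doubling its capacity wraps the int request negative,
# which is the failure mode in the report.
doubled = to_int32((3 * 2**29) * 2)
```

So shrinking the input does not help once the serialized object itself is near the 2 GB array limit; the thing to shrink is what gets captured into the serialized closure/plan.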
[jira] [Created] (SPARK-27068) Support failed jobs ui and completed jobs ui use different queue
zhoukang created SPARK-27068: Summary: Support failed jobs ui and completed jobs ui use different queue Key: SPARK-27068 URL: https://issues.apache.org/jira/browse/SPARK-27068 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 2.4.0 Reporter: zhoukang For some long-running applications, we may want to look into the cause of failed jobs. But by then most jobs have completed, and the failed jobs may already have been evicted from the UI. We could use a different retention queue for these two kinds of jobs.
[jira] [Commented] (SPARK-27045) SQL tab in UI shows callsite instead of actual SQL
[ https://issues.apache.org/jira/browse/SPARK-27045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16785077#comment-16785077 ] Dongjoon Hyun commented on SPARK-27045: --- [~ajithshetty]. If this is not a regression in 2.3.2, we had better make this an `Improvement` issue. > SQL tab in UI shows callsite instead of actual SQL > -- > > Key: SPARK-27045 > URL: https://issues.apache.org/jira/browse/SPARK-27045 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.3.2, 2.3.3, 3.0.0 >Reporter: Ajith S >Priority: Major > Attachments: image-2019-03-04-18-24-27-469.png, > image-2019-03-04-18-24-54-053.png > > > When we run SQL in Spark (for example via the Thrift server), the Spark UI SQL > tab should show the SQL text, which is more useful to the end user, instead of a stacktrace. > Currently the description column shows the callsite short form, which is > less useful. > Actual: > !image-2019-03-04-18-24-27-469.png! > > Expected: > !image-2019-03-04-18-24-54-053.png!
[jira] [Assigned] (SPARK-26922) Set socket timeout consistently in Arrow optimization
[ https://issues.apache.org/jira/browse/SPARK-26922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-26922: Assignee: Hyukjin Kwon > Set socket timeout consistently in Arrow optimization > - > > Key: SPARK-26922 > URL: https://issues.apache.org/jira/browse/SPARK-26922 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Trivial > > For instance, see > https://github.com/apache/spark/blob/e8982ca7ad94e98d907babf2d6f1068b7cd064c6/R/pkg/R/context.R#L184 > it should set the timeout from {{SPARKR_BACKEND_CONNECTION_TIMEOUT}}. Or > maybe we need another environment variable. > This might be fixed together when the code around there is > touched.
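The pattern the issue asks for is simple: resolve the socket timeout from the {{SPARKR_BACKEND_CONNECTION_TIMEOUT}} environment variable instead of hard-coding it, with a fallback when the variable is unset. Sketched in Python for illustration (the variable name is from the issue; the 6000-second default is an assumption, not SparkR's documented value):

```python
import os

DEFAULT_TIMEOUT_SECS = 6000  # assumed default, for illustration only

def backend_connection_timeout(env=os.environ):
    """Resolve the backend socket timeout from the environment,
    falling back to the default when unset or malformed."""
    raw = env.get("SPARKR_BACKEND_CONNECTION_TIMEOUT")
    try:
        return int(raw)
    except (TypeError, ValueError):
        return DEFAULT_TIMEOUT_SECS
```

Every place that opens a socket to the backend would then call this resolver, which is what "set the timeout consistently" amounts to.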
[jira] [Resolved] (SPARK-26922) Set socket timeout consistently in Arrow optimization
[ https://issues.apache.org/jira/browse/SPARK-26922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26922. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23971 [https://github.com/apache/spark/pull/23971] > Set socket timeout consistently in Arrow optimization > - > > Key: SPARK-26922 > URL: https://issues.apache.org/jira/browse/SPARK-26922 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Trivial > Fix For: 3.0.0 > > > For instance, see > https://github.com/apache/spark/blob/e8982ca7ad94e98d907babf2d6f1068b7cd064c6/R/pkg/R/context.R#L184 > it should set the timeout from {{SPARKR_BACKEND_CONNECTION_TIMEOUT}}. Or > maybe we need another environment variable. > This might be fixed together when the code around there is > touched.
[jira] [Assigned] (SPARK-26881) Scaling issue with Gramian computation for RowMatrix: too many results sent to driver
[ https://issues.apache.org/jira/browse/SPARK-26881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26881: Assignee: (was: Apache Spark) > Scaling issue with Gramian computation for RowMatrix: too many results sent > to driver > - > > Key: SPARK-26881 > URL: https://issues.apache.org/jira/browse/SPARK-26881 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.2.0 >Reporter: Rafael RENAUDIN-AVINO >Priority: Minor > > This issue hit me when running PCA on a large dataset (~1 billion rows, ~30k > columns). Computing the Gramian of a big RowMatrix reproduces the issue. > > The problem arises in the treeAggregate phase of the Gramian matrix > computation: the results sent to the driver are enormous. > A potential solution could be to replace the hard-coded depth (2) of > the tree aggregation with a heuristic computed from the number of > partitions, the driver max result size, and the memory size of the dense vectors that > are being aggregated, cf. below for more detail: > (nb_partitions)^(1/depth) * dense_vector_size <= driver_max_result_size > I have a potential fix ready (currently testing it at scale), but I'd like to > hear the community's opinion about such a fix to know if it's worth investing > my time into a clean pull request. > > Note that I only faced this issue with Spark 2.2 but I suspect it affects > later versions as well.
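The proposed heuristic can be made concrete by solving the reporter's inequality, (nb_partitions)^(1/depth) * dense_vector_size <= driver_max_result_size, for depth: depth >= log(nb_partitions) / log(driver_max_result_size / dense_vector_size). A minimal sketch under those assumptions (illustrative names, not Spark's API; Spark's current hard-coded depth is 2):

```python
import math

def suggested_tree_depth(num_partitions, vector_size_bytes, max_result_size_bytes):
    """Smallest treeAggregate depth satisfying the reporter's constraint
        num_partitions ** (1 / depth) * vector_size <= max_result_size,
    i.e. depth >= log(num_partitions) / log(max_result_size / vector_size).
    Falls back to the current hard-coded depth of 2 when that is enough."""
    ratio = max_result_size_bytes / vector_size_bytes
    if num_partitions <= 1 or ratio >= num_partitions:
        return 2  # the default depth already satisfies the bound
    if ratio <= 1:
        raise ValueError("one aggregated vector already exceeds maxResultSize")
    return max(2, math.ceil(math.log(num_partitions) / math.log(ratio)))

# e.g. 10,000 partitions of 100 MiB dense vectors under a 1 GiB result limit
depth = suggested_tree_depth(10_000, 100 * 2**20, 2**30)
```

The ValueError branch captures the hard limit of this approach: no depth helps once a single aggregated vector (here, a slice of the Gramian) is itself larger than the driver's max result size.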
[jira] [Assigned] (SPARK-26881) Scaling issue with Gramian computation for RowMatrix: too many results sent to driver
[ https://issues.apache.org/jira/browse/SPARK-26881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26881: Assignee: Apache Spark > Scaling issue with Gramian computation for RowMatrix: too many results sent > to driver > - > > Key: SPARK-26881 > URL: https://issues.apache.org/jira/browse/SPARK-26881 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.2.0 >Reporter: Rafael RENAUDIN-AVINO >Assignee: Apache Spark >Priority: Minor > > This issue hit me when running PCA on a large dataset (~1 billion rows, ~30k > columns). Computing the Gramian of a big RowMatrix reproduces the issue. > > The problem arises in the treeAggregate phase of the Gramian matrix > computation: the results sent to the driver are enormous. > A potential solution could be to replace the hard-coded depth (2) of > the tree aggregation with a heuristic computed from the number of > partitions, the driver max result size, and the memory size of the dense vectors that > are being aggregated, cf. below for more detail: > (nb_partitions)^(1/depth) * dense_vector_size <= driver_max_result_size > I have a potential fix ready (currently testing it at scale), but I'd like to > hear the community's opinion about such a fix to know if it's worth investing > my time into a clean pull request. > > Note that I only faced this issue with Spark 2.2 but I suspect it affects > later versions as well.
[jira] [Commented] (SPARK-26742) Bump Kubernetes Client Version to 4.1.2
[ https://issues.apache.org/jira/browse/SPARK-26742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784970#comment-16784970 ] Stavros Kontopoulos commented on SPARK-26742: - [~jiaxin] I think. > Bump Kubernetes Client Version to 4.1.2 > --- > > Key: SPARK-26742 > URL: https://issues.apache.org/jira/browse/SPARK-26742 > Project: Spark > Issue Type: Dependency upgrade > Components: Kubernetes >Affects Versions: 2.4.0, 3.0.0 >Reporter: Steve Davids >Priority: Major > Labels: easyfix > Fix For: 3.0.0 > > > Spark 2.x is using Kubernetes Client 3.x which is pretty old, the master > branch has 4.0, the client should be upgraded to 4.1.1 to have the broadest > Kubernetes compatibility support for newer clusters: > https://github.com/fabric8io/kubernetes-client#compatibility-matrix
[jira] [Resolved] (SPARK-26727) CREATE OR REPLACE VIEW query fails with TableAlreadyExistsException
[ https://issues.apache.org/jira/browse/SPARK-26727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-26727. - Resolution: Not A Bug > CREATE OR REPLACE VIEW query fails with TableAlreadyExistsException > --- > > Key: SPARK-26727 > URL: https://issues.apache.org/jira/browse/SPARK-26727 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Srinivas Yarra >Priority: Major > > We experienced that sometimes the Hive query "CREATE OR REPLACE VIEW <view name> > AS SELECT <columns> FROM <table>" fails with the following exception: > {code:java} > // code placeholder > org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or > view '<view name>' already exists in database 'default'; at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:314) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:165) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195) at > org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195) at > org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364) at > org.apache.spark.sql.Dataset.(Dataset.scala:195) at > org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:80) at > org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642) 
... 49 elided > {code} > {code} > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res1: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res2: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res3: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res4: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res5: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res6: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res7: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res8: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res9: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res10: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res11: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") > org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or > view 'testsparkreplace' already exists in database 'default'; at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:246) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:236) > at > 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:236) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) > at > org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:236) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:319) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:165) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at >
[jira] [Commented] (SPARK-26727) CREATE OR REPLACE VIEW query fails with TableAlreadyExistsException
[ https://issues.apache.org/jira/browse/SPARK-26727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16784956#comment-16784956 ] Xiao Li commented on SPARK-26727: - I resolved the ticket as "Not a bug". This is kind of a well-known issue. We are trying to implement a new Catalog API and data source API in Spark 3.x. These issues will be gone for catalogs/data sources that can guarantee atomicity. > CREATE OR REPLACE VIEW query fails with TableAlreadyExistsException > --- > > Key: SPARK-26727 > URL: https://issues.apache.org/jira/browse/SPARK-26727 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Srinivas Yarra >Priority: Major > > We experienced that sometimes the Hive query "CREATE OR REPLACE VIEW <view name> > AS SELECT <columns> FROM <table>" fails with the following exception: > {code:java} > // code placeholder > org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or > view '<view name>' already exists in database 'default'; at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:314) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:165) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195) at > org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195) at > org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364) at > org.apache.spark.sql.Dataset.(Dataset.scala:195) at > org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:80) at > org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642) ... 49 elided > {code} > {code} > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res1: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res2: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res3: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res4: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res5: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res6: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res7: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res8: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res9: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res10: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res11: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") > 
org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or > view 'testsparkreplace' already exists in database 'default'; at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:246) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:236) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:236) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) > at > org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:236) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94) > at >
[jira] [Commented] (SPARK-26775) Update Jenkins nodes to support local volumes for K8s integration tests
[ https://issues.apache.org/jira/browse/SPARK-26775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16784953#comment-16784953 ] shane knapp commented on SPARK-26775: - btw, once https://issues.apache.org/jira/browse/SPARK-26742 is taken care of, we can continue w/this. > Update Jenkins nodes to support local volumes for K8s integration tests > --- > > Key: SPARK-26775 > URL: https://issues.apache.org/jira/browse/SPARK-26775 > Project: Spark > Issue Type: Improvement > Components: jenkins, Kubernetes >Affects Versions: 3.0.0 >Reporter: Stavros Kontopoulos >Assignee: shane knapp >Priority: Major > > The current version of Minikube on the test machines does not properly support the > local persistent volume feature required by this PR: > [https://github.com/apache/spark/pull/23514]. > We get this error: > "spec.local: Forbidden: Local volumes are disabled by feature-gate, > metadata.annotations: Required value: Local volume requires node affinity" > This is probably due to this: > [https://github.com/rancher/rancher/issues/13864] which implies that we need > to update to 1.10+ as described in > [https://kubernetes.io/docs/concepts/storage/volumes/#local]. The Fabric8io > client is already updated in the PR mentioned at the beginning.
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784945#comment-16784945 ] Stavros Kontopoulos commented on SPARK-18057: - Sure I will open a jira and take it from there. > Update structured streaming kafka from 0.10.0.1 to 2.0.0 > > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger >Assignee: Ted Yu >Priority: Major > Fix For: 2.4.0 > > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html
[jira] [Commented] (SPARK-26727) CREATE OR REPLACE VIEW query fails with TableAlreadyExistsException
[ https://issues.apache.org/jira/browse/SPARK-26727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16784949#comment-16784949 ] Xiao Li commented on SPARK-26727: - Hi all, this could happen since the whole DDL is not atomic. For example, if the connection is broken after an attempt to create a table in the Hive metastore, we do not know whether the table has been created. Thus, we will still try to recreate the table. > CREATE OR REPLACE VIEW query fails with TableAlreadyExistsException > --- > > Key: SPARK-26727 > URL: https://issues.apache.org/jira/browse/SPARK-26727 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Srinivas Yarra >Priority: Major > > We experienced that sometimes the Hive query "CREATE OR REPLACE VIEW <view name> > AS SELECT <columns> FROM <table>" fails with the following exception: > {code:java} > // code placeholder > org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or > view '<view name>' already exists in database 'default'; at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:314) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:165) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195) at > org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195) at > org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364) at > org.apache.spark.sql.Dataset.(Dataset.scala:195) at > org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:80) at > org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642) ... 49 elided > {code} > {code} > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res1: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res2: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res3: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res4: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res5: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res6: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res7: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res8: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res9: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res10: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res11: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") > 
org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or > view 'testsparkreplace' already exists in database 'default'; at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:246) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:236) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:236) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) > at > org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:236) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94) > at >
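Until a catalog that guarantees atomicity is available, a client-side workaround for the non-atomic drop-then-create sequence is to treat TableAlreadyExistsException as retryable rather than fatal: a "ghost" entry left by a lost acknowledgement gets dropped on the next attempt. A generic sketch of that retry pattern (the catalog class and exception below are stand-ins for illustration, not Spark's API):

```python
class TableAlreadyExistsError(Exception):
    pass

class GhostlyCatalog:
    """Stand-in for a non-atomic metastore: the first drop is 'lost'
    (acknowledged but not applied), so the following create collides."""
    def __init__(self, lost_drops=1):
        self.views = set()
        self._lost_drops = lost_drops

    def drop_view(self, name):
        if self._lost_drops > 0:
            self._lost_drops -= 1
            return  # ack'd but not actually applied
        self.views.discard(name)

    def create_view(self, name):
        if name in self.views:
            raise TableAlreadyExistsError(name)
        self.views.add(name)

def create_or_replace_view(catalog, name, retries=3):
    """Client-side CREATE OR REPLACE against a non-atomic catalog:
    drop, create, and retry on 'already exists' instead of failing."""
    for _ in range(retries):
        catalog.drop_view(name)
        try:
            catalog.create_view(name)
            return
        except TableAlreadyExistsError:
            continue  # a stale/ghost entry survived the drop; try again
    raise RuntimeError("could not replace view: " + name)
```

This only papers over the race; the real fix, as noted above, is an atomic CREATE OR REPLACE in the new catalog API.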
[jira] [Updated] (SPARK-26742) Bump Kubernetes Client Version to 4.1.2
[ https://issues.apache.org/jira/browse/SPARK-26742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shane knapp updated SPARK-26742: Summary: Bump Kubernetes Client Version to 4.1.2 (was: Bump Kubernetes Client Version to 4.1.1) > Bump Kubernetes Client Version to 4.1.2 > --- > > Key: SPARK-26742 > URL: https://issues.apache.org/jira/browse/SPARK-26742 > Project: Spark > Issue Type: Dependency upgrade > Components: Kubernetes >Affects Versions: 2.4.0, 3.0.0 >Reporter: Steve Davids >Priority: Major > Labels: easyfix > Fix For: 3.0.0 > > > Spark 2.x is using Kubernetes Client 3.x which is pretty old, the master > branch has 4.0, the client should be upgraded to 4.1.1 to have the broadest > Kubernetes compatibility support for newer clusters: > https://github.com/fabric8io/kubernetes-client#compatibility-matrix -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27025) Speed up toLocalIterator
[ https://issues.apache.org/jira/browse/SPARK-27025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784892#comment-16784892 ] Sean Owen commented on SPARK-27025: --- You'll want to cache() the thing you call toLocalIterator() on no matter what in this case. If it's not helping, then I think the delay remains the transferring of data to the driver, as it will all be computed and cached before you start. The 2-at-a-time implementation could help that and I'd be curious if it works out. > Speed up toLocalIterator > > > Key: SPARK-27025 > URL: https://issues.apache.org/jira/browse/SPARK-27025 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.3.3 >Reporter: Erik van Oosten >Priority: Major > > Method {{toLocalIterator}} fetches the partitions to the driver one by one. > However, as far as I can see, any required computation for the > yet-to-be-fetched-partitions is not kicked off until it is fetched. > Effectively only one partition is being computed at the same time. > Desired behavior: immediately start calculation of all partitions while > retaining the download-a-partition at a time behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
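The "compute the next partition while the current one is transferred" idea discussed above can be sketched without Spark at all. Below is a plain-Python illustration of the 2-at-a-time scheme, where each partition is modeled as a thunk and a background thread keeps one partition materialized ahead of the consumer; the function name and queue size are illustrative assumptions, not Spark's actual toLocalIterator implementation:

```python
import queue
import threading

def prefetching_iterator(partition_thunks, prefetch=1):
    """Yield rows partition by partition while a background thread
    materializes up to `prefetch` partitions ahead of the consumer."""
    buf = queue.Queue(maxsize=prefetch)
    sentinel = object()

    def producer():
        for compute in partition_thunks:
            buf.put(compute())  # blocks once `prefetch` partitions are ready
        buf.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        part = buf.get()
        if part is sentinel:
            return
        yield from part

# Each "partition" is a thunk standing in for a partition computation.
parts = [lambda i=i: [i * 10 + j for j in range(3)] for i in range(3)]
rows = list(prefetching_iterator(parts))  # [0, 1, 2, 10, 11, 12, 20, 21, 22]
```

With `prefetch=1`, partition N+1 is being computed while partition N is consumed, which is exactly the overlap the comment suggests could hide the driver-transfer latency.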
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784914#comment-16784914 ] Sean Owen commented on SPARK-18057: --- [~skonto] go for it. I lost the context on this one but if we need to further update the Kafka client or clarify docs, that's good for Spark 3. > Update structured streaming kafka from 0.10.0.1 to 2.0.0 > > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger >Assignee: Ted Yu >Priority: Major > Fix For: 2.4.0 > > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26742) Bump Kubernetes Client Version to 4.1.1
[ https://issues.apache.org/jira/browse/SPARK-26742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784911#comment-16784911 ] shane knapp commented on SPARK-26742: - any idea who might be doing the PR to bump the client to 4.1.2? > Bump Kubernetes Client Version to 4.1.1 > --- > > Key: SPARK-26742 > URL: https://issues.apache.org/jira/browse/SPARK-26742 > Project: Spark > Issue Type: Dependency upgrade > Components: Kubernetes >Affects Versions: 2.4.0, 3.0.0 >Reporter: Steve Davids >Priority: Major > Labels: easyfix > Fix For: 3.0.0 > > > Spark 2.x is using Kubernetes Client 3.x which is pretty old, the master > branch has 4.0, the client should be upgraded to 4.1.1 to have the broadest > Kubernetes compatibility support for newer clusters: > https://github.com/fabric8io/kubernetes-client#compatibility-matrix -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27015) spark-submit does not properly escape arguments sent to Mesos dispatcher
[ https://issues.apache.org/jira/browse/SPARK-27015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-27015. Resolution: Fixed Assignee: Martin Loncaric Fix Version/s: (was: 2.5.0) > spark-submit does not properly escape arguments sent to Mesos dispatcher > > > Key: SPARK-27015 > URL: https://issues.apache.org/jira/browse/SPARK-27015 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.3.3, 2.4.0 >Reporter: Martin Loncaric >Assignee: Martin Loncaric >Priority: Major > Fix For: 3.0.0 > > > Arguments sent to the dispatcher must be escaped; for instance, > {noformat}spark-submit --master mesos://url:port my.jar --arg1 "a > b$c"{noformat} > fails, and instead must be submitted as > {noformat}spark-submit --master mesos://url:port my.jar --arg1 "a\\ > b\\$c"{noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
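The underlying fix for this class of bug is to shell-quote each user argument before embedding it in the command line the dispatcher builds. A minimal sketch of the idea in Python using the standard library's `shlex.quote` (purely illustrative: the command template and helper name below are assumptions, not the code path Spark's Mesos scheduler actually uses):

```python
import shlex

def build_submit_command(jar, args):
    """Assemble a spark-submit-style command line, quoting each user
    argument so whitespace and shell metacharacters survive intact."""
    quoted = " ".join(shlex.quote(a) for a in args)
    return f"spark-submit --master mesos://url:port {shlex.quote(jar)} {quoted}"

# Without quoting, "a b$c" would be split on the space and "$c" expanded
# by the shell on the dispatcher side.
cmd = build_submit_command("my.jar", ["--arg1", "a b$c"])
# cmd == "spark-submit --master mesos://url:port my.jar --arg1 'a b$c'"
```

Quoting at assembly time removes the need for users to hand-escape arguments the way the description shows.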
[jira] [Comment Edited] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784900#comment-16784900 ] Stavros Kontopoulos edited comment on SPARK-18057 at 3/5/19 9:05 PM: - [~srowen] It seems the upgrade solves this issue: http://apache-spark-developers-list.1001551.n3.nabble.com/Question-about-upgrading-Kafka-client-version-td21140.html. If so, shouldn't we update the docs about the heartbeat timeout? It seems confusing right now why structured streaming does not require setting several parameters compared to the DStreams API. was (Author: skonto): [~srowen] It seems the upgrade solves this issue: http://apache-spark-developers-list.1001551.n3.nabble.com/Question-about-upgrading-Kafka-client-version-td21140.html. If so shouldnt we update the docs about the heartbeat timeout. > Update structured streaming kafka from 0.10.0.1 to 2.0.0 > > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger >Assignee: Ted Yu >Priority: Major > Fix For: 2.4.0 > > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784900#comment-16784900 ] Stavros Kontopoulos commented on SPARK-18057: - [~srowen] It seems the upgrade solves this issue: http://apache-spark-developers-list.1001551.n3.nabble.com/Question-about-upgrading-Kafka-client-version-td21140.html. If so, shouldn't we update the docs about the heartbeat issue? > Update structured streaming kafka from 0.10.0.1 to 2.0.0 > > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger >Assignee: Ted Yu >Priority: Major > Fix For: 2.4.0 > > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27021) Leaking Netty event loop group for shuffle chunk fetch requests
[ https://issues.apache.org/jira/browse/SPARK-27021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-27021. Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23930 [https://github.com/apache/spark/pull/23930] > Leaking Netty event loop group for shuffle chunk fetch requests > --- > > Key: SPARK-27021 > URL: https://issues.apache.org/jira/browse/SPARK-27021 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 2.4.1, 3.0.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > Fix For: 3.0.0 > > > The extra event loop group created for handling shuffle chunk fetch requests > is never closed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27021) Leaking Netty event loop group for shuffle chunk fetch requests
[ https://issues.apache.org/jira/browse/SPARK-27021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-27021: -- Assignee: Attila Zsolt Piros > Leaking Netty event loop group for shuffle chunk fetch requests > --- > > Key: SPARK-27021 > URL: https://issues.apache.org/jira/browse/SPARK-27021 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 2.4.1, 3.0.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > > The extra event loop group created for handling shuffle chunk fetch requests > is never closed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26932) Add a warning for Hive 2.1.1 ORC reader issue
[ https://issues.apache.org/jira/browse/SPARK-26932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784858#comment-16784858 ] Dongjoon Hyun commented on SPARK-26932: --- Thank you, [~haiboself]. I added you to the Apache Spark contributor group. > Add a warning for Hive 2.1.1 ORC reader issue > - > > Key: SPARK-26932 > URL: https://issues.apache.org/jira/browse/SPARK-26932 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0 >Reporter: Bo Hai >Assignee: Bo Hai >Priority: Minor > Fix For: 2.4.2, 3.0.0 > > > As of Spark 2.3 and Hive 2.3, both support using apache/orc as the ORC writer > and reader. Older versions of Hive implemented their own ORC reader, which isn't > forward-compatible. > So Hive 2.2 and older cannot read ORC tables created by Spark 2.3 and newer, > which use apache/orc instead of Hive's ORC reader. > I think we should add this information to the Spark 2.4 ORC documentation page > : https://spark.apache.org/docs/2.4.0/sql-data-sources-orc.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26944) Python unit-tests.log not available in artifacts for a build in Jenkins
[ https://issues.apache.org/jira/browse/SPARK-26944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784853#comment-16784853 ] Alessandro Bellina commented on SPARK-26944: [~shaneknapp] nice!! thank you > Python unit-tests.log not available in artifacts for a build in Jenkins > --- > > Key: SPARK-26944 > URL: https://issues.apache.org/jira/browse/SPARK-26944 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Alessandro Bellina >Assignee: shane knapp >Priority: Minor > Attachments: Screen Shot 2019-03-05 at 12.08.43 PM.png > > > I had a pr where the python unit tests failed. The tests point at the > `/home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log` file, > but I can't get to that from jenkins UI it seems (are all prs writing to the > same file?). > {code:java} > > Running PySpark tests > > Running PySpark tests. Output is in > /home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log{code} > For reference, please see this build: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102518/console > This Jira is to make it available under the artifacts for each build. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26944) Python unit-tests.log not available in artifacts for a build in Jenkins
[ https://issues.apache.org/jira/browse/SPARK-26944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784852#comment-16784852 ] shane knapp commented on SPARK-26944: - added a glob to store these (see attached image). !Screen Shot 2019-03-05 at 12.08.43 PM.png! > Python unit-tests.log not available in artifacts for a build in Jenkins > --- > > Key: SPARK-26944 > URL: https://issues.apache.org/jira/browse/SPARK-26944 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Alessandro Bellina >Assignee: shane knapp >Priority: Minor > Attachments: Screen Shot 2019-03-05 at 12.08.43 PM.png > > > I had a pr where the python unit tests failed. The tests point at the > `/home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log` file, > but I can't get to that from jenkins UI it seems (are all prs writing to the > same file?). > {code:java} > > Running PySpark tests > > Running PySpark tests. Output is in > /home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log{code} > For reference, please see this build: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102518/console > This Jira is to make it available under the artifacts for each build. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26944) Python unit-tests.log not available in artifacts for a build in Jenkins
[ https://issues.apache.org/jira/browse/SPARK-26944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784856#comment-16784856 ] shane knapp commented on SPARK-26944: - ill confirm that this works after the current PRB builds finish before closing this. > Python unit-tests.log not available in artifacts for a build in Jenkins > --- > > Key: SPARK-26944 > URL: https://issues.apache.org/jira/browse/SPARK-26944 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Alessandro Bellina >Assignee: shane knapp >Priority: Minor > Attachments: Screen Shot 2019-03-05 at 12.08.43 PM.png > > > I had a pr where the python unit tests failed. The tests point at the > `/home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log` file, > but I can't get to that from jenkins UI it seems (are all prs writing to the > same file?). > {code:java} > > Running PySpark tests > > Running PySpark tests. Output is in > /home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log{code} > For reference, please see this build: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102518/console > This Jira is to make it available under the artifacts for each build. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27025) Speed up toLocalIterator
[ https://issues.apache.org/jira/browse/SPARK-27025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik van Oosten resolved SPARK-27025. - Resolution: Incomplete > Speed up toLocalIterator > > > Key: SPARK-27025 > URL: https://issues.apache.org/jira/browse/SPARK-27025 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.3.3 >Reporter: Erik van Oosten >Priority: Major > > Method {{toLocalIterator}} fetches the partitions to the driver one by one. > However, as far as I can see, any required computation for the > yet-to-be-fetched-partitions is not kicked off until it is fetched. > Effectively only one partition is being computed at the same time. > Desired behavior: immediately start calculation of all partitions while > retaining the download-a-partition at a time behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26932) Add a warning for Hive 2.1.1 ORC reader issue
[ https://issues.apache.org/jira/browse/SPARK-26932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-26932. --- Resolution: Fixed Assignee: Bo Hai Fix Version/s: 3.0.0 2.4.2 This is resolved via https://github.com/apache/spark/commit/c27caead43423d1f994f42502496d57ea8389dc0 . > Add a warning for Hive 2.1.1 ORC reader issue > - > > Key: SPARK-26932 > URL: https://issues.apache.org/jira/browse/SPARK-26932 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0 >Reporter: Bo Hai >Assignee: Bo Hai >Priority: Minor > Fix For: 2.4.2, 3.0.0 > > > As of Spark 2.3 and Hive 2.3, both support using apache/orc as the ORC writer > and reader. Older versions of Hive implemented their own ORC reader, which isn't > forward-compatible. > So Hive 2.2 and older cannot read ORC tables created by Spark 2.3 and newer, > which use apache/orc instead of Hive's ORC reader. > I think we should add this information to the Spark 2.4 ORC documentation page > : https://spark.apache.org/docs/2.4.0/sql-data-sources-orc.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27025) Speed up toLocalIterator
[ https://issues.apache.org/jira/browse/SPARK-27025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784854#comment-16784854 ] Erik van Oosten commented on SPARK-27025: - If there is no obvious way to improve Spark, then it's probably better to close this issue until someone finds a better angle. BTW, the cache/count/iterate/unpersist cycle did not make it faster for my use case. I will try the 2-partition implementation of toLocalIterator. [~srowen], [~hyukjin.kwon], thanks for your input! > Speed up toLocalIterator > > > Key: SPARK-27025 > URL: https://issues.apache.org/jira/browse/SPARK-27025 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.3.3 >Reporter: Erik van Oosten >Priority: Major > > Method {{toLocalIterator}} fetches the partitions to the driver one by one. > However, as far as I can see, any required computation for the > yet-to-be-fetched-partitions is not kicked off until it is fetched. > Effectively only one partition is being computed at the same time. > Desired behavior: immediately start calculation of all partitions while > retaining the download-a-partition at a time behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26932) Add a warning for Hive 2.1.1 ORC reader issue
[ https://issues.apache.org/jira/browse/SPARK-26932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26932: -- Summary: Add a warning for Hive 2.1.1 ORC reader issue (was: Orc compatibility between hive and spark) > Add a warning for Hive 2.1.1 ORC reader issue > - > > Key: SPARK-26932 > URL: https://issues.apache.org/jira/browse/SPARK-26932 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0 >Reporter: Bo Hai >Priority: Minor > > As of Spark 2.3 and Hive 2.3, both support using apache/orc as the ORC writer > and reader. Older versions of Hive implemented their own ORC reader, which isn't > forward-compatible. > So Hive 2.2 and older cannot read ORC tables created by Spark 2.3 and newer, > which use apache/orc instead of Hive's ORC reader. > I think we should add this information to the Spark 2.4 ORC documentation page > : https://spark.apache.org/docs/2.4.0/sql-data-sources-orc.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26944) Python unit-tests.log not available in artifacts for a build in Jenkins
[ https://issues.apache.org/jira/browse/SPARK-26944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shane knapp updated SPARK-26944: Attachment: Screen Shot 2019-03-05 at 12.08.43 PM.png > Python unit-tests.log not available in artifacts for a build in Jenkins > --- > > Key: SPARK-26944 > URL: https://issues.apache.org/jira/browse/SPARK-26944 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Alessandro Bellina >Assignee: shane knapp >Priority: Minor > Attachments: Screen Shot 2019-03-05 at 12.08.43 PM.png > > > I had a pr where the python unit tests failed. The tests point at the > `/home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log` file, > but I can't get to that from jenkins UI it seems (are all prs writing to the > same file?). > {code:java} > > Running PySpark tests > > Running PySpark tests. Output is in > /home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log{code} > For reference, please see this build: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102518/console > This Jira is to make it available under the artifacts for each build. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26944) Python unit-tests.log not available in artifacts for a build in Jenkins
[ https://issues.apache.org/jira/browse/SPARK-26944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shane knapp reassigned SPARK-26944: --- Assignee: shane knapp > Python unit-tests.log not available in artifacts for a build in Jenkins > --- > > Key: SPARK-26944 > URL: https://issues.apache.org/jira/browse/SPARK-26944 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Alessandro Bellina >Assignee: shane knapp >Priority: Minor > > I had a pr where the python unit tests failed. The tests point at the > `/home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log` file, > but I can't get to that from jenkins UI it seems (are all prs writing to the > same file?). > {code:java} > > Running PySpark tests > > Running PySpark tests. Output is in > /home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log{code} > For reference, please see this build: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102518/console > This Jira is to make it available under the artifacts for each build. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23961) pyspark toLocalIterator throws an exception
[ https://issues.apache.org/jira/browse/SPARK-23961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784843#comment-16784843 ] Bryan Cutler commented on SPARK-23961: -- I could also reproduce with a nearly identical error using the following {code} import time from pyspark.sql import SparkSession from pyspark.sql.functions import rand, udf from pyspark.sql.types import * spark = SparkSession\ .builder\ .appName("toLocalIterator_Test")\ .getOrCreate() df = spark.range(1 << 16).select(rand()) it = df.toLocalIterator() print(next(it)) it = None time.sleep(5) spark.stop() {code} I think there are a couple issues with the way this is currently working. When toLocalIterator is called in Python, the Scala side also creates a local iterator which immediately starts a loop to consume the entire iterator and write it all to Python without any synchronization with the Python iterator. Blocking the write operation only happens when the socket receive buffer is full. Small examples work fine if the data all fits in the read buffer, but the above code fails because the writing becomes blocked, then the Python iterator stops reading and closes the connection, which the Scala side sees as an error. I can work on a fix for this. > pyspark toLocalIterator throws an exception > --- > > Key: SPARK-23961 > URL: https://issues.apache.org/jira/browse/SPARK-23961 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0 >Reporter: Michel Lemay >Priority: Minor > Labels: DataFrame, pyspark > > Given a dataframe and use toLocalIterator. 
If we do not consume all records, > it will throw: > {quote}ERROR PythonRDD: Error while sending iterator > java.net.SocketException: Connection reset by peer: socket write error > at java.net.SocketOutputStream.socketWrite0(Native Method) > at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111) > at java.net.SocketOutputStream.write(SocketOutputStream.java:155) > at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122) > at java.io.DataOutputStream.write(DataOutputStream.java:107) > at java.io.FilterOutputStream.write(FilterOutputStream.java:97) > at > org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:497) > at > org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509) > at > org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509) > at > org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:705) > at > org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705) > at > org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337) > at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:706) > {quote} > > To reproduce, here is a simple pyspark shell script that shows the error: > {quote}import itertools > df = spark.read.parquet("large parquet folder").cache() > print(df.count()) > b = df.toLocalIterator() > print(len(list(itertools.islice(b, 20)))) > b = None # Make the iterator go out of scope. Throws here. > {quote} > > Observations: > * Consuming all records does not throw.
Taking only a subset of the > partitions creates the error. > * In another experiment, doing the same on a regular RDD works if we > cache/materialize it. If we do not cache the RDD, it throws similarly. > * It works in the Scala shell > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26947) Pyspark KMeans Clustering job fails on large values of k
[ https://issues.apache.org/jira/browse/SPARK-26947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Parth Gandhi resolved SPARK-26947. -- Resolution: Invalid > Pyspark KMeans Clustering job fails on large values of k > > > Key: SPARK-26947 > URL: https://issues.apache.org/jira/browse/SPARK-26947 > Project: Spark > Issue Type: Bug > Components: ML, MLlib, PySpark >Affects Versions: 2.4.0 >Reporter: Parth Gandhi >Priority: Minor > Attachments: clustering_app.py > > > We recently had a case where a user's pyspark job running KMeans clustering > was failing for large values of k. I was able to reproduce the same issue > with dummy dataset. I have attached the code as well as the data in the JIRA. > The stack trace is printed below from Java: > > {code:java} > Exception in thread "Thread-10" java.lang.OutOfMemoryError: Java heap space > at java.util.Arrays.copyOf(Arrays.java:3332) > at > java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124) > at > java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:649) > at java.lang.StringBuilder.append(StringBuilder.java:202) > at py4j.Protocol.getOutputCommand(Protocol.java:328) > at py4j.commands.CallCommand.execute(CallCommand.java:81) > at py4j.GatewayConnection.run(GatewayConnection.java:238) > at java.lang.Thread.run(Thread.java:748) > {code} > Python: > {code:java} > Traceback (most recent call last): > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1159, in send_command > raise Py4JNetworkError("Answer from Java side is empty") > py4j.protocol.Py4JNetworkError: Answer from Java side is empty > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 985, in send_command > response = connection.send_command(command) > File > 
"/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1164, in send_command > "Error while receiving", e, proto.ERROR_ON_RECEIVE) > py4j.protocol.Py4JNetworkError: Error while receiving > Traceback (most recent call last): > File "clustering_app.py", line 154, in > main(args) > File "clustering_app.py", line 145, in main > run_clustering(sc, args.input_path, args.output_path, > args.num_clusters_list) > File "clustering_app.py", line 136, in run_clustering > clustersTable, cluster_Centers = clustering(sc, documents, output_path, > k, max_iter) > File "clustering_app.py", line 68, in clustering > cluster_Centers = km_model.clusterCenters() > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/clustering.py", > line 337, in clusterCenters > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/wrapper.py", > line 55, in _call_java > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/common.py", > line 109, in _java2py > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1257, in __call__ > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/sql/utils.py", > line 63, in deco > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/protocol.py", > line 336, in get_return_value > py4j.protocol.Py4JError: An error occurred while calling > z:org.apache.spark.ml.python.MLSerDe.dumps > {code} > The command with which the application was launched is given below: > {code:java} > $SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster --conf > spark.executor.memory=20g --conf spark.driver.memory=20g --conf > spark.executor.memoryOverhead=4g 
--conf spark.driver.memoryOverhead=4g --conf > spark.kryoserializer.buffer.max=2000m --conf spark.driver.maxResultSize=12g > ~/clustering_app.py --input_path hdfs:///user/username/part-v001x > --output_path hdfs:///user/username --num_clusters_list 1 > {code} > The input dataset is approximately 90 MB in size and the assigned heap memory > to both driver and executor is close to 20 GB. This only happens for large > values of k. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
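The failure mode above is plausible from first principles: `clusterCenters()` ships k dense centers of d doubles back to Python through py4j as one buffer built in a single JVM string builder, so the payload grows as k × d × 8 bytes. A back-of-the-envelope check (the k and d values below are illustrative assumptions, not the reporter's exact figures):

```python
# Rough, illustrative estimate of the memory needed to ship KMeans cluster
# centers to the driver through py4j in one buffer. The figures are
# assumptions for illustration, not taken from the reporter's dataset.
BYTES_PER_DOUBLE = 8

def center_payload_bytes(k, dims):
    """Approximate payload size for k dense cluster centers of `dims` features."""
    return k * dims * BYTES_PER_DOUBLE

# e.g. 50,000 clusters over a 100,000-feature space is already tens of GB,
# far beyond what a single JVM StringBuilder/heap can absorb.
payload = center_payload_bytes(50_000, 100_000)
print(payload / 1024**3)  # size in GiB
```

This is why the job only fails for large k: the dataset itself is small, but the driver-side serialized result is not.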
[jira] [Commented] (SPARK-26947) Pyspark KMeans Clustering job fails on large values of k
[ https://issues.apache.org/jira/browse/SPARK-26947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784831#comment-16784831 ] Parth Gandhi commented on SPARK-26947: -- [~srowen] Yes your suggestion to limit the vocab size helps. Closing this JIRA. Thank you. > Pyspark KMeans Clustering job fails on large values of k > > > Key: SPARK-26947 > URL: https://issues.apache.org/jira/browse/SPARK-26947 > Project: Spark > Issue Type: Bug > Components: ML, MLlib, PySpark >Affects Versions: 2.4.0 >Reporter: Parth Gandhi >Priority: Minor > Attachments: clustering_app.py > > > We recently had a case where a user's pyspark job running KMeans clustering > was failing for large values of k. I was able to reproduce the same issue > with dummy dataset. I have attached the code as well as the data in the JIRA. > The stack trace is printed below from Java: > > {code:java} > Exception in thread "Thread-10" java.lang.OutOfMemoryError: Java heap space > at java.util.Arrays.copyOf(Arrays.java:3332) > at > java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124) > at > java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:649) > at java.lang.StringBuilder.append(StringBuilder.java:202) > at py4j.Protocol.getOutputCommand(Protocol.java:328) > at py4j.commands.CallCommand.execute(CallCommand.java:81) > at py4j.GatewayConnection.run(GatewayConnection.java:238) > at java.lang.Thread.run(Thread.java:748) > {code} > Python: > {code:java} > Traceback (most recent call last): > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1159, in send_command > raise Py4JNetworkError("Answer from Java side is empty") > py4j.protocol.Py4JNetworkError: Answer from Java side is empty > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File > 
"/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 985, in send_command > response = connection.send_command(command) > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1164, in send_command > "Error while receiving", e, proto.ERROR_ON_RECEIVE) > py4j.protocol.Py4JNetworkError: Error while receiving > Traceback (most recent call last): > File "clustering_app.py", line 154, in > main(args) > File "clustering_app.py", line 145, in main > run_clustering(sc, args.input_path, args.output_path, > args.num_clusters_list) > File "clustering_app.py", line 136, in run_clustering > clustersTable, cluster_Centers = clustering(sc, documents, output_path, > k, max_iter) > File "clustering_app.py", line 68, in clustering > cluster_Centers = km_model.clusterCenters() > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/clustering.py", > line 337, in clusterCenters > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/wrapper.py", > line 55, in _call_java > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/common.py", > line 109, in _java2py > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1257, in __call__ > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/sql/utils.py", > line 63, in deco > File > "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/protocol.py", > line 336, in get_return_value > py4j.protocol.Py4JError: An error occurred while calling > z:org.apache.spark.ml.python.MLSerDe.dumps > {code} > The command with which the application was launched is given below: 
> {code:java} > $SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster --conf > spark.executor.memory=20g --conf spark.driver.memory=20g --conf > spark.executor.memoryOverhead=4g --conf spark.driver.memoryOverhead=4g --conf > spark.kryoserializer.buffer.max=2000m --conf spark.driver.maxResultSize=12g > ~/clustering_app.py --input_path hdfs:///user/username/part-v001x > --output_path hdfs:///user/username --num_clusters_list 1 > {code} > The input dataset is approximately 90 MB in size and the assigned heap memory > to both driver and executor is close to 20 GB. This only happens for large > values of k.
[jira] [Updated] (SPARK-27043) Add ORC nested schema pruning benchmarks
[ https://issues.apache.org/jira/browse/SPARK-27043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27043: -- Summary: Add ORC nested schema pruning benchmarks (was: Nested schema pruning benchmark for ORC) > Add ORC nested schema pruning benchmarks > > > Key: SPARK-27043 > URL: https://issues.apache.org/jira/browse/SPARK-27043 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 3.0.0 > > > We have a benchmark for nested schema pruning, but only for Parquet. This adds a > similar benchmark for ORC, to be used with ORC's nested schema pruning.
[jira] [Resolved] (SPARK-27043) Nested schema pruning benchmark for ORC
[ https://issues.apache.org/jira/browse/SPARK-27043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27043. --- Resolution: Fixed Assignee: Liang-Chi Hsieh Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/23955 > Nested schema pruning benchmark for ORC > --- > > Key: SPARK-27043 > URL: https://issues.apache.org/jira/browse/SPARK-27043 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 3.0.0 > > > We have a benchmark for nested schema pruning, but only for Parquet. This adds a > similar benchmark for ORC, to be used with ORC's nested schema pruning.
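For readers unfamiliar with the feature being benchmarked: nested schema pruning means reading only the leaves of a nested type that the query actually touches. A toy model of the idea (an illustration only, not the Catalyst or ORC reader implementation):

```python
# Illustrative-only sketch of what nested schema pruning does: given a
# nested schema and the dotted field paths a query touches, keep only
# those leaves. Toy model, not Spark's implementation.
def prune(schema, required):
    """schema: {field: subschema-or-None}; required: iterable of dotted paths."""
    out = {}
    for path in required:
        head, _, rest = path.partition(".")
        if head not in schema:
            continue
        if rest:
            child = prune(schema[head] or {}, [rest])
            out.setdefault(head, {}).update(child)
        else:
            out[head] = schema[head]
    return out

schema = {"name": None, "address": {"city": None, "zip": None}}
print(prune(schema, ["address.city"]))  # {'address': {'city': None}}
```

The benchmark added here measures how much I/O this saves for ORC, as the existing benchmark already does for Parquet.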
[jira] [Commented] (SPARK-26998) spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor processes in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-26998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784791#comment-16784791 ] t oo commented on SPARK-26998: -- [~gsomogyi] please take it forward. [~kabhwan] truststore password being shown is not much of a problem since truststore is often distributed to users anyway. But keystore password still being shown is the big no-no. > spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor > processes in Standalone mode > --- > > Key: SPARK-26998 > URL: https://issues.apache.org/jira/browse/SPARK-26998 > Project: Spark > Issue Type: Bug > Components: Scheduler, Security, Spark Core >Affects Versions: 2.3.3, 2.4.0 >Reporter: t oo >Priority: Major > Labels: SECURITY, Security, secur, security, security-issue > > Run spark standalone mode, then start a spark-submit requiring at least 1 > executor. Do a 'ps -ef' on linux (ie putty terminal) and you will be able to > see spark.ssl.keyStorePassword value in plaintext! > > spark.ssl.keyStorePassword and spark.ssl.keyPassword don't need to be passed > to CoarseGrainedExecutorBackend. Only spark.ssl.trustStorePassword is used. > > Can be resolved if below PR is merged: > [[Github] Pull Request #21514 > (tooptoop4)|https://github.com/apache/spark/pull/21514] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
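The leak described above needs no cluster to demonstrate: process arguments are world-readable via `ps`/`/proc`, so any password passed on the executor command line is exposed. A minimal simulation, with a made-up command line and placeholder passwords:

```python
# Simulation (made-up command line, placeholder secrets) of what any local
# user recovers from 'ps -ef' output when SSL passwords are passed as JVM
# options to the executor process in standalone mode.
import re

cmdline = ("java -cp ... org.apache.spark.executor.CoarseGrainedExecutorBackend "
           "-Dspark.ssl.keyStorePassword=hunter2 "
           "-Dspark.ssl.trustStorePassword=changeit")

# Equivalent of: ps -ef | grep -o 'spark.ssl.*[Pp]assword=...'
leaked = re.findall(r"spark\.ssl\.\w*[Pp]assword=\S+", cmdline)
print(leaked)
```

As the comment notes, the truststore password leaking is less severe (the truststore is often distributed anyway), but the keystore password must not appear here at all.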
[jira] [Commented] (SPARK-13091) Rewrite/Propagate constraints for Aliases
[ https://issues.apache.org/jira/browse/SPARK-13091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784776#comment-16784776 ] Ajith S commented on SPARK-13091: - Can this document be made accessible? [https://docs.google.com/document/d/1WQRgDurUBV9Y6CWOBS75PQIqJwT-6WftVa18xzm7nCo/edit#heading=h.6hjcndo36qze] > Rewrite/Propagate constraints for Aliases > - > > Key: SPARK-13091 > URL: https://issues.apache.org/jira/browse/SPARK-13091 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Sameer Agarwal >Assignee: Sameer Agarwal >Priority: Major > Fix For: 2.0.0 > > > We'd want to duplicate constraints when there is an alias (i.e. for "SELECT > a, a AS b", any constraints on a now apply to b) > This is a follow-up task based on [~marmbrus]'s suggestion in > https://docs.google.com/document/d/1WQRgDurUBV9Y6CWOBS75PQIqJwT-6WftVa18xzm7nCo/edit#heading=h.6hjcndo36qze
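The description's `SELECT a, a AS b` example can be modeled in a few lines: any predicate known to hold on `a` also holds on `b`. The function below is a hypothetical illustration of constraint propagation, not Catalyst's actual API:

```python
# Toy illustration (invented names, not Catalyst's API) of propagating
# constraints across aliases: for "SELECT a, a AS b", every constraint on
# the source column is duplicated for the alias.
def propagate_constraints(constraints, aliases):
    """constraints: set of (column, predicate) pairs; aliases: {alias: source}."""
    out = set(constraints)
    for alias, source in aliases.items():
        for col, pred in constraints:
            if col == source:
                out.add((alias, pred))
    return out

result = propagate_constraints({("a", "IS NOT NULL")}, {"b": "a"})
print(sorted(result))  # constraints now cover both 'a' and 'b'
```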
[jira] [Resolved] (SPARK-26928) Add driver CPU Time to the metrics system
[ https://issues.apache.org/jira/browse/SPARK-26928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-26928. Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23838 [https://github.com/apache/spark/pull/23838] > Add driver CPU Time to the metrics system > - > > Key: SPARK-26928 > URL: https://issues.apache.org/jira/browse/SPARK-26928 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Luca Canali >Assignee: Luca Canali >Priority: Minor > Fix For: 3.0.0 > > > This proposes to add instrumentation for the driver's JVM CPU time via the > Spark Dropwizard/Codahale metrics system. It follows directly from previous > work SPARK-25228 and shares similar motivations: it is intended as an > improvement to be used for Spark performance dashboards and monitoring > tools/instrumentation. > Additionally this proposes a new configuration parameter > `spark.metrics.cpu.time.driver.enabled` (default: false) that can be used to > turn on the new feature.
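The idea behind the new metric is a gauge that reports cumulative process CPU time. Spark's implementation is a Dropwizard gauge reading the driver JVM's CPU time; the following is only an analogous sketch in Python using `os.times()`:

```python
# Minimal sketch of the idea behind the new driver metric: a gauge that
# reports cumulative process CPU time on each poll. Spark's actual code is
# a Dropwizard/Codahale gauge in the JVM; this is an analogous toy only.
import os

class CpuTimeGauge:
    def value(self):
        t = os.times()
        # user + system CPU time consumed by this process, in seconds
        return t.user + t.system

gauge = CpuTimeGauge()
print(gauge.value() >= 0.0)
```

A monitoring sink then polls `value()` periodically, which is what makes the metric useful for performance dashboards.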
[jira] [Assigned] (SPARK-26928) Add driver CPU Time to the metrics system
[ https://issues.apache.org/jira/browse/SPARK-26928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-26928: -- Assignee: Luca Canali > Add driver CPU Time to the metrics system > - > > Key: SPARK-26928 > URL: https://issues.apache.org/jira/browse/SPARK-26928 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Luca Canali >Assignee: Luca Canali >Priority: Minor > > This proposes to add instrumentation for the driver's JVM CPU time via the > Spark Dropwizard/Codahale metrics system. It follows directly from previous > work SPARK-25228 and shares similar motivations: it is intended as an > improvement to be used for Spark performance dashboards and monitoring > tools/instrumentation. > Additionally this proposes a new configuration parameter > `spark.metrics.cpu.time.driver.enabled` (default: false) that can be used to > turn on the new feature.
[jira] [Resolved] (SPARK-27012) Storage tab shows rdd details even after executor ended
[ https://issues.apache.org/jira/browse/SPARK-27012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-27012. Resolution: Fixed Assignee: Ajith S Fix Version/s: 3.0.0 > Storage tab shows rdd details even after executor ended > --- > > Key: SPARK-27012 > URL: https://issues.apache.org/jira/browse/SPARK-27012 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.3.3, 3.0.0 >Reporter: Ajith S >Assignee: Ajith S >Priority: Major > Fix For: 3.0.0 > > > After we cache a table, we can see its details in the Storage tab of the Spark UI. If > the executor has shut down (graceful shutdown / dynamic allocation scenario), the UI > still shows the RDD as cached, and clicking the link throws an error. > This is because, on the executor-removed event, we fail to adjust the RDD partition > details.
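The fix amounts to bookkeeping: when an executor is removed, its block replicas must be dropped from the RDD's partition-to-executor map, otherwise the Storage tab keeps reporting stale cache state. A toy model of that adjustment (not Spark's actual `AppStatusListener` code):

```python
# Toy model of the fix: on an executor-removed event, drop that executor's
# replicas from the RDD's partition bookkeeping; partitions with no
# surviving replica disappear, so the UI no longer shows them as cached.
def remove_executor(partition_locations, executor_id):
    """partition_locations: {partition: set(executor_ids)}; returns surviving map."""
    survivors = {}
    for part, execs in partition_locations.items():
        remaining = execs - {executor_id}
        if remaining:
            survivors[part] = remaining
    return survivors

locs = {0: {"exec-1"}, 1: {"exec-1", "exec-2"}}
print(remove_executor(locs, "exec-1"))  # {1: {'exec-2'}}
```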
[jira] [Resolved] (SPARK-27059) spark-submit on kubernetes cluster does not recognise k8s --master property
[ https://issues.apache.org/jira/browse/SPARK-27059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-27059. Resolution: Invalid > spark-submit on kubernetes cluster does not recognise k8s --master property > --- > > Key: SPARK-27059 > URL: https://issues.apache.org/jira/browse/SPARK-27059 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.3, 2.4.0 >Reporter: Andreas Adamides >Priority: Blocker > > I have successfully installed a Kubernetes cluster and can verify this by: > {{C:\windows\system32>kubectl cluster-info }} > {{*Kubernetes master is running at https://:* }} > *{{KubeDNS is running at > https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}* > Trying to run the SparkPi with the Spark release I downloaded from > [https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3) > *{{spark-submit --master k8s://https://: --deploy-mode cluster > --name spark-pi --class org.apache.spark.examples.SparkPi --conf > spark.executor.instances=2 --conf > spark.kubernetes.container.image=gettyimages/spark > c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}* > I am getting this error: > *{{Error: Master must either be yarn or start with spark, mesos, local Run > with --help for usage help or --verbose for debug output}}* > I also tried: > *{{spark-submit --help}}* > to see what I can get regarding the *--master* property. This is what I get: > *{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or > local.}}* > > According to the documentation > [[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on > running Spark workloads in Kubernetes, spark-submit does not even seem to > recognise the k8s value for master. 
[ included in possible Spark masters: > [https://spark.apache.org/docs/latest/submitting-applications.html#master-urls] > ] >
[jira] [Commented] (SPARK-27059) spark-submit on kubernetes cluster does not recognise k8s --master property
[ https://issues.apache.org/jira/browse/SPARK-27059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784762#comment-16784762 ] Marcelo Vanzin commented on SPARK-27059: Sounds like a problem with your system. Maybe your PATH has the wrong {{spark-submit}} in it. > spark-submit on kubernetes cluster does not recognise k8s --master property > --- > > Key: SPARK-27059 > URL: https://issues.apache.org/jira/browse/SPARK-27059 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.3, 2.4.0 >Reporter: Andreas Adamides >Priority: Blocker > > I have successfully installed a Kubernetes cluster and can verify this by: > {{C:\windows\system32>kubectl cluster-info }} > {{*Kubernetes master is running at https://:* }} > *{{KubeDNS is running at > https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}* > Trying to run the SparkPi with the Spark release I downloaded from > [https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3) > *{{spark-submit --master k8s://https://: --deploy-mode cluster > --name spark-pi --class org.apache.spark.examples.SparkPi --conf > spark.executor.instances=2 --conf > spark.kubernetes.container.image=gettyimages/spark > c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}* > I am getting this error: > *{{Error: Master must either be yarn or start with spark, mesos, local Run > with --help for usage help or --verbose for debug output}}* > I also tried: > *{{spark-submit --help}}* > to see what I can get regarding the *--master* property. This is what I get: > *{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or > local.}}* > > According to the documentation > [[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on > running Spark workloads in Kubernetes, spark-submit does not even seem to > recognise the k8s value for master. 
[ included in possible Spark masters: > [https://spark.apache.org/docs/latest/submitting-applications.html#master-urls] > ] >
[jira] [Resolved] (SPARK-27067) SPIP: Catalog API for table metadata
[ https://issues.apache.org/jira/browse/SPARK-27067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved SPARK-27067. --- Resolution: Fixed I'm resolving this issue because the vote to adopt the proposal passed. I've added links to the google doc proposal (now view-only) and vote thread, and uploaded a copy of the proposal as a PDF. > SPIP: Catalog API for table metadata > > > Key: SPARK-27067 > URL: https://issues.apache.org/jira/browse/SPARK-27067 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > Labels: SPIP > Attachments: SPIP_ Spark API for Table Metadata.pdf > > > Goal: Define a catalog API to create, alter, load, and drop tables
[jira] [Commented] (SPARK-27059) spark-submit on kubernetes cluster does not recognise k8s --master property
[ https://issues.apache.org/jira/browse/SPARK-27059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784758#comment-16784758 ] Andreas Adamides commented on SPARK-27059: -- Indeed, when in spark 2.4.0 and 2.3.3 running *spark-submit --version* returns "version 2.2.1" (as well as spark-shell) So if not from the official Spark Download Page, where would I download the latest advertised spark version that supports Kubernetes. > spark-submit on kubernetes cluster does not recognise k8s --master property > --- > > Key: SPARK-27059 > URL: https://issues.apache.org/jira/browse/SPARK-27059 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.3, 2.4.0 >Reporter: Andreas Adamides >Priority: Blocker > > I have successfully installed a Kubernetes cluster and can verify this by: > {{C:\windows\system32>kubectl cluster-info }} > {{*Kubernetes master is running at https://:* }} > *{{KubeDNS is running at > https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}* > Trying to run the SparkPi with the Spark release I downloaded from > [https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3) > *{{spark-submit --master k8s://https://: --deploy-mode cluster > --name spark-pi --class org.apache.spark.examples.SparkPi --conf > spark.executor.instances=2 --conf > spark.kubernetes.container.image=gettyimages/spark > c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}* > I am getting this error: > *{{Error: Master must either be yarn or start with spark, mesos, local Run > with --help for usage help or --verbose for debug output}}* > I also tried: > *{{spark-submit --help}}* > to see what I can get regarding the *--master* property. 
This is what I get: > *{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or > local.}}* > > According to the documentation > [[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on > running Spark workloads in Kubernetes, spark-submit does not even seem to > recognise the k8s value for master. [ included in possible Spark masters: > [https://spark.apache.org/docs/latest/submitting-applications.html#master-urls] > ] >
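The comment above pins down the cause: a stale pre-2.3 `spark-submit` earlier on the PATH rejects `k8s://` masters, which only exist from Spark 2.3 onward. A toy validator mirroring the reported error (an invented helper, not Spark's actual parser):

```python
# Toy check (not Spark's real master-URL parser) mirroring the error: a
# pre-2.3 spark-submit recognizes only spark/mesos/yarn/local schemes, so a
# stale binary on PATH rejects "k8s://..." exactly as the reporter saw.
RECOGNIZED = ("spark://", "mesos://", "yarn", "local", "k8s://")

def master_recognized(url, supports_k8s=True):
    schemes = RECOGNIZED if supports_k8s else RECOGNIZED[:-1]
    return any(url == s or url.startswith(s) for s in schemes)

print(master_recognized("k8s://https://host:6443", supports_k8s=False))  # False: old binary
print(master_recognized("k8s://https://host:6443", supports_k8s=True))   # True: 2.3+
```

Running `spark-submit --version` (as done above, which printed 2.2.1) is the quickest way to confirm which binary the shell is actually resolving.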
[jira] [Updated] (SPARK-27067) SPIP: Catalog API for table metadata
[ https://issues.apache.org/jira/browse/SPARK-27067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-27067: -- Attachment: SPIP_ Spark API for Table Metadata.pdf > SPIP: Catalog API for table metadata > > > Key: SPARK-27067 > URL: https://issues.apache.org/jira/browse/SPARK-27067 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > Labels: SPIP > Attachments: SPIP_ Spark API for Table Metadata.pdf > >
[jira] [Updated] (SPARK-27066) SPIP: Identifiers for multi-catalog support
[ https://issues.apache.org/jira/browse/SPARK-27066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-27066: -- Description: Goals: * Propose semantics for identifiers and a listing API to support multiple catalogs ** Support any namespace scheme used by an external catalog ** Avoid traversing namespaces via multiple listing calls from Spark * Outline migration from the current behavior to Spark with multiple catalogs > SPIP: Identifiers for multi-catalog support > --- > > Key: SPARK-27066 > URL: https://issues.apache.org/jira/browse/SPARK-27066 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > Labels: SPIP > Attachments: SPIP_ Identifiers for multi-catalog Spark.pdf > > > Goals: > * Propose semantics for identifiers and a listing API to support multiple > catalogs > ** Support any namespace scheme used by an external catalog > ** Avoid traversing namespaces via multiple listing calls from Spark > * Outline migration from the current behavior to Spark with multiple catalogs -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
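The listed goals can be made concrete with a small sketch of multi-part identifier resolution: the leading part may name a catalog, and the remainder is a namespace (of any depth the external catalog uses) plus a table name. The names below are invented for illustration, not the SPIP's actual rules:

```python
# Hypothetical sketch of resolving a multi-part identifier in a
# multi-catalog world: the first part may name a registered catalog, the
# rest is namespace + table. Invented for illustration only.
def parse_identifier(parts, catalogs):
    """parts: list like ["prod", "db", "tbl"]; catalogs: known catalog names."""
    if len(parts) > 1 and parts[0] in catalogs:
        catalog, rest = parts[0], parts[1:]
    else:
        catalog, rest = "default", parts
    return catalog, tuple(rest[:-1]), rest[-1]

print(parse_identifier(["prod", "db", "tbl"], {"prod"}))  # ('prod', ('db',), 'tbl')
print(parse_identifier(["db", "tbl"], {"prod"}))          # ('default', ('db',), 'tbl')
```

Note how the second call illustrates the migration concern: identifiers without a catalog part must keep resolving against a default catalog so existing queries do not break.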
[jira] [Created] (SPARK-27066) SPIP: Identifiers for multi-catalog support
Ryan Blue created SPARK-27066: - Summary: SPIP: Identifiers for multi-catalog support Key: SPARK-27066 URL: https://issues.apache.org/jira/browse/SPARK-27066 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Ryan Blue
[jira] [Updated] (SPARK-27067) SPIP: Catalog API for table metadata
[ https://issues.apache.org/jira/browse/SPARK-27067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-27067: -- Description: Goal: Define a catalog API to create, alter, load, and drop tables > SPIP: Catalog API for table metadata > > > Key: SPARK-27067 > URL: https://issues.apache.org/jira/browse/SPARK-27067 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > Labels: SPIP > Attachments: SPIP_ Spark API for Table Metadata.pdf > > > Goal: Define a catalog API to create, alter, load, and drop tables
[jira] [Created] (SPARK-27067) SPIP: Catalog API for table metadata
Ryan Blue created SPARK-27067: - Summary: SPIP: Catalog API for table metadata Key: SPARK-27067 URL: https://issues.apache.org/jira/browse/SPARK-27067 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Ryan Blue
[jira] [Resolved] (SPARK-27066) SPIP: Identifiers for multi-catalog support
[ https://issues.apache.org/jira/browse/SPARK-27066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved SPARK-27066. --- Resolution: Fixed I'm resolving this issue because the vote to adopt the proposal passed. I've added links to the google doc proposal (now view-only) and vote thread, and uploaded a copy of the proposal as a PDF. > SPIP: Identifiers for multi-catalog support > --- > > Key: SPARK-27066 > URL: https://issues.apache.org/jira/browse/SPARK-27066 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > Labels: SPIP > Attachments: SPIP_ Identifiers for multi-catalog Spark.pdf > >
[jira] [Commented] (SPARK-23521) SPIP: Standardize SQL logical plans with DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-23521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784736#comment-16784736 ] Ryan Blue commented on SPARK-23521: --- I've turned off commenting on the google doc to preserve its state, with the existing comments. I'm also adding a PDF of the final proposal to this issue. > SPIP: Standardize SQL logical plans with DataSourceV2 > - > > Key: SPARK-23521 > URL: https://issues.apache.org/jira/browse/SPARK-23521 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > Labels: SPIP > Attachments: SPIP_ Standardize logical plans.pdf > > > Executive Summary: This SPIP is based on [discussion about the DataSourceV2 > implementation|https://lists.apache.org/thread.html/55676ec1f5039d3deaf347d391cf82fe8574b8fa4eeab70110ed5b2b@%3Cdev.spark.apache.org%3E] > on the dev list. The proposal is to standardize the logical plans used for > write operations to make the planner more maintainable and to make Spark's > write behavior predictable and reliable. It proposes the following principles: > # Use well-defined logical plan nodes for all high-level operations: insert, > create, CTAS, overwrite table, etc. > # Use planner rules that match on these high-level nodes, so that it isn’t > necessary to create rules to match each eventual code path individually. > # Clearly define Spark’s behavior for these logical plan nodes. Physical > nodes should implement that behavior so that all code paths eventually make > the same guarantees. > # Specialize implementation when creating a physical plan, not logical > plans. This will avoid behavior drift and ensure planner code is shared > across physical implementations. > The SPIP doc presents a small but complete set of those high-level logical > operations, most of which are already defined in SQL or implemented by some > write path in Spark. 
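The SPIP's first two principles (dedicated logical nodes for each high-level operation, and planner rules that match on those node types rather than on individual code paths) can be sketched as follows; the node and rule names are invented, not Catalyst classes:

```python
# Toy model (invented node names, not Catalyst classes) of the SPIP's first
# two principles: each high-level operation gets its own logical node, and
# one planner rule matches that node type, so every code path producing the
# node gets identical, predictable behavior.
from dataclasses import dataclass

@dataclass
class CreateTableAsSelect:
    table: str
    query: str

@dataclass
class AppendData:
    table: str
    query: str

def plan(node):
    # One rule per well-defined logical node.
    if isinstance(node, CreateTableAsSelect):
        return f"CreateTableExec({node.table}) + AppendDataExec"
    if isinstance(node, AppendData):
        return "AppendDataExec"
    raise NotImplementedError(type(node).__name__)

print(plan(CreateTableAsSelect("t", "SELECT 1")))
```

Specialization then happens when choosing the physical node, not by introducing new logical plans per data source, which is how behavior drift is avoided.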
[jira] [Updated] (SPARK-27066) SPIP: Identifiers for multi-catalog support
[ https://issues.apache.org/jira/browse/SPARK-27066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-27066: -- Attachment: SPIP_ Identifiers for multi-catalog Spark.pdf > SPIP: Identifiers for multi-catalog support > --- > > Key: SPARK-27066 > URL: https://issues.apache.org/jira/browse/SPARK-27066 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > Labels: SPIP > Attachments: SPIP_ Identifiers for multi-catalog Spark.pdf > >
[jira] [Updated] (SPARK-23521) SPIP: Standardize SQL logical plans with DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-23521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-23521: -- Attachment: SPIP_ Standardize logical plans.pdf > SPIP: Standardize SQL logical plans with DataSourceV2 > - > > Key: SPARK-23521 > URL: https://issues.apache.org/jira/browse/SPARK-23521 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > Labels: SPIP > Attachments: SPIP_ Standardize logical plans.pdf > > > Executive Summary: This SPIP is based on [discussion about the DataSourceV2 > implementation|https://lists.apache.org/thread.html/55676ec1f5039d3deaf347d391cf82fe8574b8fa4eeab70110ed5b2b@%3Cdev.spark.apache.org%3E] > on the dev list. The proposal is to standardize the logical plans used for > write operations to make the planner more maintainable and to make Spark's > write behavior predictable and reliable. It proposes the following principles: > # Use well-defined logical plan nodes for all high-level operations: insert, > create, CTAS, overwrite table, etc. > # Use planner rules that match on these high-level nodes, so that it isn’t > necessary to create rules to match each eventual code path individually. > # Clearly define Spark’s behavior for these logical plan nodes. Physical > nodes should implement that behavior so that all code paths eventually make > the same guarantees. > # Specialize implementation when creating a physical plan, not logical > plans. This will avoid behavior drift and ensure planner code is shared > across physical implementations. > The SPIP doc presents a small but complete set of those high-level logical > operations, most of which are already defined in SQL or implemented by some > write path in Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26602) Subsequent queries are failing after querying the UDF which is loaded with wrong hdfs path
[ https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chakravarthi updated SPARK-26602: - Summary: Subsequent queries are failing after querying the UDF which is loaded with wrong hdfs path (was: Insert into table fails after querying the UDF which is loaded with wrong hdfs path) > Subsequent queries are failing after querying the UDF which is loaded with > wrong hdfs path > -- > > Key: SPARK-26602 > URL: https://issues.apache.org/jira/browse/SPARK-26602 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Haripriya >Priority: Major > Attachments: beforeFixUdf.txt > > > In SQL, > 1. Query an existing UDF (say myFunc1) > 2. Create and select a UDF registered with an incorrect path (say myFunc2) > 3. Now query the existing UDF again in the same session - this will throw an exception > stating that it couldn't read the resource at myFunc2's path > 4. Even basic operations like insert and select will fail with the same > error > Result: > java.lang.RuntimeException: Failed to read external resource > hdfs:///tmp/hari_notexists1/two_udfs.jar > at > org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288) > at > org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149) > at > org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275) > at > 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696) > at > org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841) > at > org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
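The failure mode described above — one UDF registered with a bad path poisoning every later statement in the session — can be modeled without Hive: the bad resource path stays in session state, and each subsequent command re-resolves all session resources and hits the same download failure. A hypothetical sketch (the Session class and its methods are invented for illustration, not Spark's code):

```python
# Hypothetical model of the reported behavior (not Spark/Hive code): once a
# bad resource path enters session state, every later command re-resolves all
# session resources and fails on the bad one, even unrelated INSERT/SELECTs.
class Session:
    def __init__(self):
        self.resources = []

    def add_resource(self, path):
        # The path is recorded before the download is attempted, so a bad
        # path remains in session state even after the add fails.
        self.resources.append(path)
        self._resolve_all()

    def run(self, sql):
        self._resolve_all()          # resources are re-resolved per command
        return f"OK: {sql}"

    def _resolve_all(self):
        for path in self.resources:
            if "notexists" in path:  # stand-in for a failed HDFS download
                raise RuntimeError(f"Failed to read external resource {path}")

s = Session()
print(s.run("SELECT myFunc1(col) FROM t"))   # works before the bad ADD JAR
try:
    s.add_resource("hdfs:///tmp/hari_notexists1/two_udfs.jar")
except RuntimeError:
    pass
try:
    s.run("INSERT INTO t VALUES (1)")        # now even inserts fail
except RuntimeError as e:
    print(e)
```

A fix along these lines would either avoid recording the resource until it resolves, or drop it from session state once resolution fails.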
[jira] [Comment Edited] (SPARK-27063) Spark on K8S Integration Tests timeouts are too short for some test clusters
[ https://issues.apache.org/jira/browse/SPARK-27063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784714#comment-16784714 ] Stavros Kontopoulos edited comment on SPARK-27063 at 3/5/19 5:53 PM: - Yes, another thing I noticed is that pulling the images may take time, so tests will expire (if you don't use the local daemon to build stuff for whatever reason). Also, in this [PR|https://github.com/apache/spark/pull/23514] I set the patience differently, because some tests may run too fast, for better or worse. was (Author: skonto): Yes some other thing that I noticed is when the images are pulled this may take time and tests will expire. Also in this [PR|https://github.com/apache/spark/pull/23514] I set patience differently because some tests may run too fast for good or bad. > Spark on K8S Integration Tests timeouts are too short for some test clusters > > > Key: SPARK-27063 > URL: https://issues.apache.org/jira/browse/SPARK-27063 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Rob Vesse >Priority: Minor > > As noted during development for SPARK-26729 there are a couple of integration > test timeouts that are too short when running on slower clusters, e.g. > developers' laptops, small CI clusters, etc. > [~skonto] confirmed that he has also experienced this behaviour in the > discussion on PR [PR > 23846|https://github.com/apache/spark/pull/23846#discussion_r262564938] > We should raise the defaults of these timeouts as an initial step and, longer term, > consider making the timeouts themselves configurable -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26727) CREATE OR REPLACE VIEW query fails with TableAlreadyExistsException
[ https://issues.apache.org/jira/browse/SPARK-26727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784666#comment-16784666 ] Ajith S commented on SPARK-26727: - [~rigolaszlo] I see from the stack trace that ThriftHiveMetastore$Client is used, which is a synchronous client for the metastore. Can you explain how you found that the drop command is async? > CREATE OR REPLACE VIEW query fails with TableAlreadyExistsException > --- > > Key: SPARK-26727 > URL: https://issues.apache.org/jira/browse/SPARK-26727 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Srinivas Yarra >Priority: Major > > We experienced that sometimes the Hive query "CREATE OR REPLACE VIEW <view name> AS SELECT <columns> FROM <table>" fails with the following exception: > {code:java} > // code placeholder > org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or > view '' already exists in database 'default'; at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:314) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:165) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195) at > org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195) at > org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at 
org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364) at > org.apache.spark.sql.Dataset.(Dataset.scala:195) at > org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:80) at > org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642) ... 49 elided > {code} > {code} > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res1: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res2: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res3: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res4: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res5: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res6: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res7: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res8: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res9: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res10: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res11: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") > org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or > view 'testsparkreplace' already exists in database 'default'; at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:246) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:236) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:236) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) > at > org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:236) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:319) > at >
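For context on the sync/async question: one generic way a CREATE OR REPLACE statement can intermittently fail with "already exists" is if the replace path is a non-atomic check, drop, then create, and the create runs before the drop's effect is visible. The sketch below models that race generically in Python; it is not Spark's actual implementation and does not assert this is the cause here:

```python
# Generic check-then-act race model (not Spark's actual code). If "replace"
# is implemented as separate drop and create steps, a drop whose effect is
# not yet visible lets the create collide with the old definition.
class Catalog:
    def __init__(self):
        self.tables = set()

    def drop(self, name, applied=True):
        if applied:            # applied=False models a drop not yet visible
            self.tables.discard(name)

    def create(self, name):
        if name in self.tables:
            raise RuntimeError(f"Table or view '{name}' already exists")
        self.tables.add(name)

def create_or_replace(cat, name, drop_visible=True):
    if name in cat.tables:
        cat.drop(name, applied=drop_visible)
    cat.create(name)

cat = Catalog()
create_or_replace(cat, "testsparkreplace")                      # creates
create_or_replace(cat, "testsparkreplace")                      # replaces
try:
    create_or_replace(cat, "testsparkreplace", drop_visible=False)
except RuntimeError as e:
    print(e)   # Table or view 'testsparkreplace' already exists
```

With a synchronous metastore client, as the comment above notes, this window should not exist, which is why the reporter's repro (usually succeeding, occasionally failing) is the interesting data point.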
[jira] [Commented] (SPARK-27059) spark-submit on kubernetes cluster does not recognise k8s --master property
[ https://issues.apache.org/jira/browse/SPARK-27059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784674#comment-16784674 ] Marcelo Vanzin commented on SPARK-27059: You're most probably using a version of Spark that does not support k8s. Try {{spark-submit --version}}. If it's 2.3 or later, check whether "spark-kubernetes*.jar" exists in the {{$SPARK_HOME/jars}} directory. > spark-submit on kubernetes cluster does not recognise k8s --master property > --- > > Key: SPARK-27059 > URL: https://issues.apache.org/jira/browse/SPARK-27059 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.3, 2.4.0 >Reporter: Andreas Adamides >Priority: Blocker > > I have successfully installed a Kubernetes cluster and can verify this by: > {{C:\windows\system32>kubectl cluster-info }} > {{*Kubernetes master is running at https://:* }} > *{{KubeDNS is running at > https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}* > Trying to run the SparkPi with the Spark release I downloaded from > [https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3) > *{{spark-submit --master k8s://https://: --deploy-mode cluster > --name spark-pi --class org.apache.spark.examples.SparkPi --conf > spark.executor.instances=2 --conf > spark.kubernetes.container.image=gettyimages/spark > c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}* > I am getting this error: > *{{Error: Master must either be yarn or start with spark, mesos, local Run > with --help for usage help or --verbose for debug output}}* > I also tried: > *{{spark-submit --help}}* > to see what I can get regarding the *--master* property. 
This is what I get: > *{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or > local.}}* > > According to the documentation > [[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on > running Spark workloads in Kubernetes, spark-submit does not even seem to > recognise the k8s value for master. [ included in possible Spark masters: > [https://spark.apache.org/docs/latest/submitting-applications.html#master-urls] > ] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
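For context, the quoted error comes from spark-submit's validation of the --master URL; a Spark build without the Kubernetes module simply has no k8s:// case in that check, which is why Marcelo's suggestion is to verify the version and the presence of the spark-kubernetes jar. A simplified, illustrative model of such a prefix check (not the real SparkSubmit code):

```python
# Simplified model of spark-submit's --master validation (illustrative only,
# not the real SparkSubmit code). A build without the spark-kubernetes module
# behaves like k8s_supported=False and rejects k8s:// URLs outright with
# "Master must either be yarn or start with spark, mesos, local".
def master_recognised(master, k8s_supported):
    if master == "yarn" or master.startswith(("spark://", "mesos://", "local")):
        return True
    return k8s_supported and master.startswith("k8s://")

url = "k8s://https://192.168.0.10:6443"   # hypothetical API server address
print(master_recognised(url, k8s_supported=True))   # True
print(master_recognised(url, k8s_supported=False))  # False
```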
[jira] [Commented] (SPARK-27065) avoid more than one active task set manager for a stage
[ https://issues.apache.org/jira/browse/SPARK-27065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784654#comment-16784654 ] Apache Spark commented on SPARK-27065: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/23927 > avoid more than one active task set manager for a stage > > > Key: SPARK-27065 > URL: https://issues.apache.org/jira/browse/SPARK-27065 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.3.3, 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27062) CatalogImpl.refreshTable should register query in cache with received tableName
[ https://issues.apache.org/jira/browse/SPARK-27062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] William Wong updated SPARK-27062: - Description: If the _CatalogImpl.refreshTable()_ method is invoked against a cached table, it first uncaches the corresponding query in the shared state cache manager, and then caches it back to refresh the cached copy. However, the table is re-cached with only the 'table name'; the database name is dropped. Therefore, if the cached table is not in the default database, the recreated cache may refer to a different table. For example, the cached table's name shown on the driver's storage page may change after the refresh. Here is the related code on GitHub for reference. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala] {code:java} override def refreshTable(tableName: String): Unit = { val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) val tableMetadata = sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent) val table = sparkSession.table(tableIdent) if (tableMetadata.tableType == CatalogTableType.VIEW) { // Temp or persistent views: refresh (or invalidate) any metadata/data cached // in the plan recursively. table.queryExecution.analyzed.refresh() } else { // Non-temp tables: refresh the metadata cache. sessionCatalog.refreshTable(tableIdent) } // If this table is cached as an InMemoryRelation, drop the original // cached version and make the new version cached lazily. if (isCached(table)) { // Uncache the logicalPlan. sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, blocking = true) // Cache it again. sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) } } {code} CatalogImpl caches the table with the received _tableName_, instead of _tableIdent.table_: {code:java} override def cacheTable(tableName: String): Unit = { sparkSession.sharedState.cacheManager.cacheQuery(sparkSession.table(tableName), Some(tableName)) } {code} Therefore, I would like to propose aligning the behavior. The refreshTable method should reuse the received _tableName_. Here is the proposed change, from {code:java} sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) {code} to {code:java} sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableName)){code} was: If CatalogImpl.refreshTable() method is invoked against a cached table, this method would first uncache corresponding query in the shared state cache manager, and then cache it back to refresh the cache copy. However, the table was recached with only 'table name'. The database name will be missed. Therefore, if cached table is not on the default database, the recreated cache may refer to a different table. For example, we may see the cached table name in driver's storage page will be changed after table refreshing. Here is related code on github for your reference. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala] {code:java} override def refreshTable(tableName: String): Unit = { val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) val tableMetadata = sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent) val table = sparkSession.table(tableIdent) if (tableMetadata.tableType == CatalogTableType.VIEW) { // Temp or persistent views: refresh (or invalidate) any metadata/data cached // in the plan recursively. table.queryExecution.analyzed.refresh() } else { // Non-temp tables: refresh the metadata cache. 
sessionCatalog.refreshTable(tableIdent) } // If this table is cached as an InMemoryRelation, drop the original // cached version and make the new version cached lazily. if (isCached(table)) { // Uncache the logicalPlan. sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, blocking = true) // Cache it again. sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) } } {code} Actually, CatalogImpl cache table with received table name, instead of only the table name. {code:java} override def cacheTable(tableName: String): Unit = { sparkSession.sharedState.cacheManager.cacheQuery(sparkSession.table(tableName), Some(tableName)) } {code} Therefore, I would like to propose aligning the behavior. RefreshTable method should reuse the received tableName. Here is the proposed changes. {code:java} sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) {code} to {code:java} sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableName)) {code} > CatalogImpl.refreshTable should register query in cache with
[jira] [Assigned] (SPARK-27065) avoid more than one active task set manager for a stage
[ https://issues.apache.org/jira/browse/SPARK-27065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27065: Assignee: Wenchen Fan (was: Apache Spark) > avoid more than one active task set manager for a stage > > > Key: SPARK-27065 > URL: https://issues.apache.org/jira/browse/SPARK-27065 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.3.3, 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27065) avoid more than one active task set manager for a stage
[ https://issues.apache.org/jira/browse/SPARK-27065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27065: Assignee: Apache Spark (was: Wenchen Fan) > avoid more than one active task set manager for a stage > > > Key: SPARK-27065 > URL: https://issues.apache.org/jira/browse/SPARK-27065 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.3.3, 2.4.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27065) avoid more than one active task set manager for a stage
Wenchen Fan created SPARK-27065: --- Summary: avoid more than one active task set manager for a stage Key: SPARK-27065 URL: https://issues.apache.org/jira/browse/SPARK-27065 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 2.4.0, 2.3.3 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27064) create StreamingWrite at the beginning of streaming execution
Wenchen Fan created SPARK-27064: --- Summary: create StreamingWrite at the beginning of streaming execution Key: SPARK-27064 URL: https://issues.apache.org/jira/browse/SPARK-27064 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27064) create StreamingWrite at the beginning of streaming execution
[ https://issues.apache.org/jira/browse/SPARK-27064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27064: Assignee: Apache Spark (was: Wenchen Fan) > create StreamingWrite at the beginning of streaming execution > > > Key: SPARK-27064 > URL: https://issues.apache.org/jira/browse/SPARK-27064 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27064) create StreamingWrite at the beginning of streaming execution
[ https://issues.apache.org/jira/browse/SPARK-27064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27064: Assignee: Wenchen Fan (was: Apache Spark) > create StreamingWrite at the beginning of streaming execution > > > Key: SPARK-27064 > URL: https://issues.apache.org/jira/browse/SPARK-27064 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27062) CatalogImpl.refreshTable should register query in cache with received tableName
[ https://issues.apache.org/jira/browse/SPARK-27062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] William Wong updated SPARK-27062: - Description: If CatalogImpl.refreshTable() method is invoked against a cached table, this method would first uncache corresponding query in the shared state cache manager, and then cache it back to refresh the cache copy. However, the table was recached with only 'table name'. The database name will be missed. Therefore, if cached table is not on the default database, the recreated cache may refer to a different table. For example, we may see the cached table name in driver's storage page will be changed after table refreshing. Here is related code on github for your reference. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala] {code:java} override def refreshTable(tableName: String): Unit = { val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) val tableMetadata = sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent) val table = sparkSession.table(tableIdent) if (tableMetadata.tableType == CatalogTableType.VIEW) { // Temp or persistent views: refresh (or invalidate) any metadata/data cached // in the plan recursively. table.queryExecution.analyzed.refresh() } else { // Non-temp tables: refresh the metadata cache. sessionCatalog.refreshTable(tableIdent) } // If this table is cached as an InMemoryRelation, drop the original // cached version and make the new version cached lazily. if (isCached(table)) { // Uncache the logicalPlan. sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, blocking = true) // Cache it again. sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) } } {code} Actually, CatalogImpl cache table with received table name, instead of only the table name. 
{code:java} override def cacheTable(tableName: String): Unit = { sparkSession.sharedState.cacheManager.cacheQuery(sparkSession.table(tableName), Some(tableName)) } {code} Therefore, I would like to propose aligning the behavior. RefreshTable method should reuse the received tableName. Here is the proposed changes. {code:java} sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) {code} to {code:java} sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableName)) {code} was: If CatalogImpl.refreshTable() method is invoked against a cached table, this method would first uncache corresponding query in the shared state cache manager, and then cache it back to refresh the cache copy. However, the table was recached with only 'table name'. The database name will be missed. Therefore, if cached table is not on the default database, the recreated cache may refer to a different table. For example, we may see the cached table name in driver's storage page will be changed after table refreshing. Here is related code on github for your reference. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala] {code:java} override def refreshTable(tableName: String): Unit = { val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) val tableMetadata = sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent) val table = sparkSession.table(tableIdent) if (tableMetadata.tableType == CatalogTableType.VIEW) { // Temp or persistent views: refresh (or invalidate) any metadata/data cached // in the plan recursively. table.queryExecution.analyzed.refresh() } else { // Non-temp tables: refresh the metadata cache. sessionCatalog.refreshTable(tableIdent) } // If this table is cached as an InMemoryRelation, drop the original // cached version and make the new version cached lazily. if (isCached(table)) { // Uncache the logicalPlan. 
sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, blocking = true) // Cache it again. sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) } } {code} In Spark SQL module, the database name is registered together with table name when "CACHE TABLE" command was executed. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/cache.scala] and CatalogImpl register cache with received table name. {code:java} override def cacheTable(tableName: String): Unit = { sparkSession.sharedState.cacheManager.cacheQuery(sparkSession.table(tableName), Some(tableName)) } {code} Therefore, I would like to propose aligning the behavior. RefreshTable method should reuse the received table name instead. {code:java} sparkSession.sharedState.cacheManager.cacheQuery(table,
[jira] [Assigned] (SPARK-27063) Spark on K8S Integration Tests timeouts are too short for some test clusters
[ https://issues.apache.org/jira/browse/SPARK-27063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27063: Assignee: Apache Spark > Spark on K8S Integration Tests timeouts are too short for some test clusters > > > Key: SPARK-27063 > URL: https://issues.apache.org/jira/browse/SPARK-27063 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Rob Vesse >Assignee: Apache Spark >Priority: Minor > > As noted during development for SPARK-26729 there are a couple of integration > test timeouts that are too short when running on slower clusters, e.g. > developers' laptops, small CI clusters, etc. > [~skonto] confirmed that he has also experienced this behaviour in the > discussion on PR [PR > 23846|https://github.com/apache/spark/pull/23846#discussion_r262564938] > We should raise the defaults of these timeouts as an initial step and, longer term, > consider making the timeouts themselves configurable -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27062) CatalogImpl.refreshTable should register query in cache with received tableName
[ https://issues.apache.org/jira/browse/SPARK-27062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] William Wong updated SPARK-27062: - Summary: CatalogImpl.refreshTable should register query in cache with received tableName (was: Refresh Table command register table with table name only)
> CatalogImpl.refreshTable should register query in cache with received
> tableName
> ---
>
> Key: SPARK-27062
> URL: https://issues.apache.org/jira/browse/SPARK-27062
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.2
> Reporter: William Wong
> Priority: Minor
> Labels: easyfix, pull-request-available
> Original Estimate: 2h
> Remaining Estimate: 2h
>
> If the CatalogImpl.refreshTable() method is invoked against a cached table, it
> first uncaches the corresponding query in the shared-state cache manager and
> then caches it again to refresh the cached copy.
> However, the table is recached with only the table name; the database name is
> dropped. Therefore, if the cached table is not in the default database, the
> recreated cache may refer to a different table. For example, the cached table
> name shown on the driver's storage page may change after the table is
> refreshed.
>
> Here is the related code on GitHub for reference:
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala]
>
> {code:java}
> override def refreshTable(tableName: String): Unit = {
>   val tableIdent =
>     sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
>   val tableMetadata =
>     sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent)
>   val table = sparkSession.table(tableIdent)
>   if (tableMetadata.tableType == CatalogTableType.VIEW) {
>     // Temp or persistent views: refresh (or invalidate) any metadata/data cached
>     // in the plan recursively.
>     table.queryExecution.analyzed.refresh()
>   } else {
>     // Non-temp tables: refresh the metadata cache.
>     sessionCatalog.refreshTable(tableIdent)
>   }
>   // If this table is cached as an InMemoryRelation, drop the original
>   // cached version and make the new version cached lazily.
>   if (isCached(table)) {
>     // Uncache the logicalPlan.
>     sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, blocking = true)
>     // Cache it again.
>     sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table))
>   }
> }
> {code}
>
> In the Spark SQL module, the database name is registered together with the
> table name when the "CACHE TABLE" command is executed:
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/cache.scala]
>
> and CatalogImpl registers the cache with the received table name:
> {code:java}
> override def cacheTable(tableName: String): Unit = {
>   sparkSession.sharedState.cacheManager.cacheQuery(sparkSession.table(tableName), Some(tableName))
> }
> {code}
>
> Therefore, I would like to propose aligning the behavior: the refreshTable
> method should reuse the received table name instead, changing
> {code:java}
> sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table))
> {code}
> to
> {code:java}
> sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableName))
> {code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27063) Spark on K8S Integration Tests timeouts are too short for some test clusters
[ https://issues.apache.org/jira/browse/SPARK-27063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27063: Assignee: (was: Apache Spark)
> Spark on K8S Integration Tests timeouts are too short for some test clusters
> ---
>
> Key: SPARK-27063
> URL: https://issues.apache.org/jira/browse/SPARK-27063
> Project: Spark
> Issue Type: Improvement
> Components: Kubernetes
> Affects Versions: 2.4.0
> Reporter: Rob Vesse
> Priority: Minor
>
> As noted during development for SPARK-26729, there are a couple of integration
> test timeouts that are too short when running on slower clusters, e.g.
> developers' laptops, small CI clusters, etc.
> [~skonto] confirmed that he has also experienced this behaviour in the
> discussion on [PR 23846|https://github.com/apache/spark/pull/23846#discussion_r262564938]
> We should raise the defaults for these timeouts as an initial step and, longer
> term, consider making the timeouts themselves configurable.
[jira] [Updated] (SPARK-27062) Refresh Table command register table with table name only
[ https://issues.apache.org/jira/browse/SPARK-27062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] William Wong updated SPARK-27062: - Priority: Minor (was: Major)
> Refresh Table command register table with table name only
> ---
>
> Key: SPARK-27062
> URL: https://issues.apache.org/jira/browse/SPARK-27062
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.2
> Reporter: William Wong
> Priority: Minor
> Labels: easyfix, pull-request-available
> Original Estimate: 2h
> Remaining Estimate: 2h
[jira] [Updated] (SPARK-27062) Refresh Table command register table with table name only
[ https://issues.apache.org/jira/browse/SPARK-27062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] William Wong updated SPARK-27062: - Labels: easyfix pull-request-available (was: easyfix)
> Refresh Table command register table with table name only
> ---
>
> Key: SPARK-27062
> URL: https://issues.apache.org/jira/browse/SPARK-27062
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.2
> Reporter: William Wong
> Priority: Major
> Labels: easyfix, pull-request-available
> Original Estimate: 2h
> Remaining Estimate: 2h
[jira] [Commented] (SPARK-26602) Insert into table fails after querying the UDF which is loaded with wrong hdfs path
[ https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784611#comment-16784611 ] Ajith S commented on SPARK-26602: - # I have a question about this issue in the thrift-server case. If an admin does an add jar with a non-existent jar (perhaps a human error), it causes all ongoing beeline sessions to fail (even a query where the jar is not needed at all), and the only way to recover is a restart of the thrift-server. # As you said, "If a user adds something to the classpath, it matters to the whole classpath. If it's missing, I think it's surprising to ignore that fact" - but unless the user actually refers to the jar, is it OK to fail all of their operations (just like JVM behaviour)? Please correct me if I am wrong. cc [~srowen]
> Insert into table fails after querying the UDF which is loaded with wrong
> hdfs path
> ---
>
> Key: SPARK-26602
> URL: https://issues.apache.org/jira/browse/SPARK-26602
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Haripriya
> Priority: Major
> Attachments: beforeFixUdf.txt
>
> In SQL:
> 1. Query an existing UDF (say myFunc1)
> 2. Create and select a UDF registered with an incorrect path (say myFunc2)
> 3. Now query the existing UDF again in the same session - it will throw an
> exception stating that the resource at myFunc2's path couldn't be read
> 4. Even basic operations like insert and select will fail with the same error
> Result:
> java.lang.RuntimeException: Failed to read external resource
> hdfs:///tmp/hari_notexists1/two_udfs.jar
> at org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288)
> at org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242)
> at org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163)
> at org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149)
> at org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67)
> at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737)
> at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706)
> at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
> at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
> at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
> at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
> at org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706)
> at org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696)
> at org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841)
> at org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112)
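The failure mode discussed in the comment above, where one bad ADD JAR poisons every subsequent operation because the session replays its whole resource list, can be sketched with a small stand-in. This is an illustrative model only; `SessionResources`, `addJar`, and `replay` are hypothetical names and do not reflect Spark's or Hive's actual API.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical model of a session's resource list: a bad jar is recorded
// unconditionally, and every later replay fails on it, even for queries
// that never reference that jar.
final class SessionResources {
  private val resources = ListBuffer.empty[String]

  // Records the resource without checking it exists, mirroring the
  // behaviour described in the issue.
  def addJar(path: String): Unit = resources += path

  // Replays all registered resources; fails on the first missing one,
  // which is why unrelated queries start failing after one bad ADD JAR.
  def replay(exists: String => Boolean): Either[String, Int] =
    resources.find(r => !exists(r)) match {
      case Some(bad) => Left(s"Failed to read external resource $bad")
      case None      => Right(resources.size)
    }
}

object ResourceDemo {
  def main(args: Array[String]): Unit = {
    val session = new SessionResources
    session.addJar("hdfs:///tmp/good.jar")
    session.addJar("hdfs:///tmp/missing.jar") // the human error
    val available = Set("hdfs:///tmp/good.jar")
    // Every replay from now on fails, regardless of what the query needs.
    println(session.replay(available.contains))
  }
}
```

The alternative behaviours debated in the thread map onto this sketch directly: fail fast at `addJar` time, or defer the error to the point where the jar is actually referenced, as the JVM does with missing classes.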
[jira] [Assigned] (SPARK-27062) Refresh Table command register table with table name only
[ https://issues.apache.org/jira/browse/SPARK-27062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27062: Assignee: (was: Apache Spark)
> Refresh Table command register table with table name only
> ---
>
> Key: SPARK-27062
> URL: https://issues.apache.org/jira/browse/SPARK-27062
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.2
> Reporter: William Wong
> Priority: Major
> Labels: easyfix
> Original Estimate: 2h
> Remaining Estimate: 2h
[jira] [Assigned] (SPARK-27062) Refresh Table command register table with table name only
[ https://issues.apache.org/jira/browse/SPARK-27062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27062: Assignee: Apache Spark
> Refresh Table command register table with table name only
> ---
>
> Key: SPARK-27062
> URL: https://issues.apache.org/jira/browse/SPARK-27062
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.2
> Reporter: William Wong
> Assignee: Apache Spark
> Priority: Major
> Labels: easyfix
> Original Estimate: 2h
> Remaining Estimate: 2h
[jira] [Comment Edited] (SPARK-26602) Insert into table fails after querying the UDF which is loaded with wrong hdfs path
[ https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784611#comment-16784611 ] Ajith S edited comment on SPARK-26602 at 3/5/19 4:15 PM: - # I have a question about this issue in the thrift-server case. If an admin does an add jar with a non-existent jar (perhaps a human error), it causes all ongoing beeline sessions to fail (even a query where the jar is not needed at all), and the only way to recover is a restart of the thrift-server. # As you said, "If a user adds something to the classpath, it matters to the whole classpath. If it's missing, I think it's surprising to ignore that fact" - but unless the user actually refers to the jar, is it OK to fail all of their operations (just like JVM behaviour: we get a ClassNotFoundException only when the missing class is actually referenced; until then the JVM runs happily)? Please correct me if I am wrong. cc [~srowen]
was (Author: ajithshetty): # I have a question about this issue in thrift-server case. If admin does a add jar with a non-existing jar (may be a human error), it will cause all the ongoing beeline sessions to fail ( even a query where jar is not needed at all). and only way to recover is restart of thrift-server # As you said, "If a user adds something to the classpath, it matters to the whole classpath. If it's missing, I think it's surprising to ignore that fact" - but unless the user refers to the jar, is it ok to fail all of his operations.? (just like JVM behaviour) Please correct me if i am wrong cc [~srowen]
[jira] [Created] (SPARK-27063) Spark on K8S Integration Tests timeouts are too short for some test clusters
Rob Vesse created SPARK-27063: - Summary: Spark on K8S Integration Tests timeouts are too short for some test clusters Key: SPARK-27063 URL: https://issues.apache.org/jira/browse/SPARK-27063 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 2.4.0 Reporter: Rob Vesse
As noted during development for SPARK-26729, there are a couple of integration test timeouts that are too short when running on slower clusters, e.g. developers' laptops, small CI clusters, etc. [~skonto] confirmed that he has also experienced this behaviour in the discussion on [PR 23846|https://github.com/apache/spark/pull/23846#discussion_r262564938]. We should raise the defaults for these timeouts as an initial step and, longer term, consider making the timeouts themselves configurable.
[jira] [Updated] (SPARK-27062) Refresh Table command register table with table name only
[ https://issues.apache.org/jira/browse/SPARK-27062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] William Wong updated SPARK-27062: - Description: If CatalogImpl.refreshTable() method is invoked against a cached table, this method would first uncache corresponding query in the shared state cache manager, and then cache it back to refresh the cache copy. However, the table was recached with only 'table name'. The database name will be missed. Therefore, if cached table is not on the default database, the recreated cache may refer to a different table. For example, we may see the cached table name in driver's storage page will be changed after table refreshing. Here is related code on github for your reference. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala] {code:java} override def refreshTable(tableName: String): Unit = { val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) val tableMetadata = sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent) val table = sparkSession.table(tableIdent) if (tableMetadata.tableType == CatalogTableType.VIEW) { // Temp or persistent views: refresh (or invalidate) any metadata/data cached // in the plan recursively. table.queryExecution.analyzed.refresh() } else { // Non-temp tables: refresh the metadata cache. sessionCatalog.refreshTable(tableIdent) } // If this table is cached as an InMemoryRelation, drop the original // cached version and make the new version cached lazily. if (isCached(table)) { // Uncache the logicalPlan. sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, blocking = true) // Cache it again. sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) } } {code} In Spark SQL module, the database name is registered together with table name when "CACHE TABLE" command was executed. 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/cache.scala] and CatalogImpl register cache with received table name. {code:java} override def cacheTable(tableName: String): Unit = { sparkSession.sharedState.cacheManager.cacheQuery(sparkSession.table(tableName), Some(tableName)) } {code} Therefore, I would like to propose aligning the behavior. RefreshTable method should reuse the received table name instead. {code:java} sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) {code} to {code:java} sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableName)) {code} was: If CatalogImpl.refreshTable() method is invoked against a cached table, this method would first uncache corresponding query in the shared state cache manager, and then cache it back to refresh the cache copy. However, the table was recached with only 'table name'. The database name will be missed. Therefore, if cached table is not on the default database, the recreated cache may refer to a different table. For example, we may see the cached table name in driver's storage page will be changed after table refreshing. Here is related code on github for your reference. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala] {code:java} override def refreshTable(tableName: String): Unit = { val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) val tableMetadata = sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent) val table = sparkSession.table(tableIdent) if (tableMetadata.tableType == CatalogTableType.VIEW) { // Temp or persistent views: refresh (or invalidate) any metadata/data cached // in the plan recursively. table.queryExecution.analyzed.refresh() } else { // Non-temp tables: refresh the metadata cache. 
sessionCatalog.refreshTable(tableIdent) } // If this table is cached as an InMemoryRelation, drop the original // cached version and make the new version cached lazily. if (isCached(table)) { // Uncache the logicalPlan. sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, blocking = true) // Cache it again. sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) } } {code} In Spark SQL module, the database name is registered together with table name when "CACHE TABLE" command was executed. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/cache.scala] Therefore, I would like to propose aligning the behavior. Full table name should also be used in RefreshTable case. We should change the following line in CatalogImpl.refreshTable from {code:java} sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) {code} to {code:java} sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.quotedString)) {code}
[jira] [Created] (SPARK-27062) Refresh Table command register table with table name only
William Wong created SPARK-27062: - Summary: Refresh Table command register table with table name only Key: SPARK-27062 URL: https://issues.apache.org/jira/browse/SPARK-27062 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.2 Reporter: William Wong
If the CatalogImpl.refreshTable() method is invoked against a cached table, it first uncaches the corresponding query in the shared-state cache manager and then caches it again to refresh the cached copy. However, the table is recached with only the table name; the database name is dropped. Therefore, if the cached table is not in the default database, the recreated cache may refer to a different table. For example, the cached table name shown on the driver's storage page may change after the table is refreshed.
Here is the related code on GitHub for reference: [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala]
{code:java}
override def refreshTable(tableName: String): Unit = {
  val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
  val tableMetadata = sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent)
  val table = sparkSession.table(tableIdent)
  if (tableMetadata.tableType == CatalogTableType.VIEW) {
    // Temp or persistent views: refresh (or invalidate) any metadata/data cached
    // in the plan recursively.
    table.queryExecution.analyzed.refresh()
  } else {
    // Non-temp tables: refresh the metadata cache.
    sessionCatalog.refreshTable(tableIdent)
  }
  // If this table is cached as an InMemoryRelation, drop the original
  // cached version and make the new version cached lazily.
  if (isCached(table)) {
    // Uncache the logicalPlan.
    sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, blocking = true)
    // Cache it again.
    sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table))
  }
}
{code}
In the Spark SQL module, the database name is registered together with the table name when the "CACHE TABLE" command is executed: [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/cache.scala] Therefore, I would like to propose aligning the behavior: the full table name should also be used in the RefreshTable case. We should change the following line in CatalogImpl.refreshTable from
{code:java}
sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table))
{code}
to
{code:java}
sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.quotedString))
{code}
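The identifier behaviour this proposal hinges on can be sketched with a minimal, self-contained stand-in. This is a simplified model written for illustration, not Spark's actual `org.apache.spark.sql.catalyst.TableIdentifier`; the point is only to show why registering the cache entry with `.table` alone loses the database qualifier, while a qualified string keeps it.

```scala
// Simplified stand-in for a parsed table identifier (illustrative only).
final case class TableIdentifier(table: String, database: Option[String]) {
  // Fully qualified, back-quoted form, e.g. `sales`.`orders`.
  def quotedString: String =
    database.map(db => s"`$db`.`$table`").getOrElse(s"`$table`")
}

object RefreshDemo {
  def main(args: Array[String]): Unit = {
    val ident = TableIdentifier("orders", Some("sales"))
    // What refreshTable registers today: the bare table name only,
    // indistinguishable from "orders" in any other database.
    println(ident.table)        // orders
    // What the qualified registration would preserve.
    println(ident.quotedString) // `sales`.`orders`
  }
}
```

Either proposed fix, reusing the received `tableName` string or using `tableIdent.quotedString`, keeps the database qualifier in the registered cache name; only the bare `.table` field drops it.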
[jira] [Comment Edited] (SPARK-27036) Even Broadcast thread is timed out, BroadCast Job is not aborted.
[ https://issues.apache.org/jira/browse/SPARK-27036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782840#comment-16782840 ] Sujith Chacko edited comment on SPARK-27036 at 3/5/19 3:49 PM: --- The problem area seems to be BroadcastExchangeExec in the driver, where a particular job is fired as part of a Future and the collected data is broadcast. The main problem is that the system submits the job and its respective stages/tasks through the DAGScheduler, where the scheduler thread schedules the respective events. In BroadcastExchangeExec, when the future times out, the respective exception is thrown, but the jobs/tasks scheduled by the DAGScheduler as part of the action called in the Future are not cancelled. I think we should cancel the respective job to avoid running it in the background even after the Future timeout exception; this would terminate the job promptly when a TimeoutException happens and also save the additional resources utilized even after the timeout exception is thrown from the driver. I want to give an attempt at handling this issue; any comments or suggestions are welcome. cc [~b...@cloudera.com] [~hvanhovell] [~srowen] was (Author: s71955): It seems to be the problem area is BroadcastExchangeExec in driver where as part of Future a particular job will be fired and collected data will be broadcasted. 
The main problem is system will submit the job and its respective stage/tasks through DAGScheduler, where the scheduler thread will schedule the respective events , In BroadcastExchangeExec when future time out happens respective exception will thrown but the jobs/task which is scheduled by the DAGScheduler as part of the action called in future will not be cancelled, I think we shall cancel the respective job to avoid running the same in background even after Future time out exception, this can help to terminate the job promptly when TimeOutException happens, this will also save the additional resources getting utilized even after timeout exception thrown from driver. I want to give an attempt to handle this issue, Any comments suggestions are welcome. cc [~sro...@scient.com] [~b...@cloudera.com] [~hvanhovell] > Even Broadcast thread is timed out, BroadCast Job is not aborted. > - > > Key: SPARK-27036 > URL: https://issues.apache.org/jira/browse/SPARK-27036 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 >Reporter: Babulal >Priority: Minor > Attachments: image-2019-03-04-00-38-52-401.png, > image-2019-03-04-00-39-12-210.png, image-2019-03-04-00-39-38-779.png > > > While the broadcast table job is executing, if a broadcast timeout > (spark.sql.broadcastTimeout) happens, the broadcast job still continues till > completion, whereas it should abort on the broadcast timeout. > The exception is thrown in the console but the Spark job still continues. > > !image-2019-03-04-00-39-38-779.png! > !image-2019-03-04-00-39-12-210.png! > > wait for some time > !image-2019-03-04-00-38-52-401.png! > !image-2019-03-04-00-34-47-884.png! 
> > How to Reproduce Issue > Option1 using SQL:- > create Table t1(Big Table,1M Records) > val rdd1=spark.sparkContext.parallelize(1 to 100,100).map(x=> > ("name_"+x,x%3,x)) > val df=rdd1.toDF.selectExpr("_1 as name","_2 as age","_3 as sal","_1 as > c1","_1 as c2","_1 as c3","_1 as c4","_1 as c5","_1 as c6","_1 as c7","_1 as > c8","_1 as c9","_1 as c10","_1 as c11","_1 as c12","_1 as c13","_1 as > c14","_1 as c15","_1 as c16","_1 as c17","_1 as c18","_1 as c19","_1 as > c20","_1 as c21","_1 as c22","_1 as c23","_1 as c24","_1 as c25","_1 as > c26","_1 as c27","_1 as c28","_1 as c29","_1 as c30") > df.write.csv("D:/data/par1/t4"); > spark.sql("create table csv_2 using csv options('path'='D:/data/par1/t4')"); > create Table t2(Small Table,100K records) > val rdd1=spark.sparkContext.parallelize(1 to 10,100).map(x=> > ("name_"+x,x%3,x)) > val df=rdd1.toDF.selectExpr("_1 as name","_2 as age","_3 as sal","_1 as > c1","_1 as c2","_1 as c3","_1 as c4","_1 as c5","_1 as c6","_1 as c7","_1 as > c8","_1 as c9","_1 as c10","_1 as c11","_1 as c12","_1 as c13","_1 as > c14","_1 as c15","_1 as c16","_1 as c17","_1 as c18","_1 as c19","_1 as > c20","_1 as c21","_1 as c22","_1 as c23","_1 as c24","_1 as c25","_1 as > c26","_1 as c27","_1 as c28","_1 as c29","_1 as c30") > df.write.csv("D:/data/par1/t4"); > spark.sql("create table csv_2 using csv options('path'='D:/data/par1/t5')"); > spark.sql("set spark.sql.autoBroadcastJoinThreshold=73400320").show(false) > spark.sql("set spark.sql.broadcastTimeout=2").show(false) > Run Below Query > spark.sql("create table s using parquet as select t1.* from csv_2 as > t1,csv_1 as t2 where
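The behaviour described in the comment above, that awaiting a Future with a timeout does not stop the work the Future already started, can be sketched in plain Scala (a simplified illustration, not Spark's actual BroadcastExchangeExec code; the timing constants are arbitrary):

```scala
import java.util.concurrent.atomic.AtomicBoolean

import scala.concurrent.{Await, Future, TimeoutException}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object BroadcastTimeoutDemo {
  // Returns true if the background task finished even though the await
  // on it timed out, mirroring how the broadcast job keeps running after
  // spark.sql.broadcastTimeout expires.
  def timedOutWorkStillCompletes(): Boolean = {
    val finished = new AtomicBoolean(false)
    val work = Future {
      Thread.sleep(500) // stands in for the long-running broadcast job
      finished.set(true)
    }
    try {
      Await.result(work, 100.millis) // analogous to the broadcast timeout
    } catch {
      case _: TimeoutException =>
        // The exception is thrown here, but nothing cancels `work`:
        // the task keeps executing in the background.
    }
    Thread.sleep(600) // give the background task time to finish
    finished.get()
  }

  def main(args: Array[String]): Unit =
    println(s"work completed despite timeout: ${timedOutWorkStillCompletes()}")
}
```

Actually stopping the work would require cancelling the underlying Spark job as well, for example by running it in a job group and calling SparkContext.cancelJobGroup when the timeout fires.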
[jira] [Commented] (SPARK-27060) DDL Commands are accepting Keywords like create, drop as tableName
[ https://issues.apache.org/jira/browse/SPARK-27060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784589#comment-16784589 ] Sujith Chacko commented on SPARK-27060: --- Yes, quite surprising. In Hive they validate all the keywords, but it seems that per our SqlBase.g4 grammar we accept the reserved keywords. Will analyze this further and raise a PR; let me know of any suggestions. Thanks > DDL Commands are accepting Keywords like create, drop as tableName > -- > > Key: SPARK-27060 > URL: https://issues.apache.org/jira/browse/SPARK-27060 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.2, 2.4.0 >Reporter: Sachin Ramachandra Setty >Priority: Minor > > Seems to be a compatibility issue compared to other components such as Hive > and MySQL. > DDL commands succeed even though the tableName is the same as a keyword. > Tested with columnNames as well and the issue exists. > Whereas Hive-Beeline throws a ParseException and does not accept keywords > as tableName or columnName, and MySQL accepts keywords only as columnName. 
> Spark-Behaviour : > Connected to: Spark SQL (version 2.3.2.0101) > CLI_DBMS_APPID > Beeline version 1.2.1.spark_2.3.2.0101 by Apache Hive > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table create(id int); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.255 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table drop(int int); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.257 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table drop; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.236 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table create; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.168 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table tab1(float float); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.111 seconds) > 0: jdbc:hive2://10.18.XXX:23040/default> create table double(double float); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.093 seconds) > Hive-Behaviour : > Connected to: Apache Hive (version 3.1.0) > Driver: Hive JDBC (version 3.1.0) > Transaction isolation: TRANSACTION_REPEATABLE_READ > Beeline version 3.1.0 by Apache Hive > 0: jdbc:hive2://10.18.XXX:21066/> create table create(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:13 > cannot recognize input near 'create' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18.XXX:21066/> create table drop(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:13 > cannot recognize input near 'drop' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18XXX:21066/> create table tab1(float float); > Error: Error while compiling statement: FAILED: ParseException line 1:18 > cannot recognize input near 'float' 'float' ')' in column name or constraint > (state=42000,code=4) > 0: jdbc:hive2://10.18XXX:21066/> drop table create(id int); > Error: Error while compiling 
statement: FAILED: ParseException line 1:11 > cannot recognize input near 'create' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18.XXX:21066/> drop table drop(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:11 > cannot recognize input near 'drop' '(' 'id' in table name > (state=42000,code=4) > mySql : > CREATE TABLE CREATE(ID integer); > Error: near "CREATE": syntax error > CREATE TABLE DROP(ID integer); > Error: near "DROP": syntax error > CREATE TABLE TAB1(FLOAT FLOAT); > Success -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27061) Expose 4040 port on driver service to access logs using service
Chandu Kavar created SPARK-27061: Summary: Expose 4040 port on driver service to access logs using service Key: SPARK-27061 URL: https://issues.apache.org/jira/browse/SPARK-27061 Project: Spark Issue Type: Task Components: Kubernetes Affects Versions: 2.4.0 Reporter: Chandu Kavar Currently, we can access the driver logs using {{kubectl port-forward 4040:4040}} as mentioned in [https://spark.apache.org/docs/latest/running-on-kubernetes.html#accessing-driver-ui] We have users who submit Spark jobs to Kubernetes, but they don't have access to the cluster, so they can't use the kubectl port-forward command. If we can expose port 4040 on the driver service, we can easily relay these logs to the UI using the driver service and an Nginx reverse proxy. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27060) DDL Commands are accepting Keywords like create, drop as tableName
[ https://issues.apache.org/jira/browse/SPARK-27060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784575#comment-16784575 ] Sachin Ramachandra Setty edited comment on SPARK-27060 at 3/5/19 3:40 PM: -- I verified this issue with Spark 2.3.2 and Spark 2.4.0 versions was (Author: sachin1729): I verified this issue with 2.3.2 and 2.4.0 . > DDL Commands are accepting Keywords like create, drop as tableName > -- > > Key: SPARK-27060 > URL: https://issues.apache.org/jira/browse/SPARK-27060 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.2, 2.4.0 >Reporter: Sachin Ramachandra Setty >Priority: Minor > > Seems to be a compatibility issue compared to other components such as hive > and mySql. > DDL commands are successful even though the tableName is same as keyword. > Tested with columnNames as well and issue exists. > Whereas, Hive-Beeline is throwing ParseException and not accepting keywords > as tableName or columnName and mySql is accepting keywords only as columnName. 
> Spark-Behaviour : > Connected to: Spark SQL (version 2.3.2.0101) > CLI_DBMS_APPID > Beeline version 1.2.1.spark_2.3.2.0101 by Apache Hive > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table create(id int); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.255 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table drop(int int); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.257 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table drop; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.236 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table create; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.168 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table tab1(float float); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.111 seconds) > 0: jdbc:hive2://10.18.XXX:23040/default> create table double(double float); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.093 seconds) > Hive-Behaviour : > Connected to: Apache Hive (version 3.1.0) > Driver: Hive JDBC (version 3.1.0) > Transaction isolation: TRANSACTION_REPEATABLE_READ > Beeline version 3.1.0 by Apache Hive > 0: jdbc:hive2://10.18.XXX:21066/> create table create(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:13 > cannot recognize input near 'create' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18.XXX:21066/> create table drop(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:13 > cannot recognize input near 'drop' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18XXX:21066/> create table tab1(float float); > Error: Error while compiling statement: FAILED: ParseException line 1:18 > cannot recognize input near 'float' 'float' ')' in column name or constraint > (state=42000,code=4) > 0: jdbc:hive2://10.18XXX:21066/> drop table create(id int); > Error: Error while compiling 
statement: FAILED: ParseException line 1:11 > cannot recognize input near 'create' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18.XXX:21066/> drop table drop(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:11 > cannot recognize input near 'drop' '(' 'id' in table name > (state=42000,code=4) > mySql : > CREATE TABLE CREATE(ID integer); > Error: near "CREATE": syntax error > CREATE TABLE DROP(ID integer); > Error: near "DROP": syntax error > CREATE TABLE TAB1(FLOAT FLOAT); > Success -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27005) Design sketch: Accelerator-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784555#comment-16784555 ] Thomas Graves edited comment on SPARK-27005 at 3/5/19 3:40 PM: --- so we have both a google design doc and the comment above, can you consolidate into 1 place? the google doc might be easier to comment on. I added comments to the google doc was (Author: tgraves): so we have both a google design doc and the comment above, can you consolidate into 1 place? the google doc might be easier to comment on. > Design sketch: Accelerator-aware scheduling > --- > > Key: SPARK-27005 > URL: https://issues.apache.org/jira/browse/SPARK-27005 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Priority: Major > > This task is to outline a design sketch for the accelerator-aware scheduling > SPIP discussion. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23986) CompileException when using too many avg aggregation after joining
[ https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784482#comment-16784482 ] Pedro Fernandes edited comment on SPARK-23986 at 3/5/19 3:39 PM: - -Guys, is there a workaround for the folks that can't upgrade Spark version? Thanks.- Here's my workaround for, say, 10 aggregation operations: # dataframe1 = aggregations 1 to 5 # dataframe2 = aggregations 6 to 10 # dataframe1.join(dataframe2) was (Author: pedromorfeu): ~Guys, Is there a workaround for the folks that can't upgrade Spark version? Thanks.~ Here's my workaround for, say, 10 aggregation operations: # dataframe1 = aggregations 1 to 5 # dataframe2 = aggregations 6 to 10 # dataframe1.join(dataframe2) > CompileException when using too many avg aggregation after joining > -- > > Key: SPARK-23986 > URL: https://issues.apache.org/jira/browse/SPARK-23986 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Michel Davit >Assignee: Marco Gaido >Priority: Major > Fix For: 2.3.1, 2.4.0 > > Attachments: spark-generated.java > > > Considering the following code: > {code:java} > val df1: DataFrame = sparkSession.sparkContext > .makeRDD(Seq((0, 1, 2, 3, 4, 5, 6))) > .toDF("key", "col1", "col2", "col3", "col4", "col5", "col6") > val df2: DataFrame = sparkSession.sparkContext > .makeRDD(Seq((0, "val1", "val2"))) > .toDF("key", "dummy1", "dummy2") > val agg = df1 > .join(df2, df1("key") === df2("key"), "leftouter") > .groupBy(df1("key")) > .agg( > avg("col2").as("avg2"), > avg("col3").as("avg3"), > avg("col4").as("avg4"), > avg("col1").as("avg1"), > avg("col5").as("avg5"), > avg("col6").as("avg6") > ) > val head = agg.take(1) > {code} > This logs the following exception: > {code:java} > ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 467, Column 28: Redefinition of parameter "agg_expr_11" > {code} > I am not a spark expert but after investigation, I 
realized that the > generated {{doConsume}} method is responsible for the exception. > Indeed, {{avg}} calls > {{org.apache.spark.sql.execution.CodegenSupport.constructDoConsumeFunction}} > several times. > The first time with the 'avg' Expr and a second time for the base aggregation > Expr (count and sum). > The problem comes from the generation of parameters in CodeGenerator: > {code:java} > /** >* Returns a term name that is unique within this instance of a > `CodegenContext`. >*/ > def freshName(name: String): String = synchronized { > val fullName = if (freshNamePrefix == "") { > name > } else { > s"${freshNamePrefix}_$name" > } > if (freshNameIds.contains(fullName)) { > val id = freshNameIds(fullName) > freshNameIds(fullName) = id + 1 > s"$fullName$id" > } else { > freshNameIds += fullName -> 1 > fullName > } > } > {code} > The {{freshNameIds}} already contains {{agg_expr_[1..6]}} from the 1st call. > The second call is made with {{agg_expr_[1..12]}} and generates the > following names: > {{agg_expr_[11|21|31|41|51|61|11|12]}}. We then have a parameter name > conflict in the generated code: {{agg_expr_11}}. > Appending the 'id' in s"$fullName$id" to generate a unique term name is the source > of the conflict. Maybe simply using an underscore can solve this issue: > s"${fullName}_$id" -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
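The collision can be reproduced with a stripped-down copy of the {{freshName}} logic quoted above (prefix handling omitted; a sketch for illustration, not the full CodegenContext):

```scala
import scala.collection.mutable

object FreshNameDemo {
  private val freshNameIds = mutable.HashMap.empty[String, Int]

  // Stripped-down version of the freshName logic quoted above:
  // append the current id when a name has been seen before.
  def freshName(fullName: String): String = synchronized {
    if (freshNameIds.contains(fullName)) {
      val id = freshNameIds(fullName)
      freshNameIds(fullName) = id + 1
      s"$fullName$id"
    } else {
      freshNameIds += fullName -> 1
      fullName
    }
  }

  def main(args: Array[String]): Unit = {
    val a = freshName("agg_expr_1")  // first request: "agg_expr_1"
    val b = freshName("agg_expr_1")  // seen before, id 1 appended: "agg_expr_11"
    val c = freshName("agg_expr_11") // never seen, returned as-is: "agg_expr_11"
    println(s"$a, $b, $c (collision: ${b == c})")
  }
}
```

Switching the appended suffix to s"${fullName}_$id" would make the second request yield agg_expr_1_1, which no longer clashes with a literal request for agg_expr_11.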
[jira] [Commented] (SPARK-27060) DDL Commands are accepting Keywords like create, drop as tableName
[ https://issues.apache.org/jira/browse/SPARK-27060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784575#comment-16784575 ] Sachin Ramachandra Setty commented on SPARK-27060: -- I verified this issue with 2.3.2 and 2.4.0 . > DDL Commands are accepting Keywords like create, drop as tableName > -- > > Key: SPARK-27060 > URL: https://issues.apache.org/jira/browse/SPARK-27060 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.2, 2.4.0 >Reporter: Sachin Ramachandra Setty >Priority: Minor > > Seems to be a compatibility issue compared to other components such as hive > and mySql. > DDL commands are successful even though the tableName is same as keyword. > Tested with columnNames as well and issue exists. > Whereas, Hive-Beeline is throwing ParseException and not accepting keywords > as tableName or columnName and mySql is accepting keywords only as columnName. > Spark-Behaviour : > Connected to: Spark SQL (version 2.3.2.0101) > CLI_DBMS_APPID > Beeline version 1.2.1.spark_2.3.2.0101 by Apache Hive > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table create(id int); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.255 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table drop(int int); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.257 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table drop; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.236 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table create; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.168 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table tab1(float float); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.111 seconds) > 0: jdbc:hive2://10.18.XXX:23040/default> create table double(double float); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.093 seconds) > Hive-Behaviour : > Connected to: Apache Hive 
(version 3.1.0) > Driver: Hive JDBC (version 3.1.0) > Transaction isolation: TRANSACTION_REPEATABLE_READ > Beeline version 3.1.0 by Apache Hive > 0: jdbc:hive2://10.18.XXX:21066/> create table create(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:13 > cannot recognize input near 'create' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18.XXX:21066/> create table drop(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:13 > cannot recognize input near 'drop' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18XXX:21066/> create table tab1(float float); > Error: Error while compiling statement: FAILED: ParseException line 1:18 > cannot recognize input near 'float' 'float' ')' in column name or constraint > (state=42000,code=4) > 0: jdbc:hive2://10.18XXX:21066/> drop table create(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:11 > cannot recognize input near 'create' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18.XXX:21066/> drop table drop(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:11 > cannot recognize input near 'drop' '(' 'id' in table name > (state=42000,code=4) > mySql : > CREATE TABLE CREATE(ID integer); > Error: near "CREATE": syntax error > CREATE TABLE DROP(ID integer); > Error: near "DROP": syntax error > CREATE TABLE TAB1(FLOAT FLOAT); > Success -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23986) CompileException when using too many avg aggregation after joining
[ https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784482#comment-16784482 ] Pedro Fernandes edited comment on SPARK-23986 at 3/5/19 3:38 PM: - -Guys, Is there a workaround for the folks that can't upgrade Spark version? Thanks.- Here's my workaround for, say, 10 aggregation operations: # dataframe1 = aggregations 1 to 5 # dataframe2 = aggregations 6 to 10 # dataframe1.join(dataframe2) was (Author: pedromorfeu): Guys, Is there a workaround for the folks that can't upgrade Spark version? Thanks. > CompileException when using too many avg aggregation after joining > -- > > Key: SPARK-23986 > URL: https://issues.apache.org/jira/browse/SPARK-23986 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Michel Davit >Assignee: Marco Gaido >Priority: Major > Fix For: 2.3.1, 2.4.0 > > Attachments: spark-generated.java > > > Considering the following code: > {code:java} > val df1: DataFrame = sparkSession.sparkContext > .makeRDD(Seq((0, 1, 2, 3, 4, 5, 6))) > .toDF("key", "col1", "col2", "col3", "col4", "col5", "col6") > val df2: DataFrame = sparkSession.sparkContext > .makeRDD(Seq((0, "val1", "val2"))) > .toDF("key", "dummy1", "dummy2") > val agg = df1 > .join(df2, df1("key") === df2("key"), "leftouter") > .groupBy(df1("key")) > .agg( > avg("col2").as("avg2"), > avg("col3").as("avg3"), > avg("col4").as("avg4"), > avg("col1").as("avg1"), > avg("col5").as("avg5"), > avg("col6").as("avg6") > ) > val head = agg.take(1) > {code} > This logs the following exception: > {code:java} > ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 467, Column 28: Redefinition of parameter "agg_expr_11" > {code} > I am not a spark expert but after investigation, I realized that the > generated {{doConsume}} method is responsible of the exception. 
> Indeed, {{avg}} calls > {{org.apache.spark.sql.execution.CodegenSupport.constructDoConsumeFunction}} > several times. > The first time with the 'avg' Expr and a second time for the base aggregation > Expr (count and sum). > The problem comes from the generation of parameters in CodeGenerator: > {code:java} > /** >* Returns a term name that is unique within this instance of a > `CodegenContext`. >*/ > def freshName(name: String): String = synchronized { > val fullName = if (freshNamePrefix == "") { > name > } else { > s"${freshNamePrefix}_$name" > } > if (freshNameIds.contains(fullName)) { > val id = freshNameIds(fullName) > freshNameIds(fullName) = id + 1 > s"$fullName$id" > } else { > freshNameIds += fullName -> 1 > fullName > } > } > {code} > The {{freshNameIds}} already contains {{agg_expr_[1..6]}} from the 1st call. > The second call is made with {{agg_expr_[1..12]}} and generates the > following names: > {{agg_expr_[11|21|31|41|51|61|11|12]}}. We then have a parameter name > conflict in the generated code: {{agg_expr_11}}. > Appending the 'id' in s"$fullName$id" to generate a unique term name is the source > of the conflict. Maybe simply using an underscore can solve this issue: > s"${fullName}_$id" -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23986) CompileException when using too many avg aggregation after joining
[ https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784482#comment-16784482 ] Pedro Fernandes edited comment on SPARK-23986 at 3/5/19 3:38 PM: - ~Guys, Is there a workaround for the folks that can't upgrade Spark version? Thanks.~ Here's my workaround for, say, 10 aggregation operations: # dataframe1 = aggregations 1 to 5 # dataframe2 = aggregations 6 to 10 # dataframe1.join(dataframe2) was (Author: pedromorfeu): -Guys, Is there a workaround for the folks that can't upgrade Spark version? Thanks.- Here's my workaround for, say, 10 aggregation operations: # dataframe1 = aggregations 1 to 5 # dataframe2 = aggregations 6 to 10 # dataframe1.join(dataframe2) > CompileException when using too many avg aggregation after joining > -- > > Key: SPARK-23986 > URL: https://issues.apache.org/jira/browse/SPARK-23986 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Michel Davit >Assignee: Marco Gaido >Priority: Major > Fix For: 2.3.1, 2.4.0 > > Attachments: spark-generated.java > > > Considering the following code: > {code:java} > val df1: DataFrame = sparkSession.sparkContext > .makeRDD(Seq((0, 1, 2, 3, 4, 5, 6))) > .toDF("key", "col1", "col2", "col3", "col4", "col5", "col6") > val df2: DataFrame = sparkSession.sparkContext > .makeRDD(Seq((0, "val1", "val2"))) > .toDF("key", "dummy1", "dummy2") > val agg = df1 > .join(df2, df1("key") === df2("key"), "leftouter") > .groupBy(df1("key")) > .agg( > avg("col2").as("avg2"), > avg("col3").as("avg3"), > avg("col4").as("avg4"), > avg("col1").as("avg1"), > avg("col5").as("avg5"), > avg("col6").as("avg6") > ) > val head = agg.take(1) > {code} > This logs the following exception: > {code:java} > ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 467, Column 28: Redefinition of parameter "agg_expr_11" > {code} > I am not a spark expert but after investigation, I 
realized that the > generated {{doConsume}} method is responsible for the exception. > Indeed, {{avg}} calls > {{org.apache.spark.sql.execution.CodegenSupport.constructDoConsumeFunction}} > several times. > The first time with the 'avg' Expr and a second time for the base aggregation > Expr (count and sum). > The problem comes from the generation of parameters in CodeGenerator: > {code:java} > /** >* Returns a term name that is unique within this instance of a > `CodegenContext`. >*/ > def freshName(name: String): String = synchronized { > val fullName = if (freshNamePrefix == "") { > name > } else { > s"${freshNamePrefix}_$name" > } > if (freshNameIds.contains(fullName)) { > val id = freshNameIds(fullName) > freshNameIds(fullName) = id + 1 > s"$fullName$id" > } else { > freshNameIds += fullName -> 1 > fullName > } > } > {code} > The {{freshNameIds}} already contains {{agg_expr_[1..6]}} from the 1st call. > The second call is made with {{agg_expr_[1..12]}} and generates the > following names: > {{agg_expr_[11|21|31|41|51|61|11|12]}}. We then have a parameter name > conflict in the generated code: {{agg_expr_11}}. > Appending the 'id' in s"$fullName$id" to generate a unique term name is the source > of the conflict. Maybe simply using an underscore can solve this issue: > s"${fullName}_$id" -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27060) DDL Commands are accepting Keywords like create, drop as tableName
[ https://issues.apache.org/jira/browse/SPARK-27060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-27060: -- Target Version/s: (was: 2.4.0) Priority: Minor (was: Major) Fix Version/s: (was: 2.3.2) (was: 2.4.0) Don't set Fix or Target Version. This isn't my area, but I agree it seems surprising if you can create a table called "CREATE". Please post your Spark reproduction and version though. > DDL Commands are accepting Keywords like create, drop as tableName > -- > > Key: SPARK-27060 > URL: https://issues.apache.org/jira/browse/SPARK-27060 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.2, 2.4.0 >Reporter: Sachin Ramachandra Setty >Priority: Minor > > Seems to be a compatibility issue compared to other components such as hive > and mySql. > DDL commands are successful even though the tableName is same as keyword. > Tested with columnNames as well and issue exists. > Whereas, Hive-Beeline is throwing ParseException and not accepting keywords > as tableName or columnName and mySql is accepting keywords only as columnName. 
> Spark-Behaviour :
> Connected to: Spark SQL (version 2.3.2.0101)
> CLI_DBMS_APPID
> Beeline version 1.2.1.spark_2.3.2.0101 by Apache Hive
> 0: jdbc:hive2://10.18.3.XXX:23040/default> create table create(id int);
> +-+--+
> | Result |
> +-+--+
> +-+--+
> No rows selected (0.255 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> create table drop(int int);
> +-+--+
> | Result |
> +-+--+
> +-+--+
> No rows selected (0.257 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table drop;
> +-+--+
> | Result |
> +-+--+
> +-+--+
> No rows selected (0.236 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table create;
> +-+--+
> | Result |
> +-+--+
> +-+--+
> No rows selected (0.168 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> create table tab1(float float);
> +-+--+
> | Result |
> +-+--+
> +-+--+
> No rows selected (0.111 seconds)
> 0: jdbc:hive2://10.18.XXX:23040/default> create table double(double float);
> +-+--+
> | Result |
> +-+--+
> +-+--+
> No rows selected (0.093 seconds)
>
> Hive-Behaviour :
> Connected to: Apache Hive (version 3.1.0)
> Driver: Hive JDBC (version 3.1.0)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> Beeline version 3.1.0 by Apache Hive
> 0: jdbc:hive2://10.18.XXX:21066/> create table create(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:13
> cannot recognize input near 'create' '(' 'id' in table name
> (state=42000,code=4)
> 0: jdbc:hive2://10.18.XXX:21066/> create table drop(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:13
> cannot recognize input near 'drop' '(' 'id' in table name
> (state=42000,code=4)
> 0: jdbc:hive2://10.18XXX:21066/> create table tab1(float float);
> Error: Error while compiling statement: FAILED: ParseException line 1:18
> cannot recognize input near 'float' 'float' ')' in column name or constraint
> (state=42000,code=4)
> 0: jdbc:hive2://10.18XXX:21066/> drop table create(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:11
> cannot recognize input near 'create' '(' 'id' in table name
> (state=42000,code=4)
> 0: jdbc:hive2://10.18.XXX:21066/> drop table drop(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:11
> cannot recognize input near 'drop' '(' 'id' in table name
> (state=42000,code=4)
>
> mySql :
> CREATE TABLE CREATE(ID integer);
> Error: near "CREATE": syntax error
> CREATE TABLE DROP(ID integer);
> Error: near "DROP": syntax error
> CREATE TABLE TAB1(FLOAT FLOAT);
> Success

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
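The "mySql" errors quoted above (`near "CREATE": syntax error`) actually match SQLite's error format, so the keyword-rejection behavior the reporter expects from other engines can be reproduced with Python's built-in sqlite3 module. This is an illustrative sketch under that assumption, not the reporter's actual setup:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Reserved keywords such as CREATE and DROP are rejected as unquoted
# table names by the parser, mirroring the transcript above.
for ddl in ("CREATE TABLE CREATE(ID integer)",
            "CREATE TABLE DROP(ID integer)"):
    try:
        cur.execute(ddl)
        print("accepted:", ddl)
    except sqlite3.OperationalError as e:
        print("rejected:", e)

# A non-reserved word like FLOAT is accepted even as both a column
# name and a type name, matching the "Success" case above.
cur.execute("CREATE TABLE TAB1(FLOAT FLOAT)")
print("accepted: CREATE TABLE TAB1(FLOAT FLOAT)")
```

Spark's ANTLR grammar, by contrast, treats most keywords as non-reserved identifiers, which is why the beeline session against Spark SQL accepts them.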
[jira] [Commented] (SPARK-27060) DDL Commands are accepting Keywords like create, drop as tableName
[ https://issues.apache.org/jira/browse/SPARK-27060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784560#comment-16784560 ] Sachin Ramachandra Setty commented on SPARK-27060:
--
cc [~srowen]

> DDL Commands are accepting Keywords like create, drop as tableName
> ------------------------------------------------------------------
>
> Key: SPARK-27060
> URL: https://issues.apache.org/jira/browse/SPARK-27060
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.2, 2.4.0
> Reporter: Sachin Ramachandra Setty
> Priority: Major
> Fix For: 2.3.2, 2.4.0
[jira] [Commented] (SPARK-27005) Design sketch: Accelerator-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784555#comment-16784555 ] Thomas Graves commented on SPARK-27005:
---
So we have both a Google design doc and the comment above; can you consolidate them into one place? The Google doc might be easier to comment on.

> Design sketch: Accelerator-aware scheduling
> -------------------------------------------
>
> Key: SPARK-27005
> URL: https://issues.apache.org/jira/browse/SPARK-27005
> Project: Spark
> Issue Type: Story
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Xingbo Jiang
> Priority: Major
>
> This task is to outline a design sketch for the accelerator-aware scheduling
> SPIP discussion.
[jira] [Comment Edited] (SPARK-27060) DDL Commands are accepting Keywords like create, drop as tableName
[ https://issues.apache.org/jira/browse/SPARK-27060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784503#comment-16784503 ] Sujith Chacko edited comment on SPARK-27060 at 3/5/19 3:11 PM:
---
This looks like a compatibility issue with other engines. Will try to handle these cases. cc [~sro...@scient.com] [cloud-fan|https://github.com/apache/spark/issues?q=is%3Apr+is%3Aopen+author%3Acloud-fan] [~sro...@scient.com] [~sro...@yahoo.com] let us know for any suggestions. Thanks

was (Author: s71955): This looks like a compatibility issue with other engines. Will try to handle this cases. cc [~sro...@scient.com] [cloud-fan|https://github.com/apache/spark/issues?q=is%3Apr+is%3Aopen+author%3Acloud-fan] let us know for any suggestions. Thanks

> DDL Commands are accepting Keywords like create, drop as tableName
> ------------------------------------------------------------------
>
> Key: SPARK-27060
> URL: https://issues.apache.org/jira/browse/SPARK-27060
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.2, 2.4.0
> Reporter: Sachin Ramachandra Setty
> Priority: Major
> Fix For: 2.3.2, 2.4.0
[jira] [Commented] (SPARK-26602) Insert into table fails after querying the UDF which is loaded with wrong hdfs path
[ https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784528#comment-16784528 ] Sean Owen commented on SPARK-26602:
---
If a user adds something to the classpath, it matters to the whole classpath. If it's missing, I think it's surprising to ignore that fact; something else will fail eventually. I understand you're asking: what if it doesn't affect some other UDFs? But I'm not sure we can know that. I would not make this change.

> Insert into table fails after querying the UDF which is loaded with wrong
> hdfs path
> -------------------------------------------------------------------------
>
> Key: SPARK-26602
> URL: https://issues.apache.org/jira/browse/SPARK-26602
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Haripriya
> Priority: Major
> Attachments: beforeFixUdf.txt
>
> In SQL:
> 1. Query the existing UDF (say myFunc1).
> 2. Create and select the UDF registered with an incorrect path (say myFunc2).
> 3. Now query the existing UDF again in the same session - it will throw an
> exception stating that the resource at myFunc2's path couldn't be read.
> 4. Even basic operations like insert and select will fail with the same
> error.
>
> Result:
> java.lang.RuntimeException: Failed to read external resource
> hdfs:///tmp/hari_notexists1/two_udfs.jar
> at org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288)
> at org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242)
> at org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163)
> at org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149)
> at org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67)
> at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737)
> at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706)
> at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
> at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
> at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
> at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
> at org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706)
> at org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696)
> at org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841)
> at org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112)
[jira] [Commented] (SPARK-26602) Insert into table fails after querying the UDF which is loaded with wrong hdfs path
[ https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784526#comment-16784526 ] Chakravarthi commented on SPARK-26602:
--
[~srowen] Agreed, but it should not make other subsequent queries (at least queries that do not refer to that UDF) fail, right? Any insert or select on an existing table itself is failing.
[~ajithshetty] Yes, it makes all subsequent queries fail, not only the query that refers to that UDF.

> Insert into table fails after querying the UDF which is loaded with wrong
> hdfs path
> -------------------------------------------------------------------------
>
> Key: SPARK-26602
> URL: https://issues.apache.org/jira/browse/SPARK-26602
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Haripriya
> Priority: Major
> Attachments: beforeFixUdf.txt
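The failure mode under discussion — one unresolvable ADD JAR poisoning every later statement in the session, because registered resources are re-resolved before each command — can be sketched with a toy model. All names here are hypothetical illustrations; this is not Spark's or Hive's actual code:

```python
# Simplified model of the SPARK-26602 behavior: the session keeps a list of
# registered resources and re-resolves all of them before every command, so
# one bad path makes even unrelated queries fail.
class Session:
    def __init__(self):
        self.resources = []

    def add_resource(self, path, available):
        # Registration succeeds eagerly; availability is only checked later,
        # when the next command re-resolves the full resource list.
        self.resources.append((path, available))

    def run(self, query):
        for path, available in self.resources:
            if not available:
                raise RuntimeError(f"Failed to read external resource {path}")
        return f"ok: {query}"

s = Session()
s.add_resource("hdfs:///udfs/good.jar", available=True)
print(s.run("SELECT myFunc1(col) FROM t"))       # succeeds

s.add_resource("hdfs:///tmp/missing/udf.jar", available=False)
try:
    # An unrelated query also fails, matching the reported behavior.
    s.run("INSERT INTO t VALUES (1)")
except RuntimeError as e:
    print(e)
```

This illustrates why Sean Owen's objection holds: once the resource is part of the session's classpath state, there is no reliable way to know which later queries it does or does not affect.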