[jira] [Commented] (SPARK-26998) spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor processes in Standalone mode

2019-03-05 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16785277#comment-16785277
 ] 

Jungtaek Lim commented on SPARK-26998:
--

[~toopt4]

Yeah, I tend to agree that hiding more credential-related things is better, so 
I'm supportive of the change. Maybe I was thinking about the description of the 
JIRA issue where your patch originally landed.

Btw, is there an existing test or a manual test to verify that the keystore 
password and key password are not used? Just curious - I honestly don't know.
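
For anyone who wants to check by hand, here is a minimal, Linux-only sketch (not 
an existing Spark test; it assumes a running standalone application with a 
SparkContext {{sc}}) that dumps each executor's own command line so it can be 
grepped for spark.ssl.* passwords:

{code:java}
import java.nio.file.{Files, Paths}

// One task per default partition; each task reads its executor JVM's command line
// from /proc/self/cmdline (arguments are NUL-separated, so replace NUL with space).
sc.parallelize(1 to sc.defaultParallelism, sc.defaultParallelism)
  .mapPartitions { _ =>
    val cmdline = new String(Files.readAllBytes(Paths.get("/proc/self/cmdline")))
      .replace('\u0000', ' ')
    Iterator(cmdline)
  }
  .collect()
  .distinct
  .foreach(println)   // grep this output for spark.ssl.keyStorePassword / spark.ssl.keyPassword
{code}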

> spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor 
> processes in Standalone mode
> ---
>
> Key: SPARK-26998
> URL: https://issues.apache.org/jira/browse/SPARK-26998
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Security, Spark Core
>Affects Versions: 2.3.3, 2.4.0
>Reporter: t oo
>Priority: Major
>  Labels: SECURITY, Security, secur, security, security-issue
>
> Run Spark in standalone mode, then start a spark-submit requiring at least 1 
> executor. Do a 'ps -ef' on Linux (i.e. in a PuTTY terminal) and you will be 
> able to see the spark.ssl.keyStorePassword value in plaintext!
>  
> spark.ssl.keyStorePassword and spark.ssl.keyPassword don't need to be passed 
> to CoarseGrainedExecutorBackend. Only spark.ssl.trustStorePassword is used.
>  
> Can be resolved if below PR is merged:
> [[Github] Pull Request #21514 
> (tooptoop4)|https://github.com/apache/spark/pull/21514]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27069) Spark(2.3.1) LDA transformation memory error(java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)

2019-03-05 Thread TAESUK KIM (JIRA)
TAESUK KIM created SPARK-27069:
--

 Summary: Spark(2.3.1) LDA transformation memory 
error(java.lang.OutOfMemoryError at 
java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
 Key: SPARK-27069
 URL: https://issues.apache.org/jira/browse/SPARK-27069
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.3.2
 Environment: Below is my environment

DataSet
 # Document : about 100,000,000 --> 10,000,000 --> 1,000,000(All fail)

 # Word : about 3553918(can't change)

Spark environment
 # executor-memory,driver-memory : 18G --> 32g --> 64 --> 128g(all fail)

 # executor-core,driver-core : 3

 # spark.serializer : default and 
org.apache.spark.serializer.KryoSerializer(both fail)

 # spark.executor.memoryOverhead : 18G --> 36G fail

Java version: 1.8.0_191 (Oracle Corporation)

 
Reporter: TAESUK KIM


I trained an LDA model (feature dimension: 100, iterations: 100 or 50, 
distributed version, ml) using Spark 2.3.2 (emr-5.18.0).
After that I want to transform a new DataSet with that model, but when I 
transform new data I always get a memory-related error.
I changed the data size to 0.1x and then 0.01x, but I always get the same memory 
error (java.lang.OutOfMemoryError at 
java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)).
 
That hugeCapacity error (an overflow) happens when the size of an array exceeds 
Integer.MAX_VALUE - 8. But I already reduced the data to a small size, and I 
can't find why this error still happens.

I also want to change the serializer to KryoSerializer, but I found that 
org.apache.spark.util.ClosureCleaner$.ensureSerializable always calls 
org.apache.spark.serializer.JavaSerializationStream even though I register 
Kryo classes.
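
For reference, a minimal sketch of how Kryo is typically enabled and classes 
registered (the class names below are placeholders, not from this report). Note 
that this only affects data serialization: Spark's closure serializer is always 
the Java serializer, which is why ClosureCleaner still goes through 
JavaSerializationStream.

{code:java}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Placeholder record types used only for this sketch.
case class MyDoc(id: Long, text: String)
case class MyFeatures(id: Long, values: Array[Double])

// Enable Kryo for data (shuffle/cache) serialization and register the classes involved.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.max", "512m") // raise if single records are very large
  .registerKryoClasses(Array(classOf[MyDoc], classOf[MyFeatures]))

val spark = SparkSession.builder().config(conf).getOrCreate()
{code}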
 

Is there anything I can do?

 
Below is the code:

{code:java}
val countvModel = CountVectorizerModel.load("s3://~/")
val ldaModel = DistributedLDAModel.load("s3://~/")
val transformeddata = countvModel.transform(inputData)
  .select("productid", "itemid", "ptkString", "features")
var featureldaDF = ldaModel.transform(transformeddata)
  .select("productid", "itemid", "topicDistribution", "ptkString")
  .toDF("productid", "itemid", "features", "ptkString")
featureldaDF = featureldaDF.persist // this is line 328
{code}

Other testing
 # Java options: UseParallelGC, UseG1GC (both fail)

Below is the log:
{{19/03/05 20:59:03 ERROR ApplicationMaster: User class threw exception: 
java.lang.OutOfMemoryError java.lang.OutOfMemoryError at 
java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123) at 
java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117) at 
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at 
java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153) at 
org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
 at 
java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
 at 
java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189) at 
java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
 at 
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
 at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:342)
 at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:335)
 at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159) at 
org.apache.spark.SparkContext.clean(SparkContext.scala:2299) at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:850) 
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:849) 
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) 
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) at 
org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:849) at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:608)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) 
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
org.apache.spark.sql.execution.columnar.InMemoryRelation.buildBuffers(InMemoryRelation.scala:107)
 at 

[jira] [Created] (SPARK-27056) Remove `start-shuffle-service.sh`

2019-03-05 Thread liuxian (JIRA)
liuxian created SPARK-27056:
---

 Summary: Remove  `start-shuffle-service.sh`
 Key: SPARK-27056
 URL: https://issues.apache.org/jira/browse/SPARK-27056
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Affects Versions: 3.0.0
Reporter: liuxian


_start-shuffle-service.sh_ was only used by Mesos before 
_start-mesos-shuffle-service.sh_ was introduced.
_start-mesos-shuffle-service.sh_ clearly solves some problems and is better 
than _start-shuffle-service.sh_.
So we should now delete _start-shuffle-service.sh_ so that users don't use it 
by mistake.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23339) Spark UI not loading *.js/*.css files, only raw HTML

2019-03-05 Thread gary (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784204#comment-16784204
 ] 

gary commented on SPARK-23339:
--

Excluding servlet-api-2.5.jar and using servlet-api-3.1.0.jar instead works for me.
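
For reference, a hedged sbt sketch of that approach (the coordinates and 
versions are illustrative only - adjust them to whatever is actually pulling the 
old servlet API onto your classpath):

{code:java}
// build.sbt fragment (illustrative): exclude the transitive 2.5-era servlet API
// and add the 3.1.0 API explicitly.
libraryDependencies += ("org.apache.phoenix" % "phoenix-core" % "4.13.0-HBase-1.3")
  .exclude("javax.servlet", "servlet-api")
libraryDependencies += "javax.servlet" % "javax.servlet-api" % "3.1.0"
{code}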

> Spark UI not loading *.js/*.css files, only raw HTML
> 
>
> Key: SPARK-23339
> URL: https://issues.apache.org/jira/browse/SPARK-23339
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.0
> Environment: Spark 2.2.0, YARN, 2 Ubuntu 16.04 nodes, openjdk 
> 1.8.0_151
>Reporter: Erik Baumert
>Priority: Major
> Attachments: 3LCeC.png
>
>
> I have never reported anything before, and hope this is the right place as I 
> think I have come across a bug. If I missed the solution, please feel free to 
> correct me.
> I set up Spark 2.2.0 on a 2-node Ubuntu cluster. I use Jupyter notebook to 
> access the pyspark-shell. However, the UI via 
> [http://IP:4040/|http://ip:4040/] is broken. Has anyone ever seen something 
> like this?
> When I inspect the page in Chrome, it says "Failed to load resource: 
> net::ERR_EMPTY_RESPONSE" for various .js and .css files.
> I did a fresh install and added my configurations until the problem occurred 
> again. Everything worked fine until I edited spark-defaults.conf to contain 
> the following lines:
> spark.driver.extraClassPath 
> /usr/local/phoenix/phoenix-4.13.0-HBase-1.3-client.jar 
> spark.executor.extraClassPath 
> /usr/local/phoenix/phoenix-4.13.0-HBase-1.3-client.jar 
> How can I add these jars to my classpath without breaking the UI? If I just 
> supply them using the --jars parameter in the terminal it works fine, but I'd 
> like to have them configured, as explained in the manual: 
> [https://phoenix.apache.org/phoenix_spark.html]
>  
> I posted the question on Stackoverflow some time ago 
> [here|https://stackoverflow.com/questions/47291547/spark-ui-fails-to-load-js-displays-bare-html]
>  and apparently I'm not the only one 
> ([here|https://stackoverflow.com/questions/47875064/spark-ui-appears-with-wrong-format]).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27055) Update Structured Streaming documentation because of DSv2 changes

2019-03-05 Thread Gabor Somogyi (JIRA)
Gabor Somogyi created SPARK-27055:
-

 Summary: Update Structured Streaming documentation because of DSv2 
changes
 Key: SPARK-27055
 URL: https://issues.apache.org/jira/browse/SPARK-27055
 Project: Spark
  Issue Type: Task
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Gabor Somogyi


Since SPARK-26956 has been merged, the Structured Streaming documentation also 
has to be updated to reflect the changes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27056) Remove `start-shuffle-service.sh`

2019-03-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27056:


Assignee: (was: Apache Spark)

> Remove  `start-shuffle-service.sh`
> --
>
> Key: SPARK-27056
> URL: https://issues.apache.org/jira/browse/SPARK-27056
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Minor
>
> _start-shuffle-service.sh_ was only used by Mesos before 
> _start-mesos-shuffle-service.sh_ was introduced.
> _start-mesos-shuffle-service.sh_ clearly solves some problems and is better 
> than _start-shuffle-service.sh_.
> So we should now delete _start-shuffle-service.sh_ so that users don't use it 
> by mistake.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27056) Remove `start-shuffle-service.sh`

2019-03-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27056:


Assignee: Apache Spark

> Remove  `start-shuffle-service.sh`
> --
>
> Key: SPARK-27056
> URL: https://issues.apache.org/jira/browse/SPARK-27056
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 3.0.0
>Reporter: liuxian
>Assignee: Apache Spark
>Priority: Minor
>
> _start-shuffle-service.sh_ was only used by Mesos before 
> _start-mesos-shuffle-service.sh_ was introduced.
> _start-mesos-shuffle-service.sh_ clearly solves some problems and is better 
> than _start-shuffle-service.sh_.
> So we should now delete _start-shuffle-service.sh_ so that users don't use it 
> by mistake.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27019) Spark UI's SQL tab shows inconsistent values

2019-03-05 Thread Shahid K I (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784194#comment-16784194
 ] 

Shahid K I edited comment on SPARK-27019 at 3/5/19 7:59 AM:


Please upload a screenshot of the SQL page for the second scenario. I don't 
think it will display like that in that case. The issue happens only when the 
new live execution data is overwritten by the existing one.


was (Author: shahid):
Please show me the screenshot of the sql page of the second scenario. I don't 
think in that case it will display like that. The issue happens only when the 
new live execution data is overwritten by the existing one

> Spark UI's SQL tab shows inconsistent values
> 
>
> Key: SPARK-27019
> URL: https://issues.apache.org/jira/browse/SPARK-27019
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.4.0
>Reporter: peay
>Priority: Major
> Attachments: Screenshot from 2019-03-01 21-31-48.png, 
> application_1550040445209_4748, query-1-details.png, query-1-list.png, 
> query-job-1.png, screenshot-spark-ui-details.png, screenshot-spark-ui-list.png
>
>
> Since 2.4.0, I am frequently seeing broken outputs in the SQL tab of the 
> Spark UI, where submitted/duration make no sense, description has the ID 
> instead of the actual description.
> Clicking on the link to open a query, the SQL plan is missing as well.
> I have tried to increase `spark.scheduler.listenerbus.eventqueue.capacity` to 
> very large values like 30k out of paranoia that we may have too many events, 
> but to no avail. I have not identified anything particular that leads to 
> that: it doesn't occur in all my jobs, but it does occur in a lot of them 
> still.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20415) SPARK job hangs while writing DataFrame to HDFS

2019-03-05 Thread Martin Studer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784200#comment-16784200
 ] 

Martin Studer commented on SPARK-20415:
---

I'm observing a similar issue where all executor tasks would hang in the 
following state:

{noformat}
org.apache.spark.unsafe.Platform.copyMemory(Platform.java:210)
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.writeToMemory(UnsafeArrayData.java:363)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_0$(Unknown
 Source)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$apply$5.apply(GenerateExec.scala:120)
org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$apply$5.apply(GenerateExec.scala:118)
scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:167)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
org.apache.spark.scheduler.Task.run(Task.scala:108)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
{noformat}

This is with Spark 2.2.0.

> SPARK job hangs while writing DataFrame to HDFS
> ---
>
> Key: SPARK-20415
> URL: https://issues.apache.org/jira/browse/SPARK-20415
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, YARN
>Affects Versions: 2.1.0
> Environment: EMR 5.4.0
>Reporter: P K
>Priority: Major
>
> We are in the POC phase with Spark. One of the steps is reading compressed json 
> files that come from sources, "exploding" them into tabular format and then 
> writing them to HDFS. This worked for about three weeks until, a few days ago, 
> for a particular dataset, the writer just started hanging. I logged in to the 
> worker machines and saw this stack trace:
> "Executor task launch worker-0" #39 daemon prio=5 os_prio=0 
> tid=0x7f6210352800 nid=0x4542 runnable [0x7f61f52b3000]
>java.lang.Thread.State: RUNNABLE
> at org.apache.spark.unsafe.Platform.copyMemory(Platform.java:210)
> at 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.writeToMemory(UnsafeArrayData.java:311)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply6_2$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:111)
> at 
> org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:109)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
> at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
> at 
> 

[jira] [Commented] (SPARK-27019) Spark UI's SQL tab shows inconsistent values

2019-03-05 Thread peay (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784260#comment-16784260
 ] 

peay commented on SPARK-27019:
--

Yes, I had edited my message above shortly after posting - cannot reproduce in 
the second scenario. Thanks!

> Spark UI's SQL tab shows inconsistent values
> 
>
> Key: SPARK-27019
> URL: https://issues.apache.org/jira/browse/SPARK-27019
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.4.0
>Reporter: peay
>Priority: Major
> Attachments: Screenshot from 2019-03-01 21-31-48.png, 
> application_1550040445209_4748, query-1-details.png, query-1-list.png, 
> query-job-1.png, screenshot-spark-ui-details.png, screenshot-spark-ui-list.png
>
>
> Since 2.4.0, I am frequently seeing broken outputs in the SQL tab of the 
> Spark UI, where submitted/duration make no sense, description has the ID 
> instead of the actual description.
> Clicking on the link to open a query, the SQL plan is missing as well.
> I have tried to increase `spark.scheduler.listenerbus.eventqueue.capacity` to 
> very large values like 30k out of paranoia that we may have too many events, 
> but to no avail. I have not identified anything particular that leads to 
> that: it doesn't occur in all my jobs, but it does occur in a lot of them 
> still.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26850) Make EventLoggingListener LOG_FILE_PERMISSIONS configurable

2019-03-05 Thread Sandeep Katta (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Katta resolved SPARK-26850.
---
Resolution: Duplicate

> Make EventLoggingListener LOG_FILE_PERMISSIONS configurable
> ---
>
> Key: SPARK-26850
> URL: https://issues.apache.org/jira/browse/SPARK-26850
> Project: Spark
>  Issue Type: Wish
>  Components: Scheduler
>Affects Versions: 2.2.3, 2.3.2, 2.4.0
>Reporter: Hua Zhang
>Priority: Minor
>
> private[spark] object EventLoggingListener extends Logging {
> ...
> private val LOG_FILE_PERMISSIONS = new FsPermission(Integer.parseInt("770", 
> 8).toShort)
> ...
> }
>  
> Currently the event log files are hard-coded with permission 770.
> It would be nice if this permission were +configurable+.
> Use case: the Spark application is submitted by user A but the Spark History 
> Server is started by user B. Currently user B cannot access the history event 
> files created by user A. With the permission set to 775, this would be possible.
>  
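
For reference, a minimal sketch of what making this configurable could look like 
({{spark.eventLog.filePermissions}} is a hypothetical key, not an existing Spark 
setting):

{code:java}
import org.apache.hadoop.fs.permission.FsPermission
import org.apache.spark.SparkConf

// Sketch only: read the octal permission string from a (hypothetical) config key,
// falling back to the currently hard-coded 770.
def eventLogFilePermissions(conf: SparkConf): FsPermission = {
  val octal = conf.get("spark.eventLog.filePermissions", "770")
  new FsPermission(Integer.parseInt(octal, 8).toShort)
}
{code}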



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27056) Remove `start-shuffle-service.sh`

2019-03-05 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-27056:

Description: 
_start-shuffle-service.sh_ was only used by Mesos before 
_start-mesos-shuffle-service.sh_.
Obviously, _start-mesos-shuffle-service.sh_ solves some problems, it is better 
than _start-shuffle-service.sh_.
So now we should delete _start-shuffle-service.sh_ in case users use it.

  was:
_start-shuffle-service.sh_ was only used by Mesos before 
_start-mesos-shuffle-service.sh_.
Obviously, _start-mesos-shuffle-service.sh_ solves some problems, it is better 
than start-shuffle-service.sh.
So now we should delete _start-shuffle-service.sh_ in case users use it.


> Remove  `start-shuffle-service.sh`
> --
>
> Key: SPARK-27056
> URL: https://issues.apache.org/jira/browse/SPARK-27056
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Minor
>
> _start-shuffle-service.sh_ was only used by Mesos before 
> _start-mesos-shuffle-service.sh_ was introduced.
> _start-mesos-shuffle-service.sh_ clearly solves some problems and is better 
> than _start-shuffle-service.sh_.
> So we should now delete _start-shuffle-service.sh_ so that users don't use it 
> by mistake.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26602) Insert into table fails after querying the UDF which is loaded with wrong hdfs path

2019-03-05 Thread Chakravarthi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chakravarthi updated SPARK-26602:
-
Attachment: beforeFixUdf.txt

> Insert into table fails after querying the UDF which is loaded with wrong 
> hdfs path
> ---
>
> Key: SPARK-26602
> URL: https://issues.apache.org/jira/browse/SPARK-26602
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Haripriya
>Priority: Major
> Attachments: beforeFixUdf.txt
>
>
> In SQL:
> 1. Query an existing UDF (say myFunc1).
> 2. Create and select a UDF registered with an incorrect path (say myFunc2).
> 3. Now query the existing UDF again in the same session - it will throw an 
> exception stating that the resource at myFunc2's path couldn't be read.
> 4. Even basic operations like insert and select will then fail with the same 
> error.
> Result: 
> java.lang.RuntimeException: Failed to read external resource 
> hdfs:///tmp/hari_notexists1/two_udfs.jar
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149)
>  at 
> org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841)
>  at 
> org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27055) Update Structured Streaming documentation because of DSv2 changes

2019-03-05 Thread Gabor Somogyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-27055:
--
Description: 
Since SPARK-26956 has been merged the Structured Streaming documentation has to 
be updated also to reflect the changes.
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes

  was:Since SPARK-26956 has been merged the Structured Streaming documentation 
has to be updated also to reflect the changes.


> Update Structured Streaming documentation because of DSv2 changes
> -
>
> Key: SPARK-27055
> URL: https://issues.apache.org/jira/browse/SPARK-27055
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Minor
>
> Since SPARK-26956 has been merged, the Structured Streaming documentation also 
> has to be updated to reflect the changes.
> https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27057) Common trait for limit exec operators

2019-03-05 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-27057:
--

 Summary: Common trait for limit exec operators
 Key: SPARK-27057
 URL: https://issues.apache.org/jira/browse/SPARK-27057
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


Currently, CollectLimitExec, LocalLimitExec and GlobalLimitExec share only the 
UnaryExecNode trait as their common trait, which makes it slightly inconvenient 
to distinguish those operators from others. This ticket aims to introduce a new 
trait for all 3 operators.
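
A rough sketch of the idea (the trait name here is illustrative, not the final API):

{code:java}
import org.apache.spark.sql.execution.UnaryExecNode

// Illustrative only: a shared trait so all limit operators can be matched uniformly,
// e.g. plan.collect { case l: LimitExecLike => l.limit }
trait LimitExecLike extends UnaryExecNode {
  def limit: Int
}
{code}

CollectLimitExec, LocalLimitExec and GlobalLimitExec would then extend this 
trait in addition to what they already implement.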



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27059) spark-submit on kubernetes cluster does not recognise k8s --master property

2019-03-05 Thread Andreas Adamides (JIRA)
Andreas Adamides created SPARK-27059:


 Summary: spark-submit on kubernetes cluster does not recognise k8s 
--master property
 Key: SPARK-27059
 URL: https://issues.apache.org/jira/browse/SPARK-27059
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 2.4.0, 2.3.3
Reporter: Andreas Adamides


I have successfully installed a Kubernetes cluster and can verify this by:

{{C:\windows\system32>kubectl cluster-info Kubernetes master is running at 
https://: KubeDNS is running at 
https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}

Then I am trying to run the SparkPi with the Spark I downloaded from 
[https://spark.apache.org/downloads.html] (I tried versions 2.4.0 and 2.3.3).

{{spark-submit --master k8s://https://: --deploy-mode cluster --name 
spark-pi --class org.apache.spark.examples.SparkPi --conf 
spark.executor.instances=2 --conf 
spark.kubernetes.container.image=gettyimages/spark 
c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}

I am getting this error:

{{Error: Master must either be yarn or start with spark, mesos, local Run with 
--help for usage help or --verbose for debug output}}

I also tried:

{{spark-submit --help}}

to see what I can get regarding the *--master* property. This is what I get:

{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.}}

According to the documentation 
[[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on running 
Spark workloads in Kubernetes, spark-submit does not even seem to recognise the 
k8s value for master. [ included in possible Spark masters: 
[https://spark.apache.org/docs/latest/submitting-applications.html#master-urls] 
]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26602) Insert into table fails after querying the UDF which is loaded with wrong hdfs path

2019-03-05 Thread Chakravarthi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784134#comment-16784134
 ] 

Chakravarthi edited comment on SPARK-26602 at 3/5/19 2:31 PM:
--

Hi [~srowen], this issue is not a duplicate of SPARK-26560. Here the issue is: 
insert into a table fails after querying a UDF which was loaded with a wrong 
HDFS path.

Below are the steps to reproduce this issue:

1) Create a table.
sql("create table table1(I int)");

2) Create a UDF using an invalid HDFS path.
sql("CREATE FUNCTION before_fix  AS 
'org.apache.hadoop.hive.ql.udf.generic.GenericUDFLastDayTest' USING JAR 
'hdfs:///tmp/notexist.jar'")

3) Do a select on the UDF and you will get an exception: "Failed to read 
external resource".
 sql(" select  before_fix('2018-03-09')")

4) Perform an insert into a table, or a select on any table. It will fail.
 sql("insert into  table1 values(1)").show
 sql("select * from table1 ").show

Here, insert should work, but it fails.











was (Author: chakravarthi):
Hi [~srowen] , this issue is not duplicate of SPARK-26560. Here the issue 
is,Insert into table fails after querying the UDF which is loaded with wrong 
hdfs path.

Below are the steps to reproduce this issue:

1) create a table.
sql("create table table1(I int)");

2) create udf using invalid hdfs path.
sql("CREATE FUNCTION before_fix  AS 
'org.apache.hadoop.hive.ql.udf.generic.GenericUDFLastDayTest' USING JAR 
'hdfs:///tmp/notexist.jar'")

3) Do select on the UDF  and you will get exception as "Failed to read external 
resource".
 sql(" select  before_fix('2018-03-09')").

4) perform insert table. 
 sql("insert into  table1 values(1)").show

Here ,insert should work.but is fails.










> Insert into table fails after querying the UDF which is loaded with wrong 
> hdfs path
> ---
>
> Key: SPARK-26602
> URL: https://issues.apache.org/jira/browse/SPARK-26602
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Haripriya
>Priority: Major
> Attachments: beforeFixUdf.txt
>
>
> In SQL:
> 1. Query an existing UDF (say myFunc1).
> 2. Create and select a UDF registered with an incorrect path (say myFunc2).
> 3. Now query the existing UDF again in the same session - it will throw an 
> exception stating that the resource at myFunc2's path couldn't be read.
> 4. Even basic operations like insert and select will then fail with the same 
> error.
> Result: 
> java.lang.RuntimeException: Failed to read external resource 
> hdfs:///tmp/hari_notexists1/two_udfs.jar
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149)
>  at 
> org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841)
>  at 
> org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27060) DDL Commands are accepting Keywords like create, drop as tableName

2019-03-05 Thread Sujith Chacko (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784503#comment-16784503
 ] 

Sujith Chacko commented on SPARK-27060:
---

This looks like a compatibility issue with other engines. I will try to handle 
these cases.

cc [~sro...@scient.com] 
[cloud-fan|https://github.com/apache/spark/issues?q=is%3Apr+is%3Aopen+author%3Acloud-fan]
 - let us know if you have any suggestions. Thanks

> DDL Commands are accepting Keywords like create, drop as tableName
> --
>
> Key: SPARK-27060
> URL: https://issues.apache.org/jira/browse/SPARK-27060
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Sachin Ramachandra Setty
>Priority: Major
> Fix For: 2.3.2, 2.4.0
>
>
> Seems to be a compatibility issue compared to other components such as Hive 
> and MySQL. 
> DDL commands succeed even though the tableName is the same as a keyword. 
> Tested with columnNames as well and the issue exists there too. 
> By contrast, Hive-Beeline throws a ParseException and does not accept keywords 
> as a tableName or columnName, and MySQL accepts keywords only as a columnName.
> Spark-Behaviour :
> Connected to: Spark SQL (version 2.3.2.0101)
> CLI_DBMS_APPID
> Beeline version 1.2.1.spark_2.3.2.0101 by Apache Hive
> 0: jdbc:hive2://10.18.3.XXX:23040/default> create table create(id int);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.255 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> create table drop(int int);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.257 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table drop;
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.236 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table create;
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.168 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> create table tab1(float float);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.111 seconds)
> 0: jdbc:hive2://10.18.XXX:23040/default> create table double(double float);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.093 seconds)
> Hive-Behaviour :
> Connected to: Apache Hive (version 3.1.0)
> Driver: Hive JDBC (version 3.1.0)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> Beeline version 3.1.0 by Apache Hive
> 0: jdbc:hive2://10.18.XXX:21066/> create table create(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:13 
> cannot recognize input near 'create' '(' 'id' in table name 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18.XXX:21066/> create table drop(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:13 
> cannot recognize input near 'drop' '(' 'id' in table name 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18XXX:21066/> create table tab1(float float);
> Error: Error while compiling statement: FAILED: ParseException line 1:18 
> cannot recognize input near 'float' 'float' ')' in column name or constraint 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18XXX:21066/> drop table create(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:11 
> cannot recognize input near 'create' '(' 'id' in table name 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18.XXX:21066/> drop table drop(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:11 
> cannot recognize input near 'drop' '(' 'id' in table name 
> (state=42000,code=4)
> mySql :
> CREATE TABLE CREATE(ID integer);
> Error: near "CREATE": syntax error
> CREATE TABLE DROP(ID integer);
> Error: near "DROP": syntax error
> CREATE TABLE TAB1(FLOAT FLOAT);
> Success



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27058) Support mounting host dirs for K8s tests

2019-03-05 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784396#comment-16784396
 ] 

Stavros Kontopoulos commented on SPARK-27058:
-

[~shaneknapp] added this to keep track of things, as you requested.

> Support mounting host dirs for K8s tests
> 
>
> Key: SPARK-27058
> URL: https://issues.apache.org/jira/browse/SPARK-27058
> Project: Spark
>  Issue Type: Improvement
>  Components: jenkins
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> According to the discussion 
> [here|https://github.com/apache/spark/pull/23514], supporting PVs tests 
> requires mounting a host dir.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27058) Support mounting host dirs for K8s tests

2019-03-05 Thread Stavros Kontopoulos (JIRA)
Stavros Kontopoulos created SPARK-27058:
---

 Summary: Support mounting host dirs for K8s tests
 Key: SPARK-27058
 URL: https://issues.apache.org/jira/browse/SPARK-27058
 Project: Spark
  Issue Type: Improvement
  Components: jenkins
Affects Versions: 3.0.0
Reporter: Stavros Kontopoulos


According to the discussion [here|https://github.com/apache/spark/pull/23514], 
supporting PVs tests requires mounting a host dir.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27059) spark-submit on kubernetes cluster does not recognise k8s --master property

2019-03-05 Thread Andreas Adamides (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Adamides updated SPARK-27059:
-
Description: 
I have successfully installed a Kubernetes cluster and can verify this by:

{{C:\windows\system32>kubectl cluster-info }}
{{*Kubernetes master is running at https://:* }}
{{ *{{KubeDNS is running at 
https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}*}}

Trying to run the SparkPi with the Spark I downloaded from 
[https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3)

*{{spark-submit --master k8s://https://: --deploy-mode cluster --name 
spark-pi --class org.apache.spark.examples.SparkPi --conf 
spark.executor.instances=2 --conf 
spark.kubernetes.container.image=gettyimages/spark 
c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}*

 

I am getting this error:

*{{Error: Master must either be yarn or start with spark, mesos, local Run with 
--help for usage help or --verbose for debug output}}*

I also tried:

*{{spark-submit --help}}*

to see what I can get regarding the *--master* property. This is what I get:

*{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.}}*

 

According to the documentation 
[[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on running 
Spark workloads in Kubernetes, spark-submit does not even seem to recognise the 
k8s value for master. [ included in possible Spark masters: 
[https://spark.apache.org/docs/latest/submitting-applications.html#master-urls] 
]

 

  was:
I have successfully installed a Kubernetes cluster and can verify this by:


{{C:\windows\system32>kubectl cluster-info }}
{{Kubernetes master is running at https://: }}
{{KubeDNS is running at 
https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}

 

Trying to run the SparkPi with the Spark I downloaded from 
[https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3)


{{spark-submit --master k8s://https://: --deploy-mode cluster --name 
spark-pi --class org.apache.spark.examples.SparkPi --conf 
spark.executor.instances=2 --conf 
spark.kubernetes.container.image=gettyimages/spark 
c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}

 

I am getting this error:

 

{{Error: Master must either be yarn or start with spark, mesos, local Run with 
--help for usage help or --verbose for debug output}}

 

I also tried:

 

{{spark-submit --help}}

 

to see what I can get regarding the *--master* property. This is what I get:

 

{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.}}

 

According to the documentation 
[[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on running 
Spark workloads in Kubernetes, spark-submit does not even seem to recognise the 
k8s value for master. [ included in possible Spark masters: 
[https://spark.apache.org/docs/latest/submitting-applications.html#master-urls] 
]

 


> spark-submit on kubernetes cluster does not recognise k8s --master property
> ---
>
> Key: SPARK-27059
> URL: https://issues.apache.org/jira/browse/SPARK-27059
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.3, 2.4.0
>Reporter: Andreas Adamides
>Priority: Blocker
>
> I have successfully installed a Kubernetes cluster and can verify this by:
> {{C:\windows\system32>kubectl cluster-info }}
> {{*Kubernetes master is running at https://:* }}
> {{ *{{KubeDNS is running at 
> https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}*}}
> Trying to run the SparkPi with the Spark I downloaded from 
> [https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3)
> *{{spark-submit --master k8s://https://: --deploy-mode cluster 
> --name spark-pi --class org.apache.spark.examples.SparkPi --conf 
> spark.executor.instances=2 --conf 
> spark.kubernetes.container.image=gettyimages/spark 
> c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}*
>  
> I am getting this error:
> *{{Error: Master must either be yarn or start with spark, mesos, local Run 
> with --help for usage help or --verbose for debug output}}*
> I also tried:
> *{{spark-submit --help}}*
> to see what I can get regarding the *--master* property. This is what I get:
> *{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or 
> local.}}*
>  
> According to the documentation 
> [[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on 
> running Spark workloads in Kubernetes, spark-submit does not even seem to 
> recognise the k8s value for master. [ included in possible Spark masters: 
> 

[jira] [Commented] (SPARK-26998) spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor processes in Standalone mode

2019-03-05 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784434#comment-16784434
 ] 

Jungtaek Lim commented on SPARK-26998:
--

If I understand correctly, the PR would mitigate the issue (it removes some 
unnecessary password parameters being passed) but not completely solve it, 
since the truststore password parameter will still be passed as it is today.

To handle the issue properly we would need secure storage for sharing the 
security information.

> spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor 
> processes in Standalone mode
> ---
>
> Key: SPARK-26998
> URL: https://issues.apache.org/jira/browse/SPARK-26998
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Security, Spark Core
>Affects Versions: 2.3.3, 2.4.0
>Reporter: t oo
>Priority: Major
>  Labels: SECURITY, Security, secur, security, security-issue
>
> Run Spark in standalone mode, then start a spark-submit requiring at least 1 
> executor. Do a 'ps -ef' on Linux (i.e. in a PuTTY terminal) and you will be 
> able to see the spark.ssl.keyStorePassword value in plaintext!
>  
> spark.ssl.keyStorePassword and spark.ssl.keyPassword don't need to be passed 
> to CoarseGrainedExecutorBackend. Only spark.ssl.trustStorePassword is used.
>  
> Can be resolved if below PR is merged:
> [[Github] Pull Request #21514 
> (tooptoop4)|https://github.com/apache/spark/pull/21514]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26602) Insert into table fails after querying the UDF which is loaded with wrong hdfs path

2019-03-05 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784446#comment-16784446
 ] 

Sean Owen commented on SPARK-26602:
---

That sounds like user error. I'd close this as NotAProblem. It will cause a 
less-clear error later anyway

> Insert into table fails after querying the UDF which is loaded with wrong 
> hdfs path
> ---
>
> Key: SPARK-26602
> URL: https://issues.apache.org/jira/browse/SPARK-26602
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Haripriya
>Priority: Major
> Attachments: beforeFixUdf.txt
>
>
> In SQL:
> 1. Query an existing UDF (say myFunc1).
> 2. Create and select a UDF registered with an incorrect path (say myFunc2).
> 3. Now query the existing UDF again in the same session - it will throw an 
> exception stating that the resource at myFunc2's path couldn't be read.
> 4. Even basic operations like insert and select will then fail with the same 
> error.
> Result: 
> java.lang.RuntimeException: Failed to read external resource 
> hdfs:///tmp/hari_notexists1/two_udfs.jar
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149)
>  at 
> org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841)
>  at 
> org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23986) CompileException when using too many avg aggregation after joining

2019-03-05 Thread Pedro Fernandes (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784482#comment-16784482
 ] 

Pedro Fernandes commented on SPARK-23986:
-

Guys,

Is there a workaround for the folks that can't upgrade their Spark version?

Thanks.

> CompileException when using too many avg aggregation after joining
> --
>
> Key: SPARK-23986
> URL: https://issues.apache.org/jira/browse/SPARK-23986
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Michel Davit
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
> Attachments: spark-generated.java
>
>
> Considering the following code:
> {code:java}
> val df1: DataFrame = sparkSession.sparkContext
>   .makeRDD(Seq((0, 1, 2, 3, 4, 5, 6)))
>   .toDF("key", "col1", "col2", "col3", "col4", "col5", "col6")
> val df2: DataFrame = sparkSession.sparkContext
>   .makeRDD(Seq((0, "val1", "val2")))
>   .toDF("key", "dummy1", "dummy2")
> val agg = df1
>   .join(df2, df1("key") === df2("key"), "leftouter")
>   .groupBy(df1("key"))
>   .agg(
> avg("col2").as("avg2"),
> avg("col3").as("avg3"),
> avg("col4").as("avg4"),
> avg("col1").as("avg1"),
> avg("col5").as("avg5"),
> avg("col6").as("avg6")
>   )
> val head = agg.take(1)
> {code}
> This logs the following exception:
> {code:java}
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 467, Column 28: Redefinition of parameter "agg_expr_11"
> {code}
> I am not a Spark expert, but after investigation I realized that the 
> generated {{doConsume}} method is responsible for the exception.
> Indeed, {{avg}} calls 
> {{org.apache.spark.sql.execution.CodegenSupport.constructDoConsumeFunction}} 
> several times: the first time with the 'avg' Expr and a second time for the 
> base aggregation Expr (count and sum).
> The problem comes from the generation of parameters in CodeGenerator:
> {code:java}
>   /**
>* Returns a term name that is unique within this instance of a 
> `CodegenContext`.
>*/
>   def freshName(name: String): String = synchronized {
> val fullName = if (freshNamePrefix == "") {
>   name
> } else {
>   s"${freshNamePrefix}_$name"
> }
> if (freshNameIds.contains(fullName)) {
>   val id = freshNameIds(fullName)
>   freshNameIds(fullName) = id + 1
>   s"$fullName$id"
> } else {
>   freshNameIds += fullName -> 1
>   fullName
> }
>   }
> {code}
> The {{freshNameIds}} map already contains {{agg_expr_[1..6]}} from the 1st call.
>  The second call is made with {{agg_expr_[1..12]}} and generates the 
> following names:
>  {{agg_expr_[11|21|31|41|51|61|11|12]}}. We then have a parameter name 
> conflict in the generated code: {{agg_expr_11}}.
> Appending the 'id' in s"$fullName$id" to generate a unique term name is the 
> source of the conflict. Maybe simply using an underscore can solve this 
> issue: $fullName_$id"
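
For reference, a tiny self-contained demo (simplified from the snippet above; 
{{freshNamePrefix}} handling omitted) that reproduces the described collision:

{code:java}
import scala.collection.mutable

// Simplified copy of freshName: same "append the id" scheme, no prefix handling.
val freshNameIds = mutable.HashMap.empty[String, Int]
def freshName(name: String): String =
  if (freshNameIds.contains(name)) {
    val id = freshNameIds(name)
    freshNameIds(name) = id + 1
    s"$name$id"
  } else {
    freshNameIds += name -> 1
    name
  }

(1 to 6).foreach(i => freshName(s"agg_expr_$i"))         // 1st call registers agg_expr_1..6
val second = (1 to 12).map(i => freshName(s"agg_expr_$i"))
// agg_expr_1 becomes "agg_expr_11", colliding with the fresh name for agg_expr_11:
println(second.diff(second.distinct))                    // Vector(agg_expr_11)
{code}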



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26602) Insert into table fails after querying the UDF which is loaded with wrong hdfs path

2019-03-05 Thread Ajith S (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784506#comment-16784506
 ] 

Ajith S commented on SPARK-26602:
-

[~chakravarthi] Hi, thanks for reporting the issue. From your example it looks 
like a missing jar will cause any subsequent SQL statements (*even ones which do 
not refer to the UDF*) to fail in this session as well. Right? cc [~srowen]

 

> Insert into table fails after querying the UDF which is loaded with wrong 
> hdfs path
> ---
>
> Key: SPARK-26602
> URL: https://issues.apache.org/jira/browse/SPARK-26602
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Haripriya
>Priority: Major
> Attachments: beforeFixUdf.txt
>
>
> In SQL:
> 1. Query an existing UDF (say myFunc1).
> 2. Create and select a UDF registered with an incorrect path (say myFunc2).
> 3. Now query the existing UDF again in the same session - it will throw an 
> exception stating that the resource at myFunc2's path couldn't be read.
> 4. Even basic operations like insert and select will then fail with the same 
> error.
> Result: 
> java.lang.RuntimeException: Failed to read external resource 
> hdfs:///tmp/hari_notexists1/two_udfs.jar
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149)
>  at 
> org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841)
>  at 
> org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26923) Refactor ArrowRRunner and RRunner to share the same base

2019-03-05 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26923:
-
Summary: Refactor ArrowRRunner and RRunner to share the same base  (was: 
Refactor ArrowRRunner and RRunner to deduplicate codes)

> Refactor ArrowRRunner and RRunner to share the same base
> 
>
> Key: SPARK-26923
> URL: https://issues.apache.org/jira/browse/SPARK-26923
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> ArrowRRunner and RRunner already have duplicated code. We should refactor and 
> deduplicate them. Also, ArrowRRunner happens to contain rather hacky code 
> (see 
> https://github.com/apache/spark/pull/23787/files#diff-a0b6a11cc2e2299455c795fe3c96b823R61
> ).
> We might even be able to deduplicate some code with the PythonRunners.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26998) spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor processes in Standalone mode

2019-03-05 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784391#comment-16784391
 ] 

Gabor Somogyi edited comment on SPARK-26998 at 3/5/19 12:51 PM:


[~toopt4] thanks for the info. Are you working on this? If not, I'm happy to 
push the solution forward.


was (Author: gsomogyi):
[~toopt4] thanks for the info. Are you working on this? If not happy to pushing 
the solution forward.

> spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor 
> processes in Standalone mode
> ---
>
> Key: SPARK-26998
> URL: https://issues.apache.org/jira/browse/SPARK-26998
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Security, Spark Core
>Affects Versions: 2.3.3, 2.4.0
>Reporter: t oo
>Priority: Major
>  Labels: SECURITY, Security, secur, security, security-issue
>
> Run spark standalone mode, then start a spark-submit requiring at least 1 
> executor. Do a 'ps -ef' on linux (ie putty terminal) and you will be able to 
> see  spark.ssl.keyStorePassword value in plaintext!
>  
> spark.ssl.keyStorePassword and  spark.ssl.keyPassword don't need to be passed 
> to  CoarseGrainedExecutorBackend. Only  spark.ssl.trustStorePassword is used.
>  
> Can be resolved if below PR is merged:
> [[Github] Pull Request #21514 
> (tooptoop4)|https://github.com/apache/spark/pull/21514]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27057) Common trait for limit exec operators

2019-03-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27057:


Assignee: Apache Spark

> Common trait for limit exec operators
> -
>
> Key: SPARK-27057
> URL: https://issues.apache.org/jira/browse/SPARK-27057
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Trivial
>
> Currently, CollectLimitExec, LocalLimitExec and GlobalLimitExec have the 
> UnaryExecNode trait as the common trait. It is slightly inconvenient to 
> distinguish those operators from others. The ticket aims to introduce new 
> trait for all 3 operators. 
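A minimal standalone sketch of the idea (plain Scala, not the actual Spark change; {{BaseLimitExec}} and the case class names here are made up): a shared trait lets callers match all three limit operators with a single case.

{code:scala}
// Illustrative only: one common trait for the three limit operators.
object LimitTraitSketch {
  trait BaseLimitExec { def limit: Int }

  case class CollectLimitSketch(limit: Int) extends BaseLimitExec
  case class LocalLimitSketch(limit: Int)   extends BaseLimitExec
  case class GlobalLimitSketch(limit: Int)  extends BaseLimitExec

  // One case instead of matching each concrete operator class.
  def limitOf(plan: Any): Option[Int] = plan match {
    case l: BaseLimitExec => Some(l.limit)
    case _                => None
  }

  def main(args: Array[String]): Unit = {
    // prints: List(Some(5), Some(7), None)
    println(Seq(CollectLimitSketch(5), LocalLimitSketch(7), "other").map(limitOf))
  }
}
{code}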



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27057) Common trait for limit exec operators

2019-03-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27057:


Assignee: (was: Apache Spark)

> Common trait for limit exec operators
> -
>
> Key: SPARK-27057
> URL: https://issues.apache.org/jira/browse/SPARK-27057
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Trivial
>
> Currently, CollectLimitExec, LocalLimitExec and GlobalLimitExec have the 
> UnaryExecNode trait as the common trait. It is slightly inconvenient to 
> distinguish those operators from others. The ticket aims to introduce new 
> trait for all 3 operators. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26923) Refactor ArrowRRunner and RRunner to share the same base

2019-03-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26923:


Assignee: Apache Spark

> Refactor ArrowRRunner and RRunner to share the same base
> 
>
> Key: SPARK-26923
> URL: https://issues.apache.org/jira/browse/SPARK-26923
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> ArrowRRunner and RRunner already have duplicated code. We should refactor and 
> deduplicate them. Also, ArrowRRunner happens to contain rather hacky code 
> (see 
> https://github.com/apache/spark/pull/23787/files#diff-a0b6a11cc2e2299455c795fe3c96b823R61
> ).
> We might even be able to deduplicate some code with the PythonRunners.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20415) SPARK job hangs while writing DataFrame to HDFS

2019-03-05 Thread Martin Studer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784475#comment-16784475
 ] 

Martin Studer commented on SPARK-20415:
---

In fact, I believe the underlying issue is actually SPARK-21657. To be 
confirmed.
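For reference, a hedged spark-shell style sketch of the workload pattern described in the report quoted below (read compressed JSON, explode a nested array into rows, write to HDFS). The report uses PySpark; the equivalent DataFrame calls are shown in Scala, and the paths and the {{records}} column name are invented, not taken from the report.

{code:scala}
// Illustrative pattern only; paths and column names are not from the report.
// Assumes the `spark` session provided by spark-shell.
import org.apache.spark.sql.functions.{col, explode}

val raw  = spark.read.json("hdfs:///poc/input/*.json.gz")      // compressed JSON input
val flat = raw
  .select(explode(col("records")).as("rec"))                   // one row per array element
  .select("rec.*")                                             // flatten the struct fields
flat.write.mode("overwrite").parquet("hdfs:///poc/output/")    // write back to HDFS
{code}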

> SPARK job hangs while writing DataFrame to HDFS
> ---
>
> Key: SPARK-20415
> URL: https://issues.apache.org/jira/browse/SPARK-20415
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, YARN
>Affects Versions: 2.1.0
> Environment: EMR 5.4.0
>Reporter: P K
>Priority: Major
>
> We are in a POC phase with Spark. One of the steps is reading compressed json 
> files that come from sources, "exploding" them into tabular format and then 
> writing them to HDFS. This worked for about three weeks until a few days ago, 
> when, for a particular dataset, the writer just hangs. I logged in to the 
> worker machines and saw this stack trace:
> "Executor task launch worker-0" #39 daemon prio=5 os_prio=0 
> tid=0x7f6210352800 nid=0x4542 runnable [0x7f61f52b3000]
>java.lang.Thread.State: RUNNABLE
> at org.apache.spark.unsafe.Platform.copyMemory(Platform.java:210)
> at 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.writeToMemory(UnsafeArrayData.java:311)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply6_2$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:111)
> at 
> org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:109)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
> at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
> at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> The last messages ever printed in stderr before the hang are:
> 17/04/18 01:41:14 INFO DAGScheduler: Final stage: ResultStage 4 (save at 
> NativeMethodAccessorImpl.java:0)
> 17/04/18 01:41:14 INFO DAGScheduler: Parents of final stage: List()
> 17/04/18 01:41:14 INFO DAGScheduler: Missing parents: List()
> 17/04/18 01:41:14 INFO DAGScheduler: Submitting ResultStage 4 
> (MapPartitionsRDD[31] at save at NativeMethodAccessorImpl.java:0), which has 
> no missing parents
> 17/04/18 01:41:14 INFO MemoryStore: Block broadcast_9 stored as values in 
> memory (estimated size 170.5 KB, 

[jira] [Comment Edited] (SPARK-27060) DDL Commands are accepting Keywords like create, drop as tableName

2019-03-05 Thread Sujith Chacko (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784503#comment-16784503
 ] 

Sujith Chacko edited comment on SPARK-27060 at 3/5/19 3:11 PM:
---

This looks like a compatibility issue with other engines. I will try to handle 
these cases.

cc [~sro...@scient.com] 
[cloud-fan|https://github.com/apache/spark/issues?q=is%3Apr+is%3Aopen+author%3Acloud-fan]
 [~sro...@scient.com] [~sro...@yahoo.com] let us know for any suggestions. 
Thanks


was (Author: s71955):
This looks like a compatibility issue with other engines. Will try to handle 
this cases.

cc [~sro...@scient.com] 
[cloud-fan|https://github.com/apache/spark/issues?q=is%3Apr+is%3Aopen+author%3Acloud-fan]
 let us know for any suggestions. Thanks

> DDL Commands are accepting Keywords like create, drop as tableName
> --
>
> Key: SPARK-27060
> URL: https://issues.apache.org/jira/browse/SPARK-27060
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Sachin Ramachandra Setty
>Priority: Major
> Fix For: 2.3.2, 2.4.0
>
>
> This seems to be a compatibility issue compared to other components such as 
> Hive and MySQL. 
> DDL commands succeed even though the tableName is the same as a keyword. 
> Tested with columnNames as well; the issue exists there too. 
> Whereas Hive-Beeline throws a ParseException and does not accept keywords as 
> tableName or columnName, MySQL accepts keywords only as columnName.
> Spark-Behaviour :
> Connected to: Spark SQL (version 2.3.2.0101)
> CLI_DBMS_APPID
> Beeline version 1.2.1.spark_2.3.2.0101 by Apache Hive
> 0: jdbc:hive2://10.18.3.XXX:23040/default> create table create(id int);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.255 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> create table drop(int int);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.257 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table drop;
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.236 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table create;
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.168 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> create table tab1(float float);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.111 seconds)
> 0: jdbc:hive2://10.18.XXX:23040/default> create table double(double float);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.093 seconds)
> Hive-Behaviour :
> Connected to: Apache Hive (version 3.1.0)
> Driver: Hive JDBC (version 3.1.0)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> Beeline version 3.1.0 by Apache Hive
> 0: jdbc:hive2://10.18.XXX:21066/> create table create(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:13 
> cannot recognize input near 'create' '(' 'id' in table name 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18.XXX:21066/> create table drop(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:13 
> cannot recognize input near 'drop' '(' 'id' in table name 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18XXX:21066/> create table tab1(float float);
> Error: Error while compiling statement: FAILED: ParseException line 1:18 
> cannot recognize input near 'float' 'float' ')' in column name or constraint 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18XXX:21066/> drop table create(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:11 
> cannot recognize input near 'create' '(' 'id' in table name 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18.XXX:21066/> drop table drop(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:11 
> cannot recognize input near 'drop' '(' 'id' in table name 
> (state=42000,code=4)
> mySql :
> CREATE TABLE CREATE(ID integer);
> Error: near "CREATE": syntax error
> CREATE TABLE DROP(ID integer);
> Error: near "DROP": syntax error
> CREATE TABLE TAB1(FLOAT FLOAT);
> Success



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27059) spark-submit on kubernetes cluster does not recognise k8s --master property

2019-03-05 Thread Andreas Adamides (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Adamides updated SPARK-27059:
-
Description: 
I have successfully installed a Kubernetes cluster and can verify this by:


{{C:\windows\system32>kubectl cluster-info }}
{{Kubernetes master is running at https://: }}
{{KubeDNS is running at 
https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}

 

Trying to run the SparkPi with the Spark I downloaded from 
[https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3)


{{spark-submit --master k8s://https://: --deploy-mode cluster --name 
spark-pi --class org.apache.spark.examples.SparkPi --conf 
spark.executor.instances=2 --conf 
spark.kubernetes.container.image=gettyimages/spark 
c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}

 

I am getting this error:

 

{{Error: Master must either be yarn or start with spark, mesos, local Run with 
--help for usage help or --verbose for debug output}}

 

I also tried:

 

{{spark-submit --help}}

 

to see what I can get regarding the *--master* property. This is what I get:

 

{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.}}

 

According to the documentation 
[[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on running 
Spark workloads in Kubernetes, spark-submit does not even seem to recognise the 
k8s value for master. [ included in possible Spark masters: 
[https://spark.apache.org/docs/latest/submitting-applications.html#master-urls] 
]

 

  was:
I have successfully installed a Kubernetes cluster and can verify this by:

 

 

{{C:\windows\system32>kubectl cluster-info Kubernetes master is running at 
https://: KubeDNS is running at 
https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}

 

 

Then I am trying to run the SparkPi with the Spark I downloaded from 
[https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3)

 

 

{{spark-submit --master k8s://https://: --deploy-mode cluster --name 
spark-pi --class org.apache.spark.examples.SparkPi --conf 
spark.executor.instances=2 --conf 
spark.kubernetes.container.image=gettyimages/spark 
c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}

 

 

I am getting this error:

 

 

{{Error: Master must either be yarn or start with spark, mesos, local Run with 
--help for usage help or --verbose for debug output}}

 

 

I also tried:

 

 

{{spark-submit --help}}

 

 

to see what I can get regarding the *--master* property. This is what I get:

 

 

{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.}}

 

 

According to the documentation 
[[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on running 
Spark workloads in Kubernetes, spark-submit does not even seem to recognise the 
k8s value for master. [ included in possible Spark masters: 
[https://spark.apache.org/docs/latest/submitting-applications.html#master-urls] 
]

 


> spark-submit on kubernetes cluster does not recognise k8s --master property
> ---
>
> Key: SPARK-27059
> URL: https://issues.apache.org/jira/browse/SPARK-27059
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.3, 2.4.0
>Reporter: Andreas Adamides
>Priority: Blocker
>
> I have successfully installed a Kubernetes cluster and can verify this by:
> {{C:\windows\system32>kubectl cluster-info }}
> {{Kubernetes master is running at https://: }}
> {{KubeDNS is running at 
> https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}
>  
> Trying to run the SparkPi with the Spark I downloaded from 
> [https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3)
> {{spark-submit --master k8s://https://: --deploy-mode cluster 
> --name spark-pi --class org.apache.spark.examples.SparkPi --conf 
> spark.executor.instances=2 --conf 
> spark.kubernetes.container.image=gettyimages/spark 
> c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}
>  
> I am getting this error:
>  
> {{Error: Master must either be yarn or start with spark, mesos, local Run 
> with --help for usage help or --verbose for debug output}}
>  
> I also tried:
>  
> {{spark-submit --help}}
>  
> to see what I can get regarding the *--master* property. This is what I get:
>  
> {{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.}}
>  
> According to the documentation 
> [[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on 
> running Spark workloads in Kubernetes, spark-submit does not even seem to 
> recognise the k8s value for master. [ included in possible Spark masters: 
> 

[jira] [Updated] (SPARK-27059) spark-submit on kubernetes cluster does not recognise k8s --master property

2019-03-05 Thread Andreas Adamides (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Adamides updated SPARK-27059:
-
Description: 
I have successfully installed a Kubernetes cluster and can verify this by:

{{C:\windows\system32>kubectl cluster-info }}
 {{*Kubernetes master is running at https://:* }}
 *{{KubeDNS is running at 
https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}*

Trying to run the SparkPi with the Spark I downloaded from 
[https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3)

*{{spark-submit --master k8s://https://: --deploy-mode cluster --name 
spark-pi --class org.apache.spark.examples.SparkPi --conf 
spark.executor.instances=2 --conf 
spark.kubernetes.container.image=gettyimages/spark 
c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}*

I am getting this error:

*{{Error: Master must either be yarn or start with spark, mesos, local Run with 
--help for usage help or --verbose for debug output}}*

I also tried:

*{{spark-submit --help}}*

to see what I can get regarding the *--master* property. This is what I get:

*{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.}}*

 

According to the documentation 
[[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on running 
Spark workloads in Kubernetes, spark-submit does not even seem to recognise the 
k8s value for master. [ included in possible Spark masters: 
[https://spark.apache.org/docs/latest/submitting-applications.html#master-urls] 
]

 

  was:
I have successfully installed a Kubernetes cluster and can verify this by:

{{C:\windows\system32>kubectl cluster-info }}
 {{*Kubernetes master is running at https://:* }}
*{{KubeDNS is running at 
https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}*

Trying to run the SparkPi with the Spark I downloaded from 
[https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3)

*{{spark-submit --master k8s://https://: --deploy-mode cluster --name 
spark-pi --class org.apache.spark.examples.SparkPi --conf 
spark.executor.instances=2 --conf 
spark.kubernetes.container.image=gettyimages/spark 
c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}*

 

I am getting this error:

*{{Error: Master must either be yarn or start with spark, mesos, local Run with 
--help for usage help or --verbose for debug output}}*

I also tried:

*{{spark-submit --help}}*

to see what I can get regarding the *--master* property. This is what I get:

*{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.}}*

 

According to the documentation 
[[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on running 
Spark workloads in Kubernetes, spark-submit does not even seem to recognise the 
k8s value for master. [ included in possible Spark masters: 
[https://spark.apache.org/docs/latest/submitting-applications.html#master-urls] 
]

 


> spark-submit on kubernetes cluster does not recognise k8s --master property
> ---
>
> Key: SPARK-27059
> URL: https://issues.apache.org/jira/browse/SPARK-27059
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.3, 2.4.0
>Reporter: Andreas Adamides
>Priority: Blocker
>
> I have successfully installed a Kubernetes cluster and can verify this by:
> {{C:\windows\system32>kubectl cluster-info }}
>  {{*Kubernetes master is running at https://:* }}
>  *{{KubeDNS is running at 
> https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}*
> Trying to run the SparkPi with the Spark I downloaded from 
> [https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3)
> *{{spark-submit --master k8s://https://: --deploy-mode cluster 
> --name spark-pi --class org.apache.spark.examples.SparkPi --conf 
> spark.executor.instances=2 --conf 
> spark.kubernetes.container.image=gettyimages/spark 
> c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}*
> I am getting this error:
> *{{Error: Master must either be yarn or start with spark, mesos, local Run 
> with --help for usage help or --verbose for debug output}}*
> I also tried:
> *{{spark-submit --help}}*
> to see what I can get regarding the *--master* property. This is what I get:
> *{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or 
> local.}}*
>  
> According to the documentation 
> [[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on 
> running Spark workloads in Kubernetes, spark-submit does not even seem to 
> recognise the k8s value for master. [ included in possible Spark masters: 
> 

[jira] [Commented] (SPARK-26972) Issue with CSV import and inferSchema set to true

2019-03-05 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784453#comment-16784453
 ] 

Sean Owen commented on SPARK-26972:
---

I haven't checked when it's case-insensitive, but to be clear: you should test 
vs master, with a correct setting of multiLine and lineSep to correctly parse 
this. So far these seem to explain all the behavior you see.
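For reference, a spark-shell style sketch of the read with the settings discussed (option names follow the DataFrameReader CSV options; whether these exact values are the right ones for the attached books.csv is an assumption here):

{code:scala}
// Illustrative only: exercise multiLine together with the custom sep/quote used
// by the attached books.csv, and re-test against master as suggested above.
// Assumes the `spark` session provided by spark-shell.
val df = spark.read
  .option("header", "true")
  .option("multiLine", "true")   // sample titles wrap across physical lines inside quotes
  .option("sep", ";")
  .option("quote", "*")
  .option("dateFormat", "M/d/y")
  .option("inferSchema", "true")
  // .option("lineSep", "\n")    // set only if the file's line separator is non-default
  .csv("data/books.csv")

df.show(7)
df.printSchema()
{code}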

> Issue with CSV import and inferSchema set to true
> -
>
> Key: SPARK-26972
> URL: https://issues.apache.org/jira/browse/SPARK-26972
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.1.3, 2.3.3, 2.4.0
> Environment: Java 8/Scala 2.11/MacOs
>Reporter: Jean Georges Perrin
>Priority: Major
> Attachments: ComplexCsvToDataframeApp.java, 
> ComplexCsvToDataframeWithSchemaApp.java, books.csv, issue.txt, pom.xml
>
>
>  
> I found a few discrepencies while working with inferSchema set to true in CSV 
> ingestion.
> Given the following CSV in the attached books.csv:
> {noformat}
> id;authorId;title;releaseDate;link
> 1;1;Fantastic Beasts and Where to Find Them: The Original 
> Screenplay;11/18/16;http://amzn.to/2kup94P
> 2;1;*Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry 
> Potter; Book 1)*;10/6/15;http://amzn.to/2l2lSwP
> 3;1;*The Tales of Beedle the Bard, Standard Edition (Harry 
> Potter)*;12/4/08;http://amzn.to/2kYezqr
> 4;1;*Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry 
> Potter; Book 2)*;10/4/16;http://amzn.to/2kYhL5n
> 5;2;*Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the 
> Apple; the Coffee; and a Great Database*;4/23/17;http://amzn.to/2i3mthT
> 6;2;*Development Tools in 2006: any Room for a 4GL-style Language?
> An independent study by Jean Georges Perrin, IIUG Board 
> Member*;12/28/16;http://amzn.to/2vBxOe1
> 7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav
> 8;3;A Connecticut Yankee in King Arthur's Court;6/17/17;http://amzn.to/2x1NuoD
> 10;4;Jacques le Fataliste;3/1/00;http://amzn.to/2uZj2KA
> 11;4;Diderot Encyclopedia: The Complete Illustrations 
> 1762-1777;;http://amzn.to/2i2zo3I
> 12;;A Woman in Berlin;7/11/06;http://amzn.to/2i472WZ
> 13;6;Spring Boot in Action;1/3/16;http://amzn.to/2hCPktW
> 14;6;Spring in Action: Covers Spring 4;11/28/14;http://amzn.to/2yJLyCk
> 15;7;Soft Skills: The software developer's life 
> manual;12/29/14;http://amzn.to/2zNnSyn
> 16;8;Of Mice and Men;;http://amzn.to/2zJjXoc
> 17;9;*Java 8 in Action: Lambdas; Streams; and functional-style 
> programming*;8/28/14;http://amzn.to/2isdqoL
> 18;12;Hamlet;6/8/12;http://amzn.to/2yRbewY
> 19;13;Pensées;12/31/1670;http://amzn.to/2jweHOG
> 20;14;*Fables choisies; mises en vers par M. de La 
> Fontaine*;9/1/1999;http://amzn.to/2yRH10W
> 21;15;Discourse on Method and Meditations on First 
> Philosophy;6/15/1999;http://amzn.to/2hwB8zc
> 22;12;Twelfth Night;7/1/4;http://amzn.to/2zPYnwo
> 23;12;Macbeth;7/1/3;http://amzn.to/2zPYnwo{noformat}
> And this Java code:
> {code:java}
> Dataset df = spark.read().format("csv")
>  .option("header", "true")
>  .option("multiline", true)
>  .option("sep", ";")
>  .option("quote", "*")
>  .option("dateFormat", "M/d/y")
>  .option("inferSchema", true)
>  .load("data/books.csv");
> df.show(7);
> df.printSchema();
> {code}
> h1. In Spark v2.0.1
> Output: 
> {noformat}
> +---+++---++
> | id|authorId|   title|releaseDate|link|
> +---+++---++
> |  1|   1|Fantastic Beasts ...|   11/18/16|http://amzn.to/2k...|
> |  2|   1|Harry Potter and ...|10/6/15|http://amzn.to/2l...|
> |  3|   1|The Tales of Beed...|12/4/08|http://amzn.to/2k...|
> |  4|   1|Harry Potter and ...|10/4/16|http://amzn.to/2k...|
> |  5|   2|Informix 12.10 on...|4/23/17|http://amzn.to/2i...|
> |  6|   2|Development Tools...|   12/28/16|http://amzn.to/2v...|
> |  7|   3|Adventures of Huc...|.   5/26/94|http://amzn.to/2w...|
> +---+++---++
> only showing top 7 rows
> Dataframe's schema:
> root
> |-- id: integer (nullable = true)
> |-- authorId: integer (nullable = true)
> |-- title: string (nullable = true)
> |-- releaseDate: string (nullable = true)
> |-- link: string (nullable = true)
> {noformat}
> *This is fine and the expected output*.
> h1. Using Apache Spark v2.1.3
> Excerpt of the dataframe content: 
> {noformat}
> ++++---++
> | id|authorId| title|releaseDate| link|
> ++++---++
> | 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|
> 

[jira] [Commented] (SPARK-27060) DDL Commands are accepting Keywords like create, drop as tableName

2019-03-05 Thread Sachin Ramachandra Setty (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784560#comment-16784560
 ] 

Sachin Ramachandra Setty commented on SPARK-27060:
--

cc [~srowen] 

> DDL Commands are accepting Keywords like create, drop as tableName
> --
>
> Key: SPARK-27060
> URL: https://issues.apache.org/jira/browse/SPARK-27060
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Sachin Ramachandra Setty
>Priority: Major
> Fix For: 2.3.2, 2.4.0
>
>
> This seems to be a compatibility issue compared to other components such as 
> Hive and MySQL. 
> DDL commands succeed even though the tableName is the same as a keyword. 
> Tested with columnNames as well; the issue exists there too. 
> Whereas Hive-Beeline throws a ParseException and does not accept keywords as 
> tableName or columnName, MySQL accepts keywords only as columnName.
> Spark-Behaviour :
> Connected to: Spark SQL (version 2.3.2.0101)
> CLI_DBMS_APPID
> Beeline version 1.2.1.spark_2.3.2.0101 by Apache Hive
> 0: jdbc:hive2://10.18.3.XXX:23040/default> create table create(id int);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.255 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> create table drop(int int);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.257 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table drop;
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.236 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table create;
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.168 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> create table tab1(float float);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.111 seconds)
> 0: jdbc:hive2://10.18.XXX:23040/default> create table double(double float);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.093 seconds)
> Hive-Behaviour :
> Connected to: Apache Hive (version 3.1.0)
> Driver: Hive JDBC (version 3.1.0)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> Beeline version 3.1.0 by Apache Hive
> 0: jdbc:hive2://10.18.XXX:21066/> create table create(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:13 
> cannot recognize input near 'create' '(' 'id' in table name 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18.XXX:21066/> create table drop(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:13 
> cannot recognize input near 'drop' '(' 'id' in table name 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18XXX:21066/> create table tab1(float float);
> Error: Error while compiling statement: FAILED: ParseException line 1:18 
> cannot recognize input near 'float' 'float' ')' in column name or constraint 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18XXX:21066/> drop table create(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:11 
> cannot recognize input near 'create' '(' 'id' in table name 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18.XXX:21066/> drop table drop(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:11 
> cannot recognize input near 'drop' '(' 'id' in table name 
> (state=42000,code=4)
> mySql :
> CREATE TABLE CREATE(ID integer);
> Error: near "CREATE": syntax error
> CREATE TABLE DROP(ID integer);
> Error: near "DROP": syntax error
> CREATE TABLE TAB1(FLOAT FLOAT);
> Success



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26923) Refactor ArrowRRunner and RRunner to share the same base

2019-03-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26923:


Assignee: (was: Apache Spark)

> Refactor ArrowRRunner and RRunner to share the same base
> 
>
> Key: SPARK-26923
> URL: https://issues.apache.org/jira/browse/SPARK-26923
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> ArrowRRunner and RRunner already have duplicated code. We should refactor and 
> deduplicate them. Also, ArrowRRunner happens to contain rather hacky code 
> (see 
> https://github.com/apache/spark/pull/23787/files#diff-a0b6a11cc2e2299455c795fe3c96b823R61
> ).
> We might even be able to deduplicate some code with the PythonRunners.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27059) spark-submit on kubernetes cluster does not recognise k8s --master property

2019-03-05 Thread Andreas Adamides (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Adamides updated SPARK-27059:
-
Description: 
I have successfully installed a Kubernetes cluster and can verify this by:

{{C:\windows\system32>kubectl cluster-info }}
 {{*Kubernetes master is running at https://:* }}
 *{{KubeDNS is running at 
https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}*

Trying to run the SparkPi with the Spark release I downloaded from 
[https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3)

*{{spark-submit --master k8s://https://: --deploy-mode cluster --name 
spark-pi --class org.apache.spark.examples.SparkPi --conf 
spark.executor.instances=2 --conf 
spark.kubernetes.container.image=gettyimages/spark 
c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}*

I am getting this error:

*{{Error: Master must either be yarn or start with spark, mesos, local Run with 
--help for usage help or --verbose for debug output}}*

I also tried:

*{{spark-submit --help}}*

to see what I can get regarding the *--master* property. This is what I get:

*{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.}}*

 

According to the documentation 
[[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on running 
Spark workloads in Kubernetes, spark-submit does not even seem to recognise the 
k8s value for master. [ included in possible Spark masters: 
[https://spark.apache.org/docs/latest/submitting-applications.html#master-urls] 
]

 

  was:
I have successfully installed a Kubernetes cluster and can verify this by:

{{C:\windows\system32>kubectl cluster-info }}
 {{*Kubernetes master is running at https://:* }}
 *{{KubeDNS is running at 
https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}*

Trying to run the SparkPi with the Spark I downloaded from 
[https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3)

*{{spark-submit --master k8s://https://: --deploy-mode cluster --name 
spark-pi --class org.apache.spark.examples.SparkPi --conf 
spark.executor.instances=2 --conf 
spark.kubernetes.container.image=gettyimages/spark 
c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}*

I am getting this error:

*{{Error: Master must either be yarn or start with spark, mesos, local Run with 
--help for usage help or --verbose for debug output}}*

I also tried:

*{{spark-submit --help}}*

to see what I can get regarding the *--master* property. This is what I get:

*{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.}}*

 

According to the documentation 
[[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on running 
Spark workloads in Kubernetes, spark-submit does not even seem to recognise the 
k8s value for master. [ included in possible Spark masters: 
[https://spark.apache.org/docs/latest/submitting-applications.html#master-urls] 
]

 


> spark-submit on kubernetes cluster does not recognise k8s --master property
> ---
>
> Key: SPARK-27059
> URL: https://issues.apache.org/jira/browse/SPARK-27059
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.3, 2.4.0
>Reporter: Andreas Adamides
>Priority: Blocker
>
> I have successfully installed a Kubernetes cluster and can verify this by:
> {{C:\windows\system32>kubectl cluster-info }}
>  {{*Kubernetes master is running at https://:* }}
>  *{{KubeDNS is running at 
> https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}*
> Trying to run the SparkPi with the Spark release I downloaded from 
> [https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3)
> *{{spark-submit --master k8s://https://: --deploy-mode cluster 
> --name spark-pi --class org.apache.spark.examples.SparkPi --conf 
> spark.executor.instances=2 --conf 
> spark.kubernetes.container.image=gettyimages/spark 
> c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}*
> I am getting this error:
> *{{Error: Master must either be yarn or start with spark, mesos, local Run 
> with --help for usage help or --verbose for debug output}}*
> I also tried:
> *{{spark-submit --help}}*
> to see what I can get regarding the *--master* property. This is what I get:
> *{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or 
> local.}}*
>  
> According to the documentation 
> [[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on 
> running Spark workloads in Kubernetes, spark-submit does not even seem to 
> recognise the k8s value for master. [ included in possible Spark masters: 
> 

[jira] [Commented] (SPARK-26998) spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor processes in Standalone mode

2019-03-05 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784391#comment-16784391
 ] 

Gabor Somogyi commented on SPARK-26998:
---

[~toopt4] thanks for the info. Are you working on this? If not, I'm happy to 
push the solution forward.

> spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor 
> processes in Standalone mode
> ---
>
> Key: SPARK-26998
> URL: https://issues.apache.org/jira/browse/SPARK-26998
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Security, Spark Core
>Affects Versions: 2.3.3, 2.4.0
>Reporter: t oo
>Priority: Major
>  Labels: SECURITY, Security, secur, security, security-issue
>
> Run spark standalone mode, then start a spark-submit requiring at least 1 
> executor. Do a 'ps -ef' on linux (ie putty terminal) and you will be able to 
> see  spark.ssl.keyStorePassword value in plaintext!
>  
> spark.ssl.keyStorePassword and  spark.ssl.keyPassword don't need to be passed 
> to  CoarseGrainedExecutorBackend. Only  spark.ssl.trustStorePassword is used.
>  
> Can be resolved if below PR is merged:
> [[Github] Pull Request #21514 
> (tooptoop4)|https://github.com/apache/spark/pull/21514]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26998) spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor processes in Standalone mode

2019-03-05 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784406#comment-16784406
 ] 

Gabor Somogyi commented on SPARK-26998:
---

{quote}
Can be resolved if below PR is merged:

[[Github] Pull Request #21514 
(tooptoop4)|https://github.com/apache/spark/pull/21514]
{quote}
I think that's just not true. #21514 solves a UI problem where an application's 
'name' URLs point to http instead of https (even when SSL is enabled).
Have I missed something?


> spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor 
> processes in Standalone mode
> ---
>
> Key: SPARK-26998
> URL: https://issues.apache.org/jira/browse/SPARK-26998
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Security, Spark Core
>Affects Versions: 2.3.3, 2.4.0
>Reporter: t oo
>Priority: Major
>  Labels: SECURITY, Security, secur, security, security-issue
>
> Run spark standalone mode, then start a spark-submit requiring at least 1 
> executor. Do a 'ps -ef' on linux (ie putty terminal) and you will be able to 
> see  spark.ssl.keyStorePassword value in plaintext!
>  
> spark.ssl.keyStorePassword and  spark.ssl.keyPassword don't need to be passed 
> to  CoarseGrainedExecutorBackend. Only  spark.ssl.trustStorePassword is used.
>  
> Can be resolved if below PR is merged:
> [[Github] Pull Request #21514 
> (tooptoop4)|https://github.com/apache/spark/pull/21514]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26998) spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor processes in Standalone mode

2019-03-05 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784410#comment-16784410
 ] 

Gabor Somogyi commented on SPARK-26998:
---

Ahaaa, I see now. Two problems were being tackled in one PR.

> spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor 
> processes in Standalone mode
> ---
>
> Key: SPARK-26998
> URL: https://issues.apache.org/jira/browse/SPARK-26998
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Security, Spark Core
>Affects Versions: 2.3.3, 2.4.0
>Reporter: t oo
>Priority: Major
>  Labels: SECURITY, Security, secur, security, security-issue
>
> Run spark standalone mode, then start a spark-submit requiring at least 1 
> executor. Do a 'ps -ef' on linux (ie putty terminal) and you will be able to 
> see  spark.ssl.keyStorePassword value in plaintext!
>  
> spark.ssl.keyStorePassword and  spark.ssl.keyPassword don't need to be passed 
> to  CoarseGrainedExecutorBackend. Only  spark.ssl.trustStorePassword is used.
>  
> Can be resolved if below PR is merged:
> [[Github] Pull Request #21514 
> (tooptoop4)|https://github.com/apache/spark/pull/21514]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27059) spark-submit on kubernetes cluster does not recognise k8s --master property

2019-03-05 Thread Andreas Adamides (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Adamides updated SPARK-27059:
-
Description: 
I have successfully installed a Kubernetes cluster and can verify this by:

{{C:\windows\system32>kubectl cluster-info }}
 {{*Kubernetes master is running at https://:* }}
*{{KubeDNS is running at 
https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}*

Trying to run the SparkPi with the Spark I downloaded from 
[https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3)

*{{spark-submit --master k8s://https://: --deploy-mode cluster --name 
spark-pi --class org.apache.spark.examples.SparkPi --conf 
spark.executor.instances=2 --conf 
spark.kubernetes.container.image=gettyimages/spark 
c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}*

 

I am getting this error:

*{{Error: Master must either be yarn or start with spark, mesos, local Run with 
--help for usage help or --verbose for debug output}}*

I also tried:

*{{spark-submit --help}}*

to see what I can get regarding the *--master* property. This is what I get:

*{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.}}*

 

According to the documentation 
[[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on running 
Spark workloads in Kubernetes, spark-submit does not even seem to recognise the 
k8s value for master. [ included in possible Spark masters: 
[https://spark.apache.org/docs/latest/submitting-applications.html#master-urls] 
]

 

  was:
I have successfully installed a Kubernetes cluster and can verify this by:

{{C:\windows\system32>kubectl cluster-info }}
{{*Kubernetes master is running at https://:* }}
{{ *{{KubeDNS is running at 
https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}*}}

Trying to run the SparkPi with the Spark I downloaded from 
[https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3)

*{{spark-submit --master k8s://https://: --deploy-mode cluster --name 
spark-pi --class org.apache.spark.examples.SparkPi --conf 
spark.executor.instances=2 --conf 
spark.kubernetes.container.image=gettyimages/spark 
c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}*

 

I am getting this error:

*{{Error: Master must either be yarn or start with spark, mesos, local Run with 
--help for usage help or --verbose for debug output}}*

I also tried:

*{{spark-submit --help}}*

to see what I can get regarding the *--master* property. This is what I get:

*{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.}}*

 

According to the documentation 
[[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on running 
Spark workloads in Kubernetes, spark-submit does not even seem to recognise the 
k8s value for master. [ included in possible Spark masters: 
[https://spark.apache.org/docs/latest/submitting-applications.html#master-urls] 
]

 


> spark-submit on kubernetes cluster does not recognise k8s --master property
> ---
>
> Key: SPARK-27059
> URL: https://issues.apache.org/jira/browse/SPARK-27059
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.3, 2.4.0
>Reporter: Andreas Adamides
>Priority: Blocker
>
> I have successfully installed a Kubernetes cluster and can verify this by:
> {{C:\windows\system32>kubectl cluster-info }}
>  {{*Kubernetes master is running at https://:* }}
> *{{KubeDNS is running at 
> https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}*
> Trying to run the SparkPi with the Spark I downloaded from 
> [https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3)
> *{{spark-submit --master k8s://https://: --deploy-mode cluster 
> --name spark-pi --class org.apache.spark.examples.SparkPi --conf 
> spark.executor.instances=2 --conf 
> spark.kubernetes.container.image=gettyimages/spark 
> c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}*
>  
> I am getting this error:
> *{{Error: Master must either be yarn or start with spark, mesos, local Run 
> with --help for usage help or --verbose for debug output}}*
> I also tried:
> *{{spark-submit --help}}*
> to see what I can get regarding the *--master* property. This is what I get:
> *{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or 
> local.}}*
>  
> According to the documentation 
> [[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on 
> running Spark workloads in Kubernetes, spark-submit does not even seem to 
> recognise the k8s value for master. [ included in possible Spark masters: 
> 

[jira] [Comment Edited] (SPARK-26602) Insert into table fails after querying the UDF which is loaded with wrong hdfs path

2019-03-05 Thread Chakravarthi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784134#comment-16784134
 ] 

Chakravarthi edited comment on SPARK-26602 at 3/5/19 1:52 PM:
--

Hi [~srowen], this issue is not a duplicate of SPARK-26560. Here the issue is: 
an insert into a table fails after querying a UDF which was loaded with a wrong 
hdfs path.

Below are the steps to reproduce this issue:

1) Create a table.
sql("create table table1(I int)")

2) Create a UDF using an invalid hdfs path.
sql("CREATE FUNCTION before_fix AS 
'org.apache.hadoop.hive.ql.udf.generic.GenericUDFLastDayTest' USING JAR 
'hdfs:///tmp/notexist.jar'")

3) Do a select on the UDF; it throws an exception: "Failed to read external 
resource".
sql("select before_fix('2018-03-09')")

4) Perform an insert into the table.
sql("insert into table1 values(1)").show

Here the insert should work, but it fails.











was (Author: chakravarthi):
Hi [~srowen] , this issue is not duplicate of SPARK-26560. Here the issue 
is,Insert into table fails after querying the UDF which is loaded with wrong 
hdfs path.

Below are the steps to reproduce this issue:

1) create a table.
sql("create table check_udf(I int)");

2) create udf using invalid hdfs path.
sql("CREATE FUNCTION before_fix  AS 
'org.apache.hadoop.hive.ql.udf.generic.GenericUDFLastDayTest' USING JAR 
'hdfs:///tmp/notexist.jar'")

3) Do select on the UDF  and you will get exception as "Failed to read external 
resource".
 sql(" select  before_fix('2018-03-09')").

4) perform insert table. 
 sql("insert into  check_udf values(1)").show

Here ,insert should work.but is fails.










> Insert into table fails after querying the UDF which is loaded with wrong 
> hdfs path
> ---
>
> Key: SPARK-26602
> URL: https://issues.apache.org/jira/browse/SPARK-26602
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Haripriya
>Priority: Major
> Attachments: beforeFixUdf.txt
>
>
> In sql,
> 1. Query the existing udf (say myFunc1)
> 2. Create and select the udf registered with an incorrect path (say myFunc2)
> 3. Now again query the existing udf in the same session - it will throw an 
> exception stating that it couldn't read the resource at myFunc2's path
> 4. Even basic operations like insert and select will fail with the same 
> error
> Result: 
> java.lang.RuntimeException: Failed to read external resource 
> hdfs:///tmp/hari_notexists1/two_udfs.jar
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149)
>  at 
> org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841)
>  at 
> org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27060) DDL Commands are accepting Keywords like create, drop as tableName

2019-03-05 Thread Sachin Ramachandra Setty (JIRA)
Sachin Ramachandra Setty created SPARK-27060:


 Summary: DDL Commands are accepting Keywords like create, drop as 
tableName
 Key: SPARK-27060
 URL: https://issues.apache.org/jira/browse/SPARK-27060
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0, 2.3.2
Reporter: Sachin Ramachandra Setty
 Fix For: 2.4.0, 2.3.2


This seems to be a compatibility issue compared to other components such as Hive 
and MySQL. 
DDL commands succeed even though the tableName is the same as a keyword. 

Tested with columnNames as well; the issue exists there too. 

Whereas Hive-Beeline throws a ParseException and does not accept keywords as 
tableName or columnName, MySQL accepts keywords only as columnName.


Spark-Behaviour :

Connected to: Spark SQL (version 2.3.2.0101)
CLI_DBMS_APPID
Beeline version 1.2.1.spark_2.3.2.0101 by Apache Hive
0: jdbc:hive2://10.18.3.XXX:23040/default> create table create(id int);
+-+--+
| Result  |
+-+--+
+-+--+
No rows selected (0.255 seconds)
0: jdbc:hive2://10.18.3.XXX:23040/default> create table drop(int int);
+-+--+
| Result  |
+-+--+
+-+--+
No rows selected (0.257 seconds)
0: jdbc:hive2://10.18.3.XXX:23040/default> drop table drop;
+-+--+
| Result  |
+-+--+
+-+--+
No rows selected (0.236 seconds)
0: jdbc:hive2://10.18.3.XXX:23040/default> drop table create;
+-+--+
| Result  |
+-+--+
+-+--+
No rows selected (0.168 seconds)
0: jdbc:hive2://10.18.3.XXX:23040/default> create table tab1(float float);
+-+--+
| Result  |
+-+--+
+-+--+
No rows selected (0.111 seconds)
0: jdbc:hive2://10.18.XXX:23040/default> create table double(double float);
+-+--+
| Result  |
+-+--+
+-+--+
No rows selected (0.093 seconds)



Hive-Behaviour :

Connected to: Apache Hive (version 3.1.0)
Driver: Hive JDBC (version 3.1.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 3.1.0 by Apache Hive

0: jdbc:hive2://10.18.XXX:21066/> create table create(id int);
Error: Error while compiling statement: FAILED: ParseException line 1:13 cannot 
recognize input near 'create' '(' 'id' in table name (state=42000,code=4)

0: jdbc:hive2://10.18.XXX:21066/> create table drop(id int);
Error: Error while compiling statement: FAILED: ParseException line 1:13 cannot 
recognize input near 'drop' '(' 'id' in table name (state=42000,code=4)

0: jdbc:hive2://10.18XXX:21066/> create table tab1(float float);
Error: Error while compiling statement: FAILED: ParseException line 1:18 cannot 
recognize input near 'float' 'float' ')' in column name or constraint 
(state=42000,code=4)

0: jdbc:hive2://10.18XXX:21066/> drop table create(id int);
Error: Error while compiling statement: FAILED: ParseException line 1:11 cannot 
recognize input near 'create' '(' 'id' in table name (state=42000,code=4)

0: jdbc:hive2://10.18.XXX:21066/> drop table drop(id int);
Error: Error while compiling statement: FAILED: ParseException line 1:11 cannot 
recognize input near 'drop' '(' 'id' in table name (state=42000,code=4)

mySql :
CREATE TABLE CREATE(ID integer);
Error: near "CREATE": syntax error

CREATE TABLE DROP(ID integer);
Error: near "DROP": syntax error

CREATE TABLE TAB1(FLOAT FLOAT);
Success
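
Below is a minimal sketch of the same check from the Spark API (a sketch only: the 
SparkSession setup is illustrative, and the table/column names are just the keyword 
examples above):

{code:java}
import org.apache.spark.sql.SparkSession

// Hypothetical reproduction sketch: on the affected versions these statements
// are reported to succeed in Spark SQL, whereas Hive-Beeline rejects them with
// a ParseException.
val spark = SparkSession.builder()
  .appName("ddl-keyword-check")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("CREATE TABLE create (id INT)")    // keyword used as a table name
spark.sql("CREATE TABLE drop (id INT)")
spark.sql("CREATE TABLE tab1 (float FLOAT)") // keyword used as a column name
spark.sql("DROP TABLE drop")
spark.sql("DROP TABLE create")
{code}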








--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26602) Insert into table fails after querying the UDF which is loaded with wrong hdfs path

2019-03-05 Thread Chakravarthi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784526#comment-16784526
 ] 

Chakravarthi commented on SPARK-26602:
--

[~srowen] Agreed, but it should not make every other subsequent query (at least 
queries which do not refer to that UDF) fail, right? Any insert or select on 
an existing table itself is failing. 

[~ajithshetty] Yes, it makes all subsequent queries fail, not only the query 
which refers to that UDF.
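
For reference, a rough sketch of the reproduction as I understand it (the function 
and table names follow the description below; the class name and JAR path are made 
up for illustration):

{code:java}
// Hypothetical reproduction sketch: after registering a function whose JAR path
// does not exist, later statements in the same session are reported to fail
// with "Failed to read external resource ...".
spark.sql("SELECT myFunc1(col1) FROM t1")                        // existing UDF, works

spark.sql("""CREATE FUNCTION myFunc2 AS 'com.example.SomeUDF'
             USING JAR 'hdfs:///tmp/does_not_exist/udf.jar'""")  // wrong path
spark.sql("SELECT myFunc2(col1) FROM t1")                        // fails, as expected

spark.sql("SELECT myFunc1(col1) FROM t1")                        // now also fails
spark.sql("INSERT INTO t1 VALUES (1)")                           // and so does this
{code}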

> Insert into table fails after querying the UDF which is loaded with wrong 
> hdfs path
> ---
>
> Key: SPARK-26602
> URL: https://issues.apache.org/jira/browse/SPARK-26602
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Haripriya
>Priority: Major
> Attachments: beforeFixUdf.txt
>
>
> In SQL:
> 1. Query the existing UDF (say myFunc1)
> 2. Create and select a UDF registered with an incorrect path (say myFunc2)
> 3. Query the existing UDF again in the same session - it will throw an exception 
> stating that the resource at myFunc2's path couldn't be read
> 4. Even basic operations like insert and select will then fail with the same 
> error
> Result: 
> java.lang.RuntimeException: Failed to read external resource 
> hdfs:///tmp/hari_notexists1/two_udfs.jar
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149)
>  at 
> org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841)
>  at 
> org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26602) Insert into table fails after querying the UDF which is loaded with wrong hdfs path

2019-03-05 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784528#comment-16784528
 ] 

Sean Owen commented on SPARK-26602:
---

If a user adds something to the classpath, it matters to the whole classpath. 
If it's missing, I think it's surprising to ignore that fact; something else 
will fail eventually. I understand you're asking: what if it doesn't affect 
some other UDFs? But I'm not sure we can know that. I would not make this 
change.

> Insert into table fails after querying the UDF which is loaded with wrong 
> hdfs path
> ---
>
> Key: SPARK-26602
> URL: https://issues.apache.org/jira/browse/SPARK-26602
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Haripriya
>Priority: Major
> Attachments: beforeFixUdf.txt
>
>
> In SQL:
> 1. Query the existing UDF (say myFunc1)
> 2. Create and select a UDF registered with an incorrect path (say myFunc2)
> 3. Query the existing UDF again in the same session - it will throw an exception 
> stating that the resource at myFunc2's path couldn't be read
> 4. Even basic operations like insert and select will then fail with the same 
> error
> Result: 
> java.lang.RuntimeException: Failed to read external resource 
> hdfs:///tmp/hari_notexists1/two_udfs.jar
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149)
>  at 
> org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841)
>  at 
> org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27005) Design sketch: Accelerator-aware scheduling

2019-03-05 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784555#comment-16784555
 ] 

Thomas Graves commented on SPARK-27005:
---

So we have both a Google design doc and the comment above; can you consolidate 
them into one place? The Google doc might be easier to comment on.

> Design sketch: Accelerator-aware scheduling
> ---
>
> Key: SPARK-27005
> URL: https://issues.apache.org/jira/browse/SPARK-27005
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xingbo Jiang
>Priority: Major
>
> This task is to outline a design sketch for the accelerator-aware scheduling 
> SPIP discussion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26602) Subsequent queries are failing after querying the UDF which is loaded with wrong hdfs path

2019-03-05 Thread Chakravarthi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chakravarthi updated SPARK-26602:
-
Summary: Subsequent queries are failing after querying the UDF which is 
loaded with wrong hdfs path  (was: Insert into table fails after querying the 
UDF which is loaded with wrong hdfs path)

> Subsequent queries are failing after querying the UDF which is loaded with 
> wrong hdfs path
> --
>
> Key: SPARK-26602
> URL: https://issues.apache.org/jira/browse/SPARK-26602
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Haripriya
>Priority: Major
> Attachments: beforeFixUdf.txt
>
>
> In SQL:
> 1. Query the existing UDF (say myFunc1)
> 2. Create and select a UDF registered with an incorrect path (say myFunc2)
> 3. Query the existing UDF again in the same session - it will throw an exception 
> stating that the resource at myFunc2's path couldn't be read
> 4. Even basic operations like insert and select will then fail with the same 
> error
> Result: 
> java.lang.RuntimeException: Failed to read external resource 
> hdfs:///tmp/hari_notexists1/two_udfs.jar
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149)
>  at 
> org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841)
>  at 
> org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23521) SPIP: Standardize SQL logical plans with DataSourceV2

2019-03-05 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-23521:
--
Attachment: SPIP_ Standardize logical plans.pdf

> SPIP: Standardize SQL logical plans with DataSourceV2
> -
>
> Key: SPARK-23521
> URL: https://issues.apache.org/jira/browse/SPARK-23521
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>  Labels: SPIP
> Attachments: SPIP_ Standardize logical plans.pdf
>
>
> Executive Summary: This SPIP is based on [discussion about the DataSourceV2 
> implementation|https://lists.apache.org/thread.html/55676ec1f5039d3deaf347d391cf82fe8574b8fa4eeab70110ed5b2b@%3Cdev.spark.apache.org%3E]
>  on the dev list. The proposal is to standardize the logical plans used for 
> write operations to make the planner more maintainable and to make Spark's 
> write behavior predictable and reliable. It proposes the following principles:
>  # Use well-defined logical plan nodes for all high-level operations: insert, 
> create, CTAS, overwrite table, etc.
>  # Use planner rules that match on these high-level nodes, so that it isn’t 
> necessary to create rules to match each eventual code path individually.
>  # Clearly define Spark’s behavior for these logical plan nodes. Physical 
> nodes should implement that behavior so that all code paths eventually make 
> the same guarantees.
>  # Specialize implementation when creating a physical plan, not logical 
> plans. This will avoid behavior drift and ensure planner code is shared 
> across physical implementations.
> The SPIP doc presents a small but complete set of those high-level logical 
> operations, most of which are already defined in SQL or implemented by some 
> write path in Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23521) SPIP: Standardize SQL logical plans with DataSourceV2

2019-03-05 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784736#comment-16784736
 ] 

Ryan Blue commented on SPARK-23521:
---

I've turned off commenting on the google doc to preserve its state, with the 
existing comments. I'm also adding a PDF of the final proposal to this issue.

> SPIP: Standardize SQL logical plans with DataSourceV2
> -
>
> Key: SPARK-23521
> URL: https://issues.apache.org/jira/browse/SPARK-23521
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>  Labels: SPIP
> Attachments: SPIP_ Standardize logical plans.pdf
>
>
> Executive Summary: This SPIP is based on [discussion about the DataSourceV2 
> implementation|https://lists.apache.org/thread.html/55676ec1f5039d3deaf347d391cf82fe8574b8fa4eeab70110ed5b2b@%3Cdev.spark.apache.org%3E]
>  on the dev list. The proposal is to standardize the logical plans used for 
> write operations to make the planner more maintainable and to make Spark's 
> write behavior predictable and reliable. It proposes the following principles:
>  # Use well-defined logical plan nodes for all high-level operations: insert, 
> create, CTAS, overwrite table, etc.
>  # Use planner rules that match on these high-level nodes, so that it isn’t 
> necessary to create rules to match each eventual code path individually.
>  # Clearly define Spark’s behavior for these logical plan nodes. Physical 
> nodes should implement that behavior so that all code paths eventually make 
> the same guarantees.
>  # Specialize implementation when creating a physical plan, not logical 
> plans. This will avoid behavior drift and ensure planner code is shared 
> across physical implementations.
> The SPIP doc presents a small but complete set of those high-level logical 
> operations, most of which are already defined in SQL or implemented by some 
> write path in Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27067) SPIP: Catalog API for table metadata

2019-03-05 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved SPARK-27067.
---
Resolution: Fixed

I'm resolving this issue because the vote to adopt the proposal passed.

I've added links to the google doc proposal (now view-only) and vote thread, 
and uploaded a copy of the proposal as a PDF.

> SPIP: Catalog API for table metadata
> 
>
> Key: SPARK-27067
> URL: https://issues.apache.org/jira/browse/SPARK-27067
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Priority: Major
>  Labels: SPIP
> Attachments: SPIP_ Spark API for Table Metadata.pdf
>
>
> Goal: Define a catalog API to create, alter, load, and drop tables



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27012) Storage tab shows rdd details even after executor ended

2019-03-05 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-27012.

   Resolution: Fixed
 Assignee: Ajith S
Fix Version/s: 3.0.0

> Storage tab shows rdd details even after executor ended
> ---
>
> Key: SPARK-27012
> URL: https://issues.apache.org/jira/browse/SPARK-27012
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.3, 3.0.0
>Reporter: Ajith S
>Assignee: Ajith S
>Priority: Major
> Fix For: 3.0.0
>
>
>  
> After we cache a table, we can see its details in the Storage tab of the Spark 
> UI. If an executor has shut down (graceful shutdown / dynamic executor 
> allocation scenario), the UI still shows the RDD as cached, and clicking the 
> link throws an error. This is because, on the executor-removed event, we fail 
> to adjust the RDD partition details.
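
For context, a rough way to see the reported behaviour (a sketch under assumed 
settings such as dynamic allocation, not the reporter's exact scenario):

{code:java}
// Hypothetical sketch: cache some data, then let executors go away (e.g. with
// spark.dynamicAllocation.enabled=true and a short
// spark.dynamicAllocation.executorIdleTimeout) and revisit the Storage tab.
val df = spark.range(0, 10000000).toDF("id")
df.cache()
df.count()   // materializes the cached partitions on the executors

// Once the idle executors have been removed, the Storage tab reportedly still
// lists the RDD as cached, and clicking its link throws an error.
{code}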



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13091) Rewrite/Propagate constraints for Aliases

2019-03-05 Thread Ajith S (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784776#comment-16784776
 ] 

Ajith S commented on SPARK-13091:
-

Can this document be made accessible? 
[https://docs.google.com/document/d/1WQRgDurUBV9Y6CWOBS75PQIqJwT-6WftVa18xzm7nCo/edit#heading=h.6hjcndo36qze]

> Rewrite/Propagate constraints for Aliases
> -
>
> Key: SPARK-13091
> URL: https://issues.apache.org/jira/browse/SPARK-13091
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
>Priority: Major
> Fix For: 2.0.0
>
>
> We'd want to duplicate constraints when there is an alias (i.e. for "SELECT 
> a, a AS b", any constraints on a now also apply to b).
> This is a follow-up task based on [~marmbrus]'s suggestion in 
> https://docs.google.com/document/d/1WQRgDurUBV9Y6CWOBS75PQIqJwT-6WftVa18xzm7nCo/edit#heading=h.6hjcndo36qze
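
A small illustration of the idea (a sketch only; the constraint propagation itself 
happens inside the optimizer, not in user code, and the column names are made up):

{code:java}
// With alias-aware constraint propagation, a predicate known to hold for `a`
// is also known to hold for its alias `b`.
val df = spark.range(100).selectExpr("id AS a")
  .where("a > 5")
  .selectExpr("a", "a AS b")
// Conceptually the optimizer can now treat `b > 5` as a valid constraint as
// well, e.g. to prune a redundant `WHERE b > 5` filter added on top of this plan.
{code}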



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26998) spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor processes in Standalone mode

2019-03-05 Thread t oo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784791#comment-16784791
 ] 

t oo commented on SPARK-26998:
--

[~gsomogyi] Please take it forward.

[~kabhwan] The truststore password being shown is not much of a problem, since 
the truststore is often distributed to users anyway. But the keystore password 
still being shown is the big no-no.
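
For context, a minimal sketch of the kind of configuration involved (paths and 
values are placeholders, not real credentials):

{code:java}
import org.apache.spark.SparkConf

// Hypothetical configuration sketch. Per the report, in standalone mode the
// keystore and key passwords end up on the CoarseGrainedExecutorBackend command
// line (visible in `ps -ef`), while only the truststore password is actually
// needed there.
val conf = new SparkConf()
  .set("spark.ssl.enabled", "true")
  .set("spark.ssl.keyStore", "/path/to/keystore.jks")
  .set("spark.ssl.keyStorePassword", "placeholder-keystore-password")
  .set("spark.ssl.keyPassword", "placeholder-key-password")
  .set("spark.ssl.trustStore", "/path/to/truststore.jks")
  .set("spark.ssl.trustStorePassword", "placeholder-truststore-password")
{code}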

> spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor 
> processes in Standalone mode
> ---
>
> Key: SPARK-26998
> URL: https://issues.apache.org/jira/browse/SPARK-26998
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Security, Spark Core
>Affects Versions: 2.3.3, 2.4.0
>Reporter: t oo
>Priority: Major
>  Labels: SECURITY, Security, secur, security, security-issue
>
> Run spark standalone mode, then start a spark-submit requiring at least 1 
> executor. Do a 'ps -ef' on linux (ie putty terminal) and you will be able to 
> see  spark.ssl.keyStorePassword value in plaintext!
>  
> spark.ssl.keyStorePassword and  spark.ssl.keyPassword don't need to be passed 
> to  CoarseGrainedExecutorBackend. Only  spark.ssl.trustStorePassword is used.
>  
> Can be resolved if below PR is merged:
> [[Github] Pull Request #21514 
> (tooptoop4)|https://github.com/apache/spark/pull/21514]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27063) Spark on K8S Integration Tests timeouts are too short for some test clusters

2019-03-05 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784714#comment-16784714
 ] 

Stavros Kontopoulos edited comment on SPARK-27063 at 3/5/19 5:52 PM:
-

Yes, one other thing I noticed is that when the images are pulled this may take 
time and the tests will expire.
Also, in this [PR|https://github.com/apache/spark/pull/23514] I set the patience 
differently because some tests may run too fast, for good or bad.



was (Author: skonto):
Yes, some other things I noticed: when the images are pulled this may 
take time and the tests will expire.
Also, in this [PR|https://github.com/apache/spark/pull/23514] I set the patience 
differently because some tests may run too fast, for good or bad.


> Spark on K8S Integration Tests timeouts are too short for some test clusters
> 
>
> Key: SPARK-27063
> URL: https://issues.apache.org/jira/browse/SPARK-27063
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Rob Vesse
>Priority: Minor
>
> As noted during development for SPARK-26729, there are a couple of integration 
> test timeouts that are too short when running on slower clusters, e.g. 
> developers' laptops, small CI clusters, etc.
> [~skonto] confirmed that he has also experienced this behaviour in the 
> discussion on PR [PR 
> 23846|https://github.com/apache/spark/pull/23846#discussion_r262564938]
> We should raise the defaults of these timeouts as an initial step and, longer 
> term, consider making the timeouts themselves configurable



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27063) Spark on K8S Integration Tests timeouts are too short for some test clusters

2019-03-05 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784714#comment-16784714
 ] 

Stavros Kontopoulos commented on SPARK-27063:
-

Yes, some other things I noticed: when the images are pulled this may 
take time and the tests will expire.
Also, in this [PR|https://github.com/apache/spark/pull/23514] I set the patience 
differently because some tests may run too fast, for good or bad.
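
For reference, a rough sketch of what a more generous patience could look like in 
the integration test code (the values and object name are illustrative, not the 
ones from the PR):

{code:java}
import org.scalatest.concurrent.Eventually
import org.scalatest.time.{Minutes, Seconds, Span}

// Hypothetical sketch: a larger PatienceConfig so that slow clusters or slow
// image pulls do not time a test out before the pods become ready.
object SlowClusterPatience extends Eventually {
  implicit override val patienceConfig: PatienceConfig =
    PatienceConfig(timeout = Span(5, Minutes), interval = Span(5, Seconds))
}
{code}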


> Spark on K8S Integration Tests timeouts are too short for some test clusters
> 
>
> Key: SPARK-27063
> URL: https://issues.apache.org/jira/browse/SPARK-27063
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Rob Vesse
>Priority: Minor
>
> As noted during development for SPARK-26729, there are a couple of integration 
> test timeouts that are too short when running on slower clusters, e.g. 
> developers' laptops, small CI clusters, etc.
> [~skonto] confirmed that he has also experienced this behaviour in the 
> discussion on PR [PR 
> 23846|https://github.com/apache/spark/pull/23846#discussion_r262564938]
> We should raise the defaults of these timeouts as an initial step and, longer 
> term, consider making the timeouts themselves configurable



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27067) SPIP: Catalog API for table metadata

2019-03-05 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-27067:
--
Attachment: SPIP_ Spark API for Table Metadata.pdf

> SPIP: Catalog API for table metadata
> 
>
> Key: SPARK-27067
> URL: https://issues.apache.org/jira/browse/SPARK-27067
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Priority: Major
>  Labels: SPIP
> Attachments: SPIP_ Spark API for Table Metadata.pdf
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27059) spark-submit on kubernetes cluster does not recognise k8s --master property

2019-03-05 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-27059.

Resolution: Invalid

> spark-submit on kubernetes cluster does not recognise k8s --master property
> ---
>
> Key: SPARK-27059
> URL: https://issues.apache.org/jira/browse/SPARK-27059
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.3, 2.4.0
>Reporter: Andreas Adamides
>Priority: Blocker
>
> I have successfully installed a Kubernetes cluster and can verify this by:
> {{C:\windows\system32>kubectl cluster-info }}
>  {{*Kubernetes master is running at https://:* }}
>  *{{KubeDNS is running at 
> https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}*
> Trying to run SparkPi with the Spark release I downloaded from 
> [https://spark.apache.org/downloads.html]. (I tried versions 2.4.0 and 2.3.3)
> *{{spark-submit --master k8s://https://: --deploy-mode cluster 
> --name spark-pi --class org.apache.spark.examples.SparkPi --conf 
> spark.executor.instances=2 --conf 
> spark.kubernetes.container.image=gettyimages/spark 
> c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}*
> I am getting this error:
> *{{Error: Master must either be yarn or start with spark, mesos, local Run 
> with --help for usage help or --verbose for debug output}}*
> I also tried:
> *{{spark-submit --help}}*
> to see what I can get regarding the *--master* property. This is what I get:
> *{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or 
> local.}}*
>  
> According to the documentation 
> [[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on 
> running Spark workloads in Kubernetes, spark-submit does not even seem to 
> recognise the k8s value for master. [ included in possible Spark masters: 
> [https://spark.apache.org/docs/latest/submitting-applications.html#master-urls]
>  ]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27059) spark-submit on kubernetes cluster does not recognise k8s --master property

2019-03-05 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784762#comment-16784762
 ] 

Marcelo Vanzin commented on SPARK-27059:


Sounds like a problem with your system. Maybe your PATH has the wrong 
{{spark-submit}} in it.

> spark-submit on kubernetes cluster does not recognise k8s --master property
> ---
>
> Key: SPARK-27059
> URL: https://issues.apache.org/jira/browse/SPARK-27059
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.3, 2.4.0
>Reporter: Andreas Adamides
>Priority: Blocker
>
> I have successfully installed a Kubernetes cluster and can verify this by:
> {{C:\windows\system32>kubectl cluster-info }}
>  {{*Kubernetes master is running at https://:* }}
>  *{{KubeDNS is running at 
> https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}*
> Trying to run SparkPi with the Spark release I downloaded from 
> [https://spark.apache.org/downloads.html]. (I tried versions 2.4.0 and 2.3.3)
> *{{spark-submit --master k8s://https://: --deploy-mode cluster 
> --name spark-pi --class org.apache.spark.examples.SparkPi --conf 
> spark.executor.instances=2 --conf 
> spark.kubernetes.container.image=gettyimages/spark 
> c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}*
> I am getting this error:
> *{{Error: Master must either be yarn or start with spark, mesos, local Run 
> with --help for usage help or --verbose for debug output}}*
> I also tried:
> *{{spark-submit --help}}*
> to see what I can get regarding the *--master* property. This is what I get:
> *{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or 
> local.}}*
>  
> According to the documentation 
> [[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on 
> running Spark workloads in Kubernetes, spark-submit does not even seem to 
> recognise the k8s value for master. [ included in possible Spark masters: 
> [https://spark.apache.org/docs/latest/submitting-applications.html#master-urls]
>  ]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27059) spark-submit on kubernetes cluster does not recognise k8s --master property

2019-03-05 Thread Andreas Adamides (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784758#comment-16784758
 ] 

Andreas Adamides commented on SPARK-27059:
--

Indeed, with Spark 2.4.0 and 2.3.3, running

*spark-submit --version*

returns "version 2.2.1" (and so does spark-shell).

So if not from the official Spark download page, where would I download the 
latest advertised Spark version that supports Kubernetes?

> spark-submit on kubernetes cluster does not recognise k8s --master property
> ---
>
> Key: SPARK-27059
> URL: https://issues.apache.org/jira/browse/SPARK-27059
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.3, 2.4.0
>Reporter: Andreas Adamides
>Priority: Blocker
>
> I have successfully installed a Kubernetes cluster and can verify this by:
> {{C:\windows\system32>kubectl cluster-info }}
>  {{*Kubernetes master is running at https://:* }}
>  *{{KubeDNS is running at 
> https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}*
> Trying to run SparkPi with the Spark release I downloaded from 
> [https://spark.apache.org/downloads.html]. (I tried versions 2.4.0 and 2.3.3)
> *{{spark-submit --master k8s://https://: --deploy-mode cluster 
> --name spark-pi --class org.apache.spark.examples.SparkPi --conf 
> spark.executor.instances=2 --conf 
> spark.kubernetes.container.image=gettyimages/spark 
> c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}*
> I am getting this error:
> *{{Error: Master must either be yarn or start with spark, mesos, local Run 
> with --help for usage help or --verbose for debug output}}*
> I also tried:
> *{{spark-submit --help}}*
> to see what I can get regarding the *--master* property. This is what I get:
> *{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or 
> local.}}*
>  
> According to the documentation 
> [[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on 
> running Spark workloads in Kubernetes, spark-submit does not even seem to 
> recognise the k8s value for master. [ included in possible Spark masters: 
> [https://spark.apache.org/docs/latest/submitting-applications.html#master-urls]
>  ]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26928) Add driver CPU Time to the metrics system

2019-03-05 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-26928.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23838
[https://github.com/apache/spark/pull/23838]

> Add driver CPU Time to the metrics system
> -
>
> Key: SPARK-26928
> URL: https://issues.apache.org/jira/browse/SPARK-26928
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
> Fix For: 3.0.0
>
>
> This proposes to add instrumentation for the driver's JVM CPU time via the 
> Spark Dropwizard/Codahale metrics system. It follows directly from previous 
> work SPARK-25228 and shares similar motivations: it is intended as an 
> improvement to be used for Spark performance dashboards and monitoring 
> tools/instrumentation.
> Additionally this proposes a new configuration parameter 
> `spark.metrics.cpu.time.driver.enabled` (default: false) that can be used to 
> turn on the new feature.
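
A minimal sketch of turning this on (assuming the configuration key named in the 
description above; the rest of the session setup is illustrative, and a metrics 
sink still has to be configured in metrics.properties as usual):

{code:java}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: enable the new driver JVM CPU time metric via the
// configuration key from the description.
val conf = new SparkConf()
  .set("spark.metrics.cpu.time.driver.enabled", "true")

val spark = SparkSession.builder()
  .appName("driver-cpu-time-metric")
  .config(conf)
  .getOrCreate()
{code}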



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26928) Add driver CPU Time to the metrics system

2019-03-05 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-26928:
--

Assignee: Luca Canali

> Add driver CPU Time to the metrics system
> -
>
> Key: SPARK-26928
> URL: https://issues.apache.org/jira/browse/SPARK-26928
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
>
> This proposes to add instrumentation for the driver's JVM CPU time via the 
> Spark Dropwizard/Codahale metrics system. It follows directly from previous 
> work SPARK-25228 and shares similar motivations: it is intended as an 
> improvement to be used for Spark performance dashboards and monitoring 
> tools/instrumentation.
> Additionally this proposes a new configuration parameter 
> `spark.metrics.cpu.time.driver.enabled` (default: false) that can be used to 
> turn on the new feature.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27043) Nested schema pruning benchmark for ORC

2019-03-05 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27043.
---
   Resolution: Fixed
 Assignee: Liang-Chi Hsieh
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/23955

> Nested schema pruning benchmark for ORC
> ---
>
> Key: SPARK-27043
> URL: https://issues.apache.org/jira/browse/SPARK-27043
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 3.0.0
>
>
> We have a benchmark for nested schema pruning, but only for Parquet. This adds 
> a similar benchmark for ORC, to be used with nested schema pruning of ORC.
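
For context, a rough sketch of the kind of query such a benchmark exercises (the 
config key is the existing nested schema pruning flag; the file path and nested 
column are made up for illustration):

{code:java}
// Hypothetical sketch: with nested schema pruning enabled, selecting a single
// nested field from an ORC file should only need to read that column.
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")

val df = spark.read.orc("/tmp/example/nested_data.orc")  // made-up path
df.select("payload.id")                                   // one nested field
  .count()
{code}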



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27063) Spark on K8S Integration Tests timeouts are too short for some test clusters

2019-03-05 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784714#comment-16784714
 ] 

Stavros Kontopoulos edited comment on SPARK-27063 at 3/5/19 5:53 PM:
-

Yes, one other thing I noticed is that when the images are pulled this may take 
time and the tests will expire (if you don't use the local daemon to build stuff, 
for whatever reason).
Also, in this [PR|https://github.com/apache/spark/pull/23514] I set the patience 
differently because some tests may run too fast, for good or bad.



was (Author: skonto):
Yes, one other thing I noticed is that when the images are pulled this may take 
time and the tests will expire.
Also, in this [PR|https://github.com/apache/spark/pull/23514] I set the patience 
differently because some tests may run too fast, for good or bad.


> Spark on K8S Integration Tests timeouts are too short for some test clusters
> 
>
> Key: SPARK-27063
> URL: https://issues.apache.org/jira/browse/SPARK-27063
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Rob Vesse
>Priority: Minor
>
> As noted during development for SPARK-26729, there are a couple of integration 
> test timeouts that are too short when running on slower clusters, e.g. 
> developers' laptops, small CI clusters, etc.
> [~skonto] confirmed that he has also experienced this behaviour in the 
> discussion on PR [PR 
> 23846|https://github.com/apache/spark/pull/23846#discussion_r262564938]
> We should raise the defaults of these timeouts as an initial step and, longer 
> term, consider making the timeouts themselves configurable



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27066) SPIP: Identifiers for multi-catalog support

2019-03-05 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-27066:
--
Attachment: SPIP_ Identifiers for multi-catalog Spark.pdf

> SPIP: Identifiers for multi-catalog support
> ---
>
> Key: SPARK-27066
> URL: https://issues.apache.org/jira/browse/SPARK-27066
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Priority: Major
>  Labels: SPIP
> Attachments: SPIP_ Identifiers for multi-catalog Spark.pdf
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27067) SPIP: Catalog API for table metadata

2019-03-05 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-27067:
-

 Summary: SPIP: Catalog API for table metadata
 Key: SPARK-27067
 URL: https://issues.apache.org/jira/browse/SPARK-27067
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ryan Blue






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27066) SPIP: Identifiers for multi-catalog support

2019-03-05 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved SPARK-27066.
---
Resolution: Fixed

I'm resolving this issue because the vote to adopt the proposal passed.

I've added links to the google doc proposal (now view-only) and vote thread, 
and uploaded a copy of the proposal as a PDF.

> SPIP: Identifiers for multi-catalog support
> ---
>
> Key: SPARK-27066
> URL: https://issues.apache.org/jira/browse/SPARK-27066
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Priority: Major
>  Labels: SPIP
> Attachments: SPIP_ Identifiers for multi-catalog Spark.pdf
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27066) SPIP: Identifiers for multi-catalog support

2019-03-05 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-27066:
-

 Summary: SPIP: Identifiers for multi-catalog support
 Key: SPARK-27066
 URL: https://issues.apache.org/jira/browse/SPARK-27066
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ryan Blue






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27067) SPIP: Catalog API for table metadata

2019-03-05 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-27067:
--
Description: Goal: Define a catalog API to create, alter, load, and drop 
tables

> SPIP: Catalog API for table metadata
> 
>
> Key: SPARK-27067
> URL: https://issues.apache.org/jira/browse/SPARK-27067
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Priority: Major
>  Labels: SPIP
> Attachments: SPIP_ Spark API for Table Metadata.pdf
>
>
> Goal: Define a catalog API to create, alter, load, and drop tables



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27066) SPIP: Identifiers for multi-catalog support

2019-03-05 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-27066:
--
Description: 
Goals:
 * Propose semantics for identifiers and a listing API to support multiple 
catalogs
 ** Support any namespace scheme used by an external catalog
 ** Avoid traversing namespaces via multiple listing calls from Spark
 * Outline migration from the current behavior to Spark with multiple catalogs

> SPIP: Identifiers for multi-catalog support
> ---
>
> Key: SPARK-27066
> URL: https://issues.apache.org/jira/browse/SPARK-27066
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Priority: Major
>  Labels: SPIP
> Attachments: SPIP_ Identifiers for multi-catalog Spark.pdf
>
>
> Goals:
>  * Propose semantics for identifiers and a listing API to support multiple 
> catalogs
>  ** Support any namespace scheme used by an external catalog
>  ** Avoid traversing namespaces via multiple listing calls from Spark
>  * Outline migration from the current behavior to Spark with multiple catalogs



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27062) Refresh Table command register table with table name only

2019-03-05 Thread William Wong (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Wong updated SPARK-27062:
-
Priority: Minor  (was: Major)

> Refresh Table command register table with table name only
> -
>
> Key: SPARK-27062
> URL: https://issues.apache.org/jira/browse/SPARK-27062
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: William Wong
>Priority: Minor
>  Labels: easyfix, pull-request-available
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> If the CatalogImpl.refreshTable() method is invoked against a cached table, it 
> first uncaches the corresponding query in the shared-state cache manager and 
> then caches it back to refresh the cached copy. 
> However, the table is recached with only the table name; the database name is 
> dropped. Therefore, if the cached table is not in the default database, the 
> recreated cache may refer to a different table. For example, the cached 
> table's name shown on the driver's Storage page may change after the table is 
> refreshed. 
>  
> Here is the related code on GitHub for reference. 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala]
>  
>  
> {code:java}
> override def refreshTable(tableName: String): Unit = {
>   val tableIdent = 
> sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
>   val tableMetadata = 
> sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent)
>   val table = sparkSession.table(tableIdent)
>   if (tableMetadata.tableType == CatalogTableType.VIEW) {
> // Temp or persistent views: refresh (or invalidate) any metadata/data 
> cached
> // in the plan recursively.
> table.queryExecution.analyzed.refresh()
>   } else {
> // Non-temp tables: refresh the metadata cache.
> sessionCatalog.refreshTable(tableIdent)
>   }
>   // If this table is cached as an InMemoryRelation, drop the original
>   // cached version and make the new version cached lazily.
>   if (isCached(table)) {
> // Uncache the logicalPlan.
> sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, 
> blocking = true)
> // Cache it again.
> sparkSession.sharedState.cacheManager.cacheQuery(table, 
> Some(tableIdent.table))
>   }
> }
> {code}
>  
>  
> In the Spark SQL module, the database name is registered together with the 
> table name when the "CACHE TABLE" command is executed. 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/cache.scala]
>  
> and CatalogImpl registers the cache with the received table name. 
> {code:java}
> override def cacheTable(tableName: String): Unit = {
> sparkSession.sharedState.cacheManager.cacheQuery(sparkSession.table(tableName),
>  Some(tableName)) }
> {code}
>  
> Therefore, I would like to propose aligning the behavior: the refreshTable 
> method should reuse the received table name instead. 
>  
> {code:java}
> sparkSession.sharedState.cacheManager.cacheQuery(table, 
> Some(tableIdent.table))
> {code}
> to 
> {code:java}
> sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableName))
>  {code}
>  
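
For what it's worth, a quick sketch of how the mismatch can be observed (the 
database and table names are made up; `spark` is an existing SparkSession with 
Hive support):

{code:java}
// Hypothetical reproduction sketch for the name the cache is registered under.
spark.sql("CREATE DATABASE IF NOT EXISTS mydb")
spark.sql("CREATE TABLE IF NOT EXISTS mydb.events (id INT)")

spark.catalog.cacheTable("mydb.events")    // registered as "mydb.events"
spark.catalog.refreshTable("mydb.events")  // re-registered as just "events"

// The driver UI's Storage page now reportedly shows "events" instead of
// "mydb.events", as described above.
{code}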



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27062) CatalogImpl.refreshTable should register query in cache with received tableName

2019-03-05 Thread William Wong (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Wong updated SPARK-27062:
-
Description: 
If the CatalogImpl.refreshTable() method is invoked against a cached table, it 
first uncaches the corresponding query in the shared-state cache manager and 
then caches it back to refresh the cached copy. 

However, the table is recached with only the table name; the database name is 
dropped. Therefore, if the cached table is not in the default database, the 
recreated cache may refer to a different table. For example, the cached table's 
name shown on the driver's Storage page may change after the table is 
refreshed. 

 

Here is the related code on GitHub for reference. 

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala]
 

 
{code:java}
override def refreshTable(tableName: String): Unit = {
  val tableIdent = 
sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
  val tableMetadata = 
sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent)
  val table = sparkSession.table(tableIdent)

  if (tableMetadata.tableType == CatalogTableType.VIEW) {
// Temp or persistent views: refresh (or invalidate) any metadata/data 
cached
// in the plan recursively.
table.queryExecution.analyzed.refresh()
  } else {
// Non-temp tables: refresh the metadata cache.
sessionCatalog.refreshTable(tableIdent)
  }

  // If this table is cached as an InMemoryRelation, drop the original
  // cached version and make the new version cached lazily.
  if (isCached(table)) {
// Uncache the logicalPlan.
sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, 
blocking = true)
// Cache it again.
sparkSession.sharedState.cacheManager.cacheQuery(table, 
Some(tableIdent.table))
  }
}
{code}
 

 Actually, CatalogImpl caches the table with the received tableName, instead of 
only the parsed table name (tableIdent.table). 
{code:java}
override def cacheTable(tableName: String): Unit = {
sparkSession.sharedState.cacheManager.cacheQuery(sparkSession.table(tableName), 
Some(tableName)) }
{code}
 

Therefore, I would like to propose aligning the behavior: the refreshTable 
method should reuse the received tableName. Here are the proposed changes.

 
{code:java}
sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table))
{code}
to 
{code:java}
sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableName))
 {code}
 

  was:
If the CatalogImpl.refreshTable() method is invoked against a cached table, it 
first uncaches the corresponding query in the shared-state cache manager and 
then caches it back to refresh the cached copy. 

However, the table is recached with only the table name; the database name is 
dropped. Therefore, if the cached table is not in the default database, the 
recreated cache may refer to a different table. For example, the cached table's 
name shown on the driver's Storage page may change after the table is 
refreshed. 

 

Here is the related code on GitHub for reference. 

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala]
 

 
{code:java}
override def refreshTable(tableName: String): Unit = {
  val tableIdent = 
sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
  val tableMetadata = 
sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent)
  val table = sparkSession.table(tableIdent)

  if (tableMetadata.tableType == CatalogTableType.VIEW) {
// Temp or persistent views: refresh (or invalidate) any metadata/data 
cached
// in the plan recursively.
table.queryExecution.analyzed.refresh()
  } else {
// Non-temp tables: refresh the metadata cache.
sessionCatalog.refreshTable(tableIdent)
  }

  // If this table is cached as an InMemoryRelation, drop the original
  // cached version and make the new version cached lazily.
  if (isCached(table)) {
// Uncache the logicalPlan.
sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, 
blocking = true)
// Cache it again.
sparkSession.sharedState.cacheManager.cacheQuery(table, 
Some(tableIdent.table))
  }
}
{code}
 

 

In the Spark SQL module, the database name is registered together with the 
table name when the "CACHE TABLE" command is executed. 

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/cache.scala]
 

and CatalogImpl registers the cache with the received table name. 
{code:java}
override def cacheTable(tableName: String): Unit = {
sparkSession.sharedState.cacheManager.cacheQuery(sparkSession.table(tableName), 
Some(tableName)) }
{code}
 

Therefore, I would like to propose aligning the behavior: the refreshTable 
method should reuse the received table name instead. 

 
{code:java}
sparkSession.sharedState.cacheManager.cacheQuery(table, 

[jira] [Assigned] (SPARK-27065) avoid more than one active task set managers for a stage

2019-03-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27065:


Assignee: Apache Spark  (was: Wenchen Fan)

> avoid more than one active task set managers for a stage
> 
>
> Key: SPARK-27065
> URL: https://issues.apache.org/jira/browse/SPARK-27065
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.3, 2.4.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27065) avoid more than one active task set managers for a stage

2019-03-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27065:


Assignee: Wenchen Fan  (was: Apache Spark)

> avoid more than one active task set managers for a stage
> 
>
> Key: SPARK-27065
> URL: https://issues.apache.org/jira/browse/SPARK-27065
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.3, 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27062) CatalogImpl.refreshTable should register query in cache with received tableName

2019-03-05 Thread William Wong (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Wong updated SPARK-27062:
-
Description: 
If the _CatalogImpl.refreshTable()_ method is invoked against a cached table, it 
first uncaches the corresponding query in the shared-state cache manager and 
then caches it back to refresh the cached copy. 

However, the table is recached with only the table name; the database name is 
dropped. Therefore, if the cached table is not in the default database, the 
recreated cache may refer to a different table. For example, the cached table's 
name shown on the driver's Storage page may change after the table is 
refreshed. 

Here is the related code on GitHub for reference. 

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala]
 

 
{code:java}
override def refreshTable(tableName: String): Unit = {
  val tableIdent = 
sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
  val tableMetadata = 
sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent)
  val table = sparkSession.table(tableIdent)

  if (tableMetadata.tableType == CatalogTableType.VIEW) {
// Temp or persistent views: refresh (or invalidate) any metadata/data 
cached
// in the plan recursively.
table.queryExecution.analyzed.refresh()
  } else {
// Non-temp tables: refresh the metadata cache.
sessionCatalog.refreshTable(tableIdent)
  }

  // If this table is cached as an InMemoryRelation, drop the original
  // cached version and make the new version cached lazily.
  if (isCached(table)) {
// Uncache the logicalPlan.
sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, 
blocking = true)
// Cache it again.
sparkSession.sharedState.cacheManager.cacheQuery(table, 
Some(tableIdent.table))
  }
}
{code}
 

 CatalogImpl caches the table with the received _tableName_, instead of 
_tableIdent.table_.
{code:java}
override def cacheTable(tableName: String): Unit = {
sparkSession.sharedState.cacheManager.cacheQuery(sparkSession.table(tableName), 
Some(tableName)) }
{code}
 

Therefore, I would like to propose aligning the behavior: the refreshTable 
method should reuse the received _tableName_. Here is the proposed change.

 
{code:java}
sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table))
{code}
to 
{code:java}
sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableName)){code}
 

  was:
If the CatalogImpl.refreshTable() method is invoked against a cached table, it 
first uncaches the corresponding query in the shared-state cache manager and 
then caches it back to refresh the cached copy. 

However, the table is recached with only the table name; the database name is 
dropped. Therefore, if the cached table is not in the default database, the 
recreated cache may refer to a different table. For example, the cached table's 
name shown on the driver's Storage page may change after the table is 
refreshed. 

 

Here is the related code on GitHub for reference. 

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala]
 

 
{code:java}
override def refreshTable(tableName: String): Unit = {
  val tableIdent = 
sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
  val tableMetadata = 
sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent)
  val table = sparkSession.table(tableIdent)

  if (tableMetadata.tableType == CatalogTableType.VIEW) {
// Temp or persistent views: refresh (or invalidate) any metadata/data 
cached
// in the plan recursively.
table.queryExecution.analyzed.refresh()
  } else {
// Non-temp tables: refresh the metadata cache.
sessionCatalog.refreshTable(tableIdent)
  }

  // If this table is cached as an InMemoryRelation, drop the original
  // cached version and make the new version cached lazily.
  if (isCached(table)) {
// Uncache the logicalPlan.
sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, 
blocking = true)
// Cache it again.
sparkSession.sharedState.cacheManager.cacheQuery(table, 
Some(tableIdent.table))
  }
}
{code}
 

 Actually, CatalogImpl cache table with received table name, instead of only 
the table name. 
{code:java}
override def cacheTable(tableName: String): Unit = {
sparkSession.sharedState.cacheManager.cacheQuery(sparkSession.table(tableName), 
Some(tableName)) }
{code}
 

Therefore, I would like to propose aligning the behavior. RefreshTable method 
should reuse the received tableName. Here is the proposed changes.

 
{code:java}
sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table))
{code}
to 
{code:java}
sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableName))
 {code}
 


> CatalogImpl.refreshTable should register query in cache with 

[jira] [Commented] (SPARK-27065) avoid more than one active task set managers for a stage

2019-03-05 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784654#comment-16784654
 ] 

Apache Spark commented on SPARK-27065:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/23927

> avoid more than one active task set managers for a stage
> 
>
> Key: SPARK-27065
> URL: https://issues.apache.org/jira/browse/SPARK-27065
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.3, 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27005) Design sketch: Accelerator-aware scheduling

2019-03-05 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784555#comment-16784555
 ] 

Thomas Graves edited comment on SPARK-27005 at 3/5/19 3:40 PM:
---

So we have both a Google design doc and the comment above; can you consolidate them into one place? The Google doc might be easier to comment on. I added comments to the Google doc.


was (Author: tgraves):
so we have both a google design doc and the comment above, can you consolidate 
into 1 place?  the google doc might be easier to comment on.

> Design sketch: Accelerator-aware scheduling
> ---
>
> Key: SPARK-27005
> URL: https://issues.apache.org/jira/browse/SPARK-27005
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xingbo Jiang
>Priority: Major
>
> This task is to outline a design sketch for the accelerator-aware scheduling 
> SPIP discussion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27060) DDL Commands are accepting Keywords like create, drop as tableName

2019-03-05 Thread Sachin Ramachandra Setty (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784575#comment-16784575
 ] 

Sachin Ramachandra Setty edited comment on SPARK-27060 at 3/5/19 3:40 PM:
--

I verified this issue with Spark 2.3.2 and Spark 2.4.0.


was (Author: sachin1729):
I verified this issue with 2.3.2 and 2.4.0 .

> DDL Commands are accepting Keywords like create, drop as tableName
> --
>
> Key: SPARK-27060
> URL: https://issues.apache.org/jira/browse/SPARK-27060
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Sachin Ramachandra Setty
>Priority: Minor
>
> Seems to be a compatibility issue compared to other components such as hive 
> and mySql. 
> DDL commands are successful even though the tableName is same as keyword. 
> Tested with columnNames as well and issue exists. 
> Whereas, Hive-Beeline is throwing ParseException and not accepting keywords 
> as tableName or columnName and mySql is accepting keywords only as columnName.
> Spark-Behaviour :
> Connected to: Spark SQL (version 2.3.2.0101)
> CLI_DBMS_APPID
> Beeline version 1.2.1.spark_2.3.2.0101 by Apache Hive
> 0: jdbc:hive2://10.18.3.XXX:23040/default> create table create(id int);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.255 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> create table drop(int int);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.257 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table drop;
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.236 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table create;
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.168 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> create table tab1(float float);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.111 seconds)
> 0: jdbc:hive2://10.18.XXX:23040/default> create table double(double float);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.093 seconds)
> Hive-Behaviour :
> Connected to: Apache Hive (version 3.1.0)
> Driver: Hive JDBC (version 3.1.0)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> Beeline version 3.1.0 by Apache Hive
> 0: jdbc:hive2://10.18.XXX:21066/> create table create(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:13 
> cannot recognize input near 'create' '(' 'id' in table name 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18.XXX:21066/> create table drop(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:13 
> cannot recognize input near 'drop' '(' 'id' in table name 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18XXX:21066/> create table tab1(float float);
> Error: Error while compiling statement: FAILED: ParseException line 1:18 
> cannot recognize input near 'float' 'float' ')' in column name or constraint 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18XXX:21066/> drop table create(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:11 
> cannot recognize input near 'create' '(' 'id' in table name 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18.XXX:21066/> drop table drop(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:11 
> cannot recognize input near 'drop' '(' 'id' in table name 
> (state=42000,code=4)
> mySql :
> CREATE TABLE CREATE(ID integer);
> Error: near "CREATE": syntax error
> CREATE TABLE DROP(ID integer);
> Error: near "DROP": syntax error
> CREATE TABLE TAB1(FLOAT FLOAT);
> Success



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27036) Even Broadcast thread is timed out, BroadCast Job is not aborted.

2019-03-05 Thread Sujith Chacko (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782840#comment-16782840
 ] 

Sujith Chacko edited comment on SPARK-27036 at 3/5/19 3:49 PM:
---

The problem area seems to be BroadcastExchangeExec in the driver, where a job is fired as part of a Future and the collected data is then broadcasted.

The main problem is that the system submits the job and its respective stages/tasks through the DAGScheduler, where the scheduler thread schedules the respective events. In BroadcastExchangeExec, when the future times out, the corresponding exception is thrown, but the jobs/tasks that the DAGScheduler scheduled as part of the action called in the future are not cancelled. I think we should cancel the respective job so that it does not keep running in the background after the future timeout; this would terminate the job promptly when the TimeoutException happens and also save the additional resources that would otherwise be consumed after the timeout exception is thrown from the driver.

I want to give an attempt at handling this issue; any comments or suggestions are welcome.

cc [~b...@cloudera.com] [~hvanhovell] [~srowen]
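
For illustration only, a minimal runnable sketch of the idea (the names, job group and structure are assumptions, not the actual BroadcastExchangeExec code): run the collect-for-broadcast inside a Future under a dedicated job group, and cancel that group when waiting for the Future times out, so the job does not keep running in the background.
{code:java}
import java.util.UUID
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import org.apache.spark.sql.SparkSession

object BroadcastTimeoutSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("broadcast-timeout-sketch").getOrCreate()
    implicit val ec: ExecutionContext = ExecutionContext.global

    val jobGroupId = s"broadcast-sketch-${UUID.randomUUID()}" // hypothetical group id
    val future = Future {
      // Set the job group on the thread that actually submits the job,
      // so the collect below can be cancelled as a group.
      spark.sparkContext.setJobGroup(jobGroupId, "collect for broadcast", interruptOnCancel = true)
      spark.range(0L, 2000000000L).selectExpr("sum(id)").collect()
    }

    try {
      Await.result(future, 2.seconds) // stands in for spark.sql.broadcastTimeout
    } catch {
      case _: TimeoutException =>
        // Without this cancel, the collect job keeps running after the timeout.
        spark.sparkContext.cancelJobGroup(jobGroupId)
    } finally {
      spark.stop()
    }
  }
}
{code}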


was (Author: s71955):
It seems to be the problem area is   BroadcastExchangeExec  in driver where  as 
part of Future a particular job will be fired and collected data will be 
broadcasted. 

The main problem is system will submit the job and its respective stage/tasks 
through DAGScheduler,  where the scheduler thread will schedule the respective 
events , In BroadcastExchangeExec when future time out happens respective 
exception will thrown but the jobs/task which is  scheduled by  the  
DAGScheduler as part of the action called in future will not be cancelled, I 
think we shall cancel the respective job to avoid  running the same in  
background even after Future time out exception, this can help to terminate the 
job promptly when TimeOutException happens, this will also save the additional 
resources getting utilized even after timeout exception thrown from driver. 

I want to give an attempt to handle this issue, Any comments suggestions are 
welcome.

cc [~sro...@scient.com] [~b...@cloudera.com] [~hvanhovell]

> Even Broadcast thread is timed out, BroadCast Job is not aborted.
> -
>
> Key: SPARK-27036
> URL: https://issues.apache.org/jira/browse/SPARK-27036
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Babulal
>Priority: Minor
> Attachments: image-2019-03-04-00-38-52-401.png, 
> image-2019-03-04-00-39-12-210.png, image-2019-03-04-00-39-38-779.png
>
>
> During broadcast table job is execution if broadcast timeout 
> (spark.sql.broadcastTimeout) happens ,broadcast Job still continue till 
> completion whereas it should abort on broadcast timeout.
> Exception is thrown in console  but Spark Job is still continue.
>  
> !image-2019-03-04-00-39-38-779.png!
> !image-2019-03-04-00-39-12-210.png!
>  
>  wait for some time
> !image-2019-03-04-00-38-52-401.png!
> !image-2019-03-04-00-34-47-884.png!
>  
> How to Reproduce Issue
> Option1 using SQL:- 
>  create Table t1(Big Table,1M Records)
>  val rdd1=spark.sparkContext.parallelize(1 to 100,100).map(x=> 
> ("name_"+x,x%3,x))
>  val df=rdd1.toDF.selectExpr("_1 as name","_2 as age","_3 as sal","_1 as 
> c1","_1 as c2","_1 as c3","_1 as c4","_1 as c5","_1 as c6","_1 as c7","_1 as 
> c8","_1 as c9","_1 as c10","_1 as c11","_1 as c12","_1 as c13","_1 as 
> c14","_1 as c15","_1 as c16","_1 as c17","_1 as c18","_1 as c19","_1 as 
> c20","_1 as c21","_1 as c22","_1 as c23","_1 as c24","_1 as c25","_1 as 
> c26","_1 as c27","_1 as c28","_1 as c29","_1 as c30")
>  df.write.csv("D:/data/par1/t4");
>  spark.sql("create table csv_2 using csv options('path'='D:/data/par1/t4')");
> create Table t2(Small Table,100K records)
>  val rdd1=spark.sparkContext.parallelize(1 to 10,100).map(x=> 
> ("name_"+x,x%3,x))
>  val df=rdd1.toDF.selectExpr("_1 as name","_2 as age","_3 as sal","_1 as 
> c1","_1 as c2","_1 as c3","_1 as c4","_1 as c5","_1 as c6","_1 as c7","_1 as 
> c8","_1 as c9","_1 as c10","_1 as c11","_1 as c12","_1 as c13","_1 as 
> c14","_1 as c15","_1 as c16","_1 as c17","_1 as c18","_1 as c19","_1 as 
> c20","_1 as c21","_1 as c22","_1 as c23","_1 as c24","_1 as c25","_1 as 
> c26","_1 as c27","_1 as c28","_1 as c29","_1 as c30")
>  df.write.csv("D:/data/par1/t4");
>  spark.sql("create table csv_2 using csv options('path'='D:/data/par1/t5')");
> spark.sql("set spark.sql.autoBroadcastJoinThreshold=73400320").show(false)
>  spark.sql("set spark.sql.broadcastTimeout=2").show(false)
>  Run Below Query 
>  spark.sql("create table s using parquet as select t1.* from csv_2 as 
> t1,csv_1 as t2 where 

[jira] [Updated] (SPARK-27062) Refresh Table command register table with table name only

2019-03-05 Thread William Wong (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Wong updated SPARK-27062:
-
Labels: easyfix pull-request-available  (was: easyfix)

> Refresh Table command register table with table name only
> -
>
> Key: SPARK-27062
> URL: https://issues.apache.org/jira/browse/SPARK-27062
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: William Wong
>Priority: Major
>  Labels: easyfix, pull-request-available
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> If CatalogImpl.refreshTable() method is invoked against a cached table, this 
> method would first uncache corresponding query in the shared state cache 
> manager, and then cache it back to refresh the cache copy. 
> However, the table was recached with only 'table name'. The database name 
> will be missed. Therefore, if cached table is not on the default database, 
> the recreated cache may refer to a different table. For example, we may see 
> the cached table name in driver's storage page will be changed after table 
> refreshing. 
>  
> Here is related code on github for your reference. 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala]
>  
>  
> {code:java}
> override def refreshTable(tableName: String): Unit = {
>   val tableIdent = 
> sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
>   val tableMetadata = 
> sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent)
>   val table = sparkSession.table(tableIdent)
>   if (tableMetadata.tableType == CatalogTableType.VIEW) {
> // Temp or persistent views: refresh (or invalidate) any metadata/data 
> cached
> // in the plan recursively.
> table.queryExecution.analyzed.refresh()
>   } else {
> // Non-temp tables: refresh the metadata cache.
> sessionCatalog.refreshTable(tableIdent)
>   }
>   // If this table is cached as an InMemoryRelation, drop the original
>   // cached version and make the new version cached lazily.
>   if (isCached(table)) {
> // Uncache the logicalPlan.
> sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, 
> blocking = true)
> // Cache it again.
> sparkSession.sharedState.cacheManager.cacheQuery(table, 
> Some(tableIdent.table))
>   }
> }
> {code}
>  
>  
> In Spark SQL module, the database name is registered together with table name 
> when "CACHE TABLE" command was executed. 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/cache.scala]
>  
> and  CatalogImpl register cache with received table name. 
> {code:java}
> override def cacheTable(tableName: String): Unit = {
> sparkSession.sharedState.cacheManager.cacheQuery(sparkSession.table(tableName),
>  Some(tableName)) }
> {code}
>  
> Therefore, I would like to propose aligning the behavior. RefreshTable method 
> should reuse the received table name instead. 
>  
> {code:java}
> sparkSession.sharedState.cacheManager.cacheQuery(table, 
> Some(tableIdent.table))
> {code}
> to 
> {code:java}
> sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableName))
>  {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27062) Refresh Table command register table with table name only

2019-03-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27062:


Assignee: (was: Apache Spark)

> Refresh Table command register table with table name only
> -
>
> Key: SPARK-27062
> URL: https://issues.apache.org/jira/browse/SPARK-27062
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: William Wong
>Priority: Major
>  Labels: easyfix
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> If CatalogImpl.refreshTable() method is invoked against a cached table, this 
> method would first uncache corresponding query in the shared state cache 
> manager, and then cache it back to refresh the cache copy. 
> However, the table was recached with only 'table name'. The database name 
> will be missed. Therefore, if cached table is not on the default database, 
> the recreated cache may refer to a different table. For example, we may see 
> the cached table name in driver's storage page will be changed after table 
> refreshing. 
>  
> Here is related code on github for your reference. 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala]
>  
>  
> {code:java}
> override def refreshTable(tableName: String): Unit = {
>   val tableIdent = 
> sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
>   val tableMetadata = 
> sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent)
>   val table = sparkSession.table(tableIdent)
>   if (tableMetadata.tableType == CatalogTableType.VIEW) {
> // Temp or persistent views: refresh (or invalidate) any metadata/data 
> cached
> // in the plan recursively.
> table.queryExecution.analyzed.refresh()
>   } else {
> // Non-temp tables: refresh the metadata cache.
> sessionCatalog.refreshTable(tableIdent)
>   }
>   // If this table is cached as an InMemoryRelation, drop the original
>   // cached version and make the new version cached lazily.
>   if (isCached(table)) {
> // Uncache the logicalPlan.
> sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, 
> blocking = true)
> // Cache it again.
> sparkSession.sharedState.cacheManager.cacheQuery(table, 
> Some(tableIdent.table))
>   }
> }
> {code}
>  
>  
> In Spark SQL module, the database name is registered together with table name 
> when "CACHE TABLE" command was executed. 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/cache.scala]
>  
> and  CatalogImpl register cache with received table name. 
> {code:java}
> override def cacheTable(tableName: String): Unit = {
> sparkSession.sharedState.cacheManager.cacheQuery(sparkSession.table(tableName),
>  Some(tableName)) }
> {code}
>  
> Therefore, I would like to propose aligning the behavior. RefreshTable method 
> should reuse the received table name instead. 
>  
> {code:java}
> sparkSession.sharedState.cacheManager.cacheQuery(table, 
> Some(tableIdent.table))
> {code}
> to 
> {code:java}
> sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableName))
>  {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27062) Refresh Table command register table with table name only

2019-03-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27062:


Assignee: Apache Spark

> Refresh Table command register table with table name only
> -
>
> Key: SPARK-27062
> URL: https://issues.apache.org/jira/browse/SPARK-27062
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: William Wong
>Assignee: Apache Spark
>Priority: Major
>  Labels: easyfix
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> If CatalogImpl.refreshTable() method is invoked against a cached table, this 
> method would first uncache corresponding query in the shared state cache 
> manager, and then cache it back to refresh the cache copy. 
> However, the table was recached with only 'table name'. The database name 
> will be missed. Therefore, if cached table is not on the default database, 
> the recreated cache may refer to a different table. For example, we may see 
> the cached table name in driver's storage page will be changed after table 
> refreshing. 
>  
> Here is related code on github for your reference. 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala]
>  
>  
> {code:java}
> override def refreshTable(tableName: String): Unit = {
>   val tableIdent = 
> sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
>   val tableMetadata = 
> sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent)
>   val table = sparkSession.table(tableIdent)
>   if (tableMetadata.tableType == CatalogTableType.VIEW) {
> // Temp or persistent views: refresh (or invalidate) any metadata/data 
> cached
> // in the plan recursively.
> table.queryExecution.analyzed.refresh()
>   } else {
> // Non-temp tables: refresh the metadata cache.
> sessionCatalog.refreshTable(tableIdent)
>   }
>   // If this table is cached as an InMemoryRelation, drop the original
>   // cached version and make the new version cached lazily.
>   if (isCached(table)) {
> // Uncache the logicalPlan.
> sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, 
> blocking = true)
> // Cache it again.
> sparkSession.sharedState.cacheManager.cacheQuery(table, 
> Some(tableIdent.table))
>   }
> }
> {code}
>  
>  
> In Spark SQL module, the database name is registered together with table name 
> when "CACHE TABLE" command was executed. 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/cache.scala]
>  
> and  CatalogImpl register cache with received table name. 
> {code:java}
> override def cacheTable(tableName: String): Unit = {
> sparkSession.sharedState.cacheManager.cacheQuery(sparkSession.table(tableName),
>  Some(tableName)) }
> {code}
>  
> Therefore, I would like to propose aligning the behavior. RefreshTable method 
> should reuse the received table name instead. 
>  
> {code:java}
> sparkSession.sharedState.cacheManager.cacheQuery(table, 
> Some(tableIdent.table))
> {code}
> to 
> {code:java}
> sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableName))
>  {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26602) Insert into table fails after querying the UDF which is loaded with wrong hdfs path

2019-03-05 Thread Ajith S (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784611#comment-16784611
 ] 

Ajith S commented on SPARK-26602:
-

# I have a question about this issue in the thrift-server case. If an admin does an "add jar" with a non-existing jar (maybe a human error), it will cause all ongoing beeline sessions to fail (even a query where the jar is not needed at all), and the only way to recover is a restart of the thrift-server.
 # As you said, "If a user adds something to the classpath, it matters to the whole classpath. If it's missing, I think it's surprising to ignore that fact" - but unless the user refers to the jar, is it OK to fail all of their operations (just like the JVM behaviour)?

Please correct me if I am wrong. cc [~srowen]
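
For reference, a minimal sketch of the scenario from the issue's repro steps, expressed here via spark.sql for illustration (the function, table and class names are hypothetical; the jar path is the one from the issue):
{code:java}
// 1. A UDF backed by a jar that actually exists works fine.
spark.sql("SELECT myFunc1(name) FROM src").show()

// 2. Register and query a UDF whose jar path does not exist on HDFS.
spark.sql("CREATE FUNCTION myFunc2 AS 'com.example.SomeUdf' " +
  "USING JAR 'hdfs:///tmp/hari_notexists1/two_udfs.jar'")
spark.sql("SELECT myFunc2(name) FROM src").show()   // fails: cannot read the jar

// 3. From now on, even unrelated statements in the same session fail with
//    "Failed to read external resource hdfs:///tmp/hari_notexists1/two_udfs.jar".
spark.sql("SELECT myFunc1(name) FROM src").show()
spark.sql("INSERT INTO src VALUES ('x')")
{code}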

> Insert into table fails after querying the UDF which is loaded with wrong 
> hdfs path
> ---
>
> Key: SPARK-26602
> URL: https://issues.apache.org/jira/browse/SPARK-26602
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Haripriya
>Priority: Major
> Attachments: beforeFixUdf.txt
>
>
> In sql,
> 1.Query the existing  udf(say myFunc1)
> 2. create and select the udf registered with incorrect path (say myFunc2)
> 3.Now again query the existing udf  in the same session - Wil throw exception 
> stating that couldn't read resource of myFunc2's path
> 4.Even  the basic operations like insert and select will fail giving the same 
> error
> Result: 
> java.lang.RuntimeException: Failed to read external resource 
> hdfs:///tmp/hari_notexists1/two_udfs.jar
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149)
>  at 
> org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841)
>  at 
> org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27063) Spark on K8S Integration Tests timeouts are too short for some test clusters

2019-03-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27063:


Assignee: (was: Apache Spark)

> Spark on K8S Integration Tests timeouts are too short for some test clusters
> 
>
> Key: SPARK-27063
> URL: https://issues.apache.org/jira/browse/SPARK-27063
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Rob Vesse
>Priority: Minor
>
> As noted during development for SPARK-26729 there are a couple of integration 
> test timeouts that are too short when running on slower clusters e.g. 
> developers laptops, small CI clusters etc
> [~skonto] confirmed that he has also experienced this behaviour in the 
> discussion on PR [PR 
> 23846|https://github.com/apache/spark/pull/23846#discussion_r262564938]
> We should up the defaults of this timeouts as an initial step and longer term 
> consider making the timeouts themselves configurable



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27063) Spark on K8S Integration Tests timeouts are too short for some test clusters

2019-03-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27063:


Assignee: Apache Spark

> Spark on K8S Integration Tests timeouts are too short for some test clusters
> 
>
> Key: SPARK-27063
> URL: https://issues.apache.org/jira/browse/SPARK-27063
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Rob Vesse
>Assignee: Apache Spark
>Priority: Minor
>
> As noted during development for SPARK-26729 there are a couple of integration 
> test timeouts that are too short when running on slower clusters e.g. 
> developers laptops, small CI clusters etc
> [~skonto] confirmed that he has also experienced this behaviour in the 
> discussion on PR [PR 
> 23846|https://github.com/apache/spark/pull/23846#discussion_r262564938]
> We should up the defaults of this timeouts as an initial step and longer term 
> consider making the timeouts themselves configurable



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27064) create StreamingWrite at the begining of streaming execution

2019-03-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27064:


Assignee: Apache Spark  (was: Wenchen Fan)

> create StreamingWrite at the begining of streaming execution
> 
>
> Key: SPARK-27064
> URL: https://issues.apache.org/jira/browse/SPARK-27064
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27064) create StreamingWrite at the begining of streaming execution

2019-03-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27064:


Assignee: Wenchen Fan  (was: Apache Spark)

> create StreamingWrite at the begining of streaming execution
> 
>
> Key: SPARK-27064
> URL: https://issues.apache.org/jira/browse/SPARK-27064
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27060) DDL Commands are accepting Keywords like create, drop as tableName

2019-03-05 Thread Sachin Ramachandra Setty (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784575#comment-16784575
 ] 

Sachin Ramachandra Setty commented on SPARK-27060:
--

I verified this issue with Spark 2.3.2 and 2.4.0.

> DDL Commands are accepting Keywords like create, drop as tableName
> --
>
> Key: SPARK-27060
> URL: https://issues.apache.org/jira/browse/SPARK-27060
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Sachin Ramachandra Setty
>Priority: Minor
>
> Seems to be a compatibility issue compared to other components such as hive 
> and mySql. 
> DDL commands are successful even though the tableName is same as keyword. 
> Tested with columnNames as well and issue exists. 
> Whereas, Hive-Beeline is throwing ParseException and not accepting keywords 
> as tableName or columnName and mySql is accepting keywords only as columnName.
> Spark-Behaviour :
> Connected to: Spark SQL (version 2.3.2.0101)
> CLI_DBMS_APPID
> Beeline version 1.2.1.spark_2.3.2.0101 by Apache Hive
> 0: jdbc:hive2://10.18.3.XXX:23040/default> create table create(id int);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.255 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> create table drop(int int);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.257 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table drop;
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.236 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table create;
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.168 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> create table tab1(float float);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.111 seconds)
> 0: jdbc:hive2://10.18.XXX:23040/default> create table double(double float);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.093 seconds)
> Hive-Behaviour :
> Connected to: Apache Hive (version 3.1.0)
> Driver: Hive JDBC (version 3.1.0)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> Beeline version 3.1.0 by Apache Hive
> 0: jdbc:hive2://10.18.XXX:21066/> create table create(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:13 
> cannot recognize input near 'create' '(' 'id' in table name 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18.XXX:21066/> create table drop(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:13 
> cannot recognize input near 'drop' '(' 'id' in table name 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18XXX:21066/> create table tab1(float float);
> Error: Error while compiling statement: FAILED: ParseException line 1:18 
> cannot recognize input near 'float' 'float' ')' in column name or constraint 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18XXX:21066/> drop table create(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:11 
> cannot recognize input near 'create' '(' 'id' in table name 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18.XXX:21066/> drop table drop(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:11 
> cannot recognize input near 'drop' '(' 'id' in table name 
> (state=42000,code=4)
> mySql :
> CREATE TABLE CREATE(ID integer);
> Error: near "CREATE": syntax error
> CREATE TABLE DROP(ID integer);
> Error: near "DROP": syntax error
> CREATE TABLE TAB1(FLOAT FLOAT);
> Success



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23986) CompileException when using too many avg aggregation after joining

2019-03-05 Thread Pedro Fernandes (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784482#comment-16784482
 ] 

Pedro Fernandes edited comment on SPARK-23986 at 3/5/19 3:38 PM:
-

-Guys,

Is there a workaround for the folks that can't upgrade Spark version?

Thanks.-

Here's my workaround for, say, 10 aggregation operations:
 # dataframe1 = aggregations 1 to 5
 # dataframe2 = aggregations 6 to 10
 # dataframe1.join(dataframe2)
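
For illustration, a minimal sketch of this split-and-join workaround, reusing the df1/df2 example from the issue description (splitting six averages 3 + 3; the exact split is arbitrary):
{code:java}
import org.apache.spark.sql.functions.avg

// Split the aggregations across two groupBy/agg plans and join the results back
// on the grouping key, so fewer aggregate expressions go through a single
// generated doConsume and the parameter-name clash is avoided.
val agg1 = df1.join(df2, df1("key") === df2("key"), "leftouter")
  .groupBy(df1("key"))
  .agg(avg("col1").as("avg1"), avg("col2").as("avg2"), avg("col3").as("avg3"))

val agg2 = df1.join(df2, df1("key") === df2("key"), "leftouter")
  .groupBy(df1("key"))
  .agg(avg("col4").as("avg4"), avg("col5").as("avg5"), avg("col6").as("avg6"))

val combined = agg1.join(agg2, Seq("key"))
combined.show()
{code}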


was (Author: pedromorfeu):
Guys,

Is there a workaround for the folks that can't upgrade Spark version?

Thanks.

> CompileException when using too many avg aggregation after joining
> --
>
> Key: SPARK-23986
> URL: https://issues.apache.org/jira/browse/SPARK-23986
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Michel Davit
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
> Attachments: spark-generated.java
>
>
> Considering the following code:
> {code:java}
> val df1: DataFrame = sparkSession.sparkContext
>   .makeRDD(Seq((0, 1, 2, 3, 4, 5, 6)))
>   .toDF("key", "col1", "col2", "col3", "col4", "col5", "col6")
> val df2: DataFrame = sparkSession.sparkContext
>   .makeRDD(Seq((0, "val1", "val2")))
>   .toDF("key", "dummy1", "dummy2")
> val agg = df1
>   .join(df2, df1("key") === df2("key"), "leftouter")
>   .groupBy(df1("key"))
>   .agg(
> avg("col2").as("avg2"),
> avg("col3").as("avg3"),
> avg("col4").as("avg4"),
> avg("col1").as("avg1"),
> avg("col5").as("avg5"),
> avg("col6").as("avg6")
>   )
> val head = agg.take(1)
> {code}
> This logs the following exception:
> {code:java}
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 467, Column 28: Redefinition of parameter "agg_expr_11"
> {code}
> I am not a spark expert but after investigation, I realized that the 
> generated {{doConsume}} method is responsible of the exception.
> Indeed, {{avg}} calls several times 
> {{org.apache.spark.sql.execution.CodegenSupport.constructDoConsumeFunction}}. 
> The 1st time with the 'avg' Expr and a second time for the base aggregation 
> Expr (count and sum).
> The problem comes from the generation of parameters in CodeGenerator:
> {code:java}
>   /**
>* Returns a term name that is unique within this instance of a 
> `CodegenContext`.
>*/
>   def freshName(name: String): String = synchronized {
> val fullName = if (freshNamePrefix == "") {
>   name
> } else {
>   s"${freshNamePrefix}_$name"
> }
> if (freshNameIds.contains(fullName)) {
>   val id = freshNameIds(fullName)
>   freshNameIds(fullName) = id + 1
>   s"$fullName$id"
> } else {
>   freshNameIds += fullName -> 1
>   fullName
> }
>   }
> {code}
> The {{freshNameIds}} already contains {{agg_expr_[1..6]}} from the 1st call.
>  The second call is made with {{agg_expr_[1..12]}} and generates the 
> following names:
>  {{agg_expr_[11|21|31|41|51|61|11|12]}}. We then have a parameter name 
> conflicts in the generated code: {{agg_expr_11.}}
> Appending the 'id' in s"$fullName$id" to generate unique term name is source 
> of conflict. Maybe simply using undersoce can solve this issue : 
> $fullName_$id"



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27062) Refresh Table command register table with table name only

2019-03-05 Thread William Wong (JIRA)
William Wong created SPARK-27062:


 Summary: Refresh Table command register table with table name only
 Key: SPARK-27062
 URL: https://issues.apache.org/jira/browse/SPARK-27062
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.2
Reporter: William Wong


If the CatalogImpl.refreshTable() method is invoked against a cached table, it first uncaches the corresponding query in the shared state cache manager and then caches it again to refresh the cached copy.

However, the table is recached under the table name only; the database name is dropped. Therefore, if the cached table is not in the default database, the recreated cache entry may refer to a different table. For example, the cached table name shown on the driver's Storage page changes after the table is refreshed.

Here is the related code on GitHub for reference:

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala]
 

 
{code:java}
override def refreshTable(tableName: String): Unit = {
  val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
  val tableMetadata = sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent)
  val table = sparkSession.table(tableIdent)

  if (tableMetadata.tableType == CatalogTableType.VIEW) {
    // Temp or persistent views: refresh (or invalidate) any metadata/data cached
    // in the plan recursively.
    table.queryExecution.analyzed.refresh()
  } else {
    // Non-temp tables: refresh the metadata cache.
    sessionCatalog.refreshTable(tableIdent)
  }

  // If this table is cached as an InMemoryRelation, drop the original
  // cached version and make the new version cached lazily.
  if (isCached(table)) {
    // Uncache the logicalPlan.
    sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, blocking = true)
    // Cache it again.
    sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table))
  }
}
{code}
 

 

In the Spark SQL module, the database name is registered together with the table name when the "CACHE TABLE" command is executed.

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/cache.scala]
 

 

 

Therefore, I would like to propose aligning the behavior: the full table name should also be used in the refreshTable case. We should change the following line in CatalogImpl.refreshTable from

 
{code:java}
sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table))
{code}
to

 

 
{code:java}
sparkSession.sharedState.cacheManager.cacheQuery(table, 
Some(tableIdent.quotedString))
 {code}
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27064) create StreamingWrite at the begining of streaming execution

2019-03-05 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-27064:
---

 Summary: create StreamingWrite at the begining of streaming 
execution
 Key: SPARK-27064
 URL: https://issues.apache.org/jira/browse/SPARK-27064
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27065) avoid more than one active task set managers for a stage

2019-03-05 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-27065:
---

 Summary: avoid more than one active task set managers for a stage
 Key: SPARK-27065
 URL: https://issues.apache.org/jira/browse/SPARK-27065
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 2.4.0, 2.3.3
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26727) CREATE OR REPLACE VIEW query fails with TableAlreadyExistsException

2019-03-05 Thread Ajith S (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784666#comment-16784666
 ] 

Ajith S commented on SPARK-26727:
-

[~rigolaszlo] I see from the stacktrace that ThriftHiveMetastore$Client is used, which is a synchronous client for the metastore. Can you explain how you found that the drop command is async?

> CREATE OR REPLACE VIEW query fails with TableAlreadyExistsException
> ---
>
> Key: SPARK-26727
> URL: https://issues.apache.org/jira/browse/SPARK-26727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Srinivas Yarra
>Priority: Major
>
> We experienced that sometimes the Hive query "CREATE OR REPLACE VIEW <view name> AS SELECT <columns> FROM <table>" fails with the following exception:
> {code:java}
> // code placeholder
> org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or 
> view '' already exists in database 'default'; at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:314)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:165) 
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195) at 
> org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195) at 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364) at 
> org.apache.spark.sql.Dataset.<init>(Dataset.scala:195) at 
> org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:80) at 
> org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642) ... 49 elided
> {code}
> {code}
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res1: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res2: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res3: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res4: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res5: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res6: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res7: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res8: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res9: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res10: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res11: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") 
> org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or 
> view 'testsparkreplace' already exists in database 'default'; at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:246)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:236)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:236)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:236)
>  at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94)
>  at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:319)
>  at 
> 

[jira] [Updated] (SPARK-27060) DDL Commands are accepting Keywords like create, drop as tableName

2019-03-05 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-27060:
--
Target Version/s:   (was: 2.4.0)
Priority: Minor  (was: Major)
   Fix Version/s: (was: 2.3.2)
  (was: 2.4.0)

Don't set Fix or Target Version.
This isn't my area, but I agree it seems surprising if you can create a table 
called "CREATE".
Please post your Spark reproduction and version though.

> DDL Commands are accepting Keywords like create, drop as tableName
> --
>
> Key: SPARK-27060
> URL: https://issues.apache.org/jira/browse/SPARK-27060
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Sachin Ramachandra Setty
>Priority: Minor
>
> Seems to be a compatibility issue compared to other components such as hive 
> and mySql. 
> DDL commands are successful even though the tableName is same as keyword. 
> Tested with columnNames as well and issue exists. 
> Whereas, Hive-Beeline is throwing ParseException and not accepting keywords 
> as tableName or columnName and mySql is accepting keywords only as columnName.
> Spark-Behaviour :
> Connected to: Spark SQL (version 2.3.2.0101)
> CLI_DBMS_APPID
> Beeline version 1.2.1.spark_2.3.2.0101 by Apache Hive
> 0: jdbc:hive2://10.18.3.XXX:23040/default> create table create(id int);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.255 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> create table drop(int int);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.257 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table drop;
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.236 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table create;
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.168 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> create table tab1(float float);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.111 seconds)
> 0: jdbc:hive2://10.18.XXX:23040/default> create table double(double float);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.093 seconds)
> Hive-Behaviour :
> Connected to: Apache Hive (version 3.1.0)
> Driver: Hive JDBC (version 3.1.0)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> Beeline version 3.1.0 by Apache Hive
> 0: jdbc:hive2://10.18.XXX:21066/> create table create(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:13 
> cannot recognize input near 'create' '(' 'id' in table name 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18.XXX:21066/> create table drop(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:13 
> cannot recognize input near 'drop' '(' 'id' in table name 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18XXX:21066/> create table tab1(float float);
> Error: Error while compiling statement: FAILED: ParseException line 1:18 
> cannot recognize input near 'float' 'float' ')' in column name or constraint 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18XXX:21066/> drop table create(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:11 
> cannot recognize input near 'create' '(' 'id' in table name 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18.XXX:21066/> drop table drop(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:11 
> cannot recognize input near 'drop' '(' 'id' in table name 
> (state=42000,code=4)
> mySql :
> CREATE TABLE CREATE(ID integer);
> Error: near "CREATE": syntax error
> CREATE TABLE DROP(ID integer);
> Error: near "DROP": syntax error
> CREATE TABLE TAB1(FLOAT FLOAT);
> Success



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23986) CompileException when using too many avg aggregation after joining

2019-03-05 Thread Pedro Fernandes (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784482#comment-16784482
 ] 

Pedro Fernandes edited comment on SPARK-23986 at 3/5/19 3:38 PM:
-

~Guys,

Is there a workaround for the folks that can't upgrade Spark version?

Thanks.~

Here's my workaround for, say, 10 aggregation operations:
 # dataframe1 = aggregations 1 to 5
 # dataframe2 = aggregations 6 to 10
 # dataframe1.join(dataframe2)


was (Author: pedromorfeu):
-Guys,

Is there a workaround for the folks that can't upgrade Spark version?

Thanks.-

Here's my workaround for, say, 10 aggregation operations:
 # dataframe1 = aggregations 1 to 5
 # dataframe2 = aggregations 6 to 10
 # dataframe1.join(dataframe2)

> CompileException when using too many avg aggregation after joining
> --
>
> Key: SPARK-23986
> URL: https://issues.apache.org/jira/browse/SPARK-23986
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Michel Davit
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
> Attachments: spark-generated.java
>
>
> Considering the following code:
> {code:java}
> val df1: DataFrame = sparkSession.sparkContext
>   .makeRDD(Seq((0, 1, 2, 3, 4, 5, 6)))
>   .toDF("key", "col1", "col2", "col3", "col4", "col5", "col6")
> val df2: DataFrame = sparkSession.sparkContext
>   .makeRDD(Seq((0, "val1", "val2")))
>   .toDF("key", "dummy1", "dummy2")
> val agg = df1
>   .join(df2, df1("key") === df2("key"), "leftouter")
>   .groupBy(df1("key"))
>   .agg(
> avg("col2").as("avg2"),
> avg("col3").as("avg3"),
> avg("col4").as("avg4"),
> avg("col1").as("avg1"),
> avg("col5").as("avg5"),
> avg("col6").as("avg6")
>   )
> val head = agg.take(1)
> {code}
> This logs the following exception:
> {code:java}
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 467, Column 28: Redefinition of parameter "agg_expr_11"
> {code}
> I am not a spark expert but after investigation, I realized that the 
> generated {{doConsume}} method is responsible of the exception.
> Indeed, {{avg}} calls several times 
> {{org.apache.spark.sql.execution.CodegenSupport.constructDoConsumeFunction}}. 
> The 1st time with the 'avg' Expr and a second time for the base aggregation 
> Expr (count and sum).
> The problem comes from the generation of parameters in CodeGenerator:
> {code:java}
>   /**
>* Returns a term name that is unique within this instance of a 
> `CodegenContext`.
>*/
>   def freshName(name: String): String = synchronized {
> val fullName = if (freshNamePrefix == "") {
>   name
> } else {
>   s"${freshNamePrefix}_$name"
> }
> if (freshNameIds.contains(fullName)) {
>   val id = freshNameIds(fullName)
>   freshNameIds(fullName) = id + 1
>   s"$fullName$id"
> } else {
>   freshNameIds += fullName -> 1
>   fullName
> }
>   }
> {code}
> The {{freshNameIds}} already contains {{agg_expr_[1..6]}} from the 1st call.
>  The second call is made with {{agg_expr_[1..12]}} and generates the 
> following names:
>  {{agg_expr_[11|21|31|41|51|61|11|12]}}. We then have a parameter name 
> conflicts in the generated code: {{agg_expr_11.}}
> Appending the 'id' in s"$fullName$id" to generate unique term name is source 
> of conflict. Maybe simply using undersoce can solve this issue : 
> $fullName_$id"



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27061) Expose 4040 port on driver service to access logs using service

2019-03-05 Thread Chandu Kavar (JIRA)
Chandu Kavar created SPARK-27061:


 Summary: Expose 4040 port on driver service to access logs using 
service
 Key: SPARK-27061
 URL: https://issues.apache.org/jira/browse/SPARK-27061
 Project: Spark
  Issue Type: Task
  Components: Kubernetes
Affects Versions: 2.4.0
Reporter: Chandu Kavar


Currently, we can access the driver logs using

{{kubectl port-forward <driver-pod-name> 4040:4040}}

as mentioned in 
[https://spark.apache.org/docs/latest/running-on-kubernetes.html#accessing-driver-ui]

We have users who submit Spark jobs to Kubernetes but don't have access to the cluster, so they can't use the kubectl port-forward command.

If we expose port 4040 on the driver service, we can easily relay these logs to a UI using the driver service and an Nginx reverse proxy.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27062) Refresh Table command register table with table name only

2019-03-05 Thread William Wong (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Wong updated SPARK-27062:
-
Description: 
If the CatalogImpl.refreshTable() method is invoked against a cached table, it first uncaches the corresponding query in the shared state cache manager and then caches it again to refresh the cached copy.

However, the table is recached under the table name only; the database name is dropped. Therefore, if the cached table is not in the default database, the recreated cache entry may refer to a different table. For example, the cached table name shown on the driver's Storage page changes after the table is refreshed.

Here is the related code on GitHub for reference:

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala]
 

 
{code:java}
override def refreshTable(tableName: String): Unit = {
  val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
  val tableMetadata = sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent)
  val table = sparkSession.table(tableIdent)

  if (tableMetadata.tableType == CatalogTableType.VIEW) {
    // Temp or persistent views: refresh (or invalidate) any metadata/data cached
    // in the plan recursively.
    table.queryExecution.analyzed.refresh()
  } else {
    // Non-temp tables: refresh the metadata cache.
    sessionCatalog.refreshTable(tableIdent)
  }

  // If this table is cached as an InMemoryRelation, drop the original
  // cached version and make the new version cached lazily.
  if (isCached(table)) {
    // Uncache the logicalPlan.
    sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, blocking = true)
    // Cache it again.
    sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table))
  }
}
{code}
 

 

In the Spark SQL module, the database name is registered together with the 
table name when the "CACHE TABLE" command is executed.

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/cache.scala]
 

and CatalogImpl registers the cache with the received table name:
{code:java}
override def cacheTable(tableName: String): Unit = {
  sparkSession.sharedState.cacheManager.cacheQuery(sparkSession.table(tableName), Some(tableName))
}
{code}
 

Therefore, I would like to propose aligning the behavior: the refreshTable 
method should reuse the received table name instead, changing

 
{code:java}
sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table))
{code}
to 
{code:java}
sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableName))
 {code}
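
For illustration, a minimal reproduction sketch of the symptom described above (the database and table names are hypothetical):
{code:java}
spark.sql("CREATE DATABASE IF NOT EXISTS mydb")
spark.sql("CREATE TABLE IF NOT EXISTS mydb.t1 (id INT) USING parquet")

// The cache entry is registered under "mydb.t1" ...
spark.catalog.cacheTable("mydb.t1")

// ... but after the refresh, the recreated entry is registered as "t1" only,
// which is the name that shows up on the driver's Storage page.
spark.catalog.refreshTable("mydb.t1")
{code}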
 

  was:
If the CatalogImpl.refreshTable() method is invoked against a cached table, it 
first uncaches the corresponding query in the shared-state cache manager and 
then caches it again to refresh the cached copy.

However, the table is recached with the table name only; the database name is 
dropped. Therefore, if the cached table is not in the default database, the 
recreated cache entry may refer to a different table. For example, the cached 
table name shown on the driver's Storage page changes after the table is 
refreshed.

 

Here is the related code on GitHub for reference:

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala]
 

 
{code:java}
override def refreshTable(tableName: String): Unit = {
  val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
  val tableMetadata = sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent)
  val table = sparkSession.table(tableIdent)

  if (tableMetadata.tableType == CatalogTableType.VIEW) {
    // Temp or persistent views: refresh (or invalidate) any metadata/data cached
    // in the plan recursively.
    table.queryExecution.analyzed.refresh()
  } else {
    // Non-temp tables: refresh the metadata cache.
    sessionCatalog.refreshTable(tableIdent)
  }

  // If this table is cached as an InMemoryRelation, drop the original
  // cached version and make the new version cached lazily.
  if (isCached(table)) {
    // Uncache the logicalPlan.
    sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, blocking = true)
    // Cache it again.
    sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table))
  }
}
{code}
 

 

In the Spark SQL module, the database name is registered together with the 
table name when the "CACHE TABLE" command is executed.

[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/cache.scala]
 

 

 

Therefore, I would like to propose aligning the behavior: the full table name 
should also be used in the refreshTable case. We should change the following 
line in CatalogImpl.refreshTable from

 
{code:java}

[jira] [Comment Edited] (SPARK-26602) Insert into table fails after querying the UDF which is loaded with wrong hdfs path

2019-03-05 Thread Ajith S (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784611#comment-16784611
 ] 

Ajith S edited comment on SPARK-26602 at 3/5/19 4:15 PM:
-

# I have a question about this issue in the thrift-server case. If an admin 
does an ADD JAR with a non-existent jar (perhaps a human error), it causes all 
ongoing beeline sessions to fail (even queries where the jar is not needed at 
all), and the only way to recover is to restart the thrift-server. A rough 
sketch of that sequence is shown below.
 # As you said, "If a user adds something to the classpath, it matters to the 
whole classpath. If it's missing, I think it's surprising to ignore that fact" 
- but unless the user actually refers to the jar, is it OK to fail all of their 
operations? (Just like JVM behaviour: we get a ClassNotFoundException only when 
the missing class is actually referenced; until then the JVM runs happily.)

Please correct me if I am wrong. cc [~srowen]
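
For illustration only - a rough sketch of that sequence through the SQL API (in the thrift-server case the same failure then hits other beeline sessions too); the class name is a placeholder and the jar path is the non-existent one from the issue description quoted below:
{code:java}
spark.sql("SELECT myFunc1(1)")   // existing UDF works

spark.sql(
  "CREATE FUNCTION myFunc2 AS 'com.example.SomeUdf' " +
  "USING JAR 'hdfs:///tmp/hari_notexists1/two_udfs.jar'")  // jar path does not exist
spark.sql("SELECT myFunc2(1)")   // fails: the jar cannot be downloaded

spark.sql("SELECT myFunc1(1)")   // the previously working UDF now fails with the same error
{code}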


was (Author: ajithshetty):
# I have a question about this issue in the thrift-server case. If an admin 
does an ADD JAR with a non-existent jar (perhaps a human error), it causes all 
ongoing beeline sessions to fail (even queries where the jar is not needed at 
all), and the only way to recover is to restart the thrift-server.
 # As you said, "If a user adds something to the classpath, it matters to the 
whole classpath. If it's missing, I think it's surprising to ignore that fact" 
- but unless the user actually refers to the jar, is it OK to fail all of their 
operations? (just like JVM behaviour)

Please correct me if I am wrong. cc [~srowen]

> Insert into table fails after querying the UDF which is loaded with wrong 
> hdfs path
> ---
>
> Key: SPARK-26602
> URL: https://issues.apache.org/jira/browse/SPARK-26602
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Haripriya
>Priority: Major
> Attachments: beforeFixUdf.txt
>
>
> In SQL:
> 1. Query an existing UDF (say myFunc1).
> 2. Create and select a UDF registered with an incorrect jar path (say myFunc2).
> 3. Now query the existing UDF again in the same session - it will throw an 
> exception stating that the resource at myFunc2's path couldn't be read.
> 4. Even basic operations like insert and select will fail with the same 
> error.
> Result: 
> java.lang.RuntimeException: Failed to read external resource 
> hdfs:///tmp/hari_notexists1/two_udfs.jar
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149)
>  at 
> org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841)
>  at 
> org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27063) Spark on K8S Integration Tests timeouts are too short for some test clusters

2019-03-05 Thread Rob Vesse (JIRA)
Rob Vesse created SPARK-27063:
-

 Summary: Spark on K8S Integration Tests timeouts are too short for 
some test clusters
 Key: SPARK-27063
 URL: https://issues.apache.org/jira/browse/SPARK-27063
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 2.4.0
Reporter: Rob Vesse


As noted during development for SPARK-26729, there are a couple of integration 
test timeouts that are too short when running on slower clusters, e.g. 
developers' laptops, small CI clusters, etc.

[~skonto] confirmed that he has also experienced this behaviour in the 
discussion on [PR 
23846|https://github.com/apache/spark/pull/23846#discussion_r262564938]

We should increase the default values of these timeouts as an initial step 
and, longer term, consider making the timeouts themselves configurable, for 
example along the lines of the sketch below.
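
A minimal sketch of what a configurable timeout could look like in a ScalaTest-based integration test; the system property name below is hypothetical, not an existing Spark setting:
{code:java}
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.{Minutes, Seconds, Span}

// Hypothetical property; default to a generous patience for slow clusters.
val patienceMinutes =
  sys.props.get("spark.kubernetes.test.patienceMinutes").map(_.toInt).getOrElse(3)

// Stub standing in for a real check against the cluster.
def driverPodIsRunning(): Boolean = true

eventually(timeout(Span(patienceMinutes, Minutes)), interval(Span(5, Seconds))) {
  // Replace with the real assertion, e.g. that the driver pod reached "Running".
  assert(driverPodIsRunning())
}
{code}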




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23986) CompileException when using too many avg aggregation after joining

2019-03-05 Thread Pedro Fernandes (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784482#comment-16784482
 ] 

Pedro Fernandes edited comment on SPARK-23986 at 3/5/19 3:39 PM:
-

-Guys, is there a workaround for the folks that can't upgrade Spark version? 
Thanks.-

Here's my workaround for, say, 10 aggregation operations (see the sketch after 
the list):
 # dataframe1 = aggregations 1 to 5
 # dataframe2 = aggregations 6 to 10
 # dataframe1.join(dataframe2)
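
A minimal sketch of that workaround against the df1/df2 setup in the issue description below; the split point (three aggregations per half) is arbitrary:
{code:java}
import org.apache.spark.sql.functions.avg

// First half of the aggregations.
val aggA = df1
  .join(df2, df1("key") === df2("key"), "leftouter")
  .groupBy(df1("key"))
  .agg(avg("col1").as("avg1"), avg("col2").as("avg2"), avg("col3").as("avg3"))

// Second half of the aggregations.
val aggB = df1
  .join(df2, df1("key") === df2("key"), "leftouter")
  .groupBy(df1("key"))
  .agg(avg("col4").as("avg4"), avg("col5").as("avg5"), avg("col6").as("avg6"))

// Re-join the two halves on the grouping key, so each generated
// aggregation stage handles fewer expressions.
val agg = aggA.join(aggB, "key")
val head = agg.take(1)
{code}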


was (Author: pedromorfeu):
~Guys,

Is there a workaround for the folks that can't upgrade Spark version?

Thanks.~

Here's my workaround for, say, 10 aggregation operations:
 # dataframe1 = aggregations 1 to 5
 # dataframe2 = aggregations 6 to 10
 # dataframe1.join(dataframe2)

> CompileException when using too many avg aggregation after joining
> --
>
> Key: SPARK-23986
> URL: https://issues.apache.org/jira/browse/SPARK-23986
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Michel Davit
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
> Attachments: spark-generated.java
>
>
> Considering the following code:
> {code:java}
> val df1: DataFrame = sparkSession.sparkContext
>   .makeRDD(Seq((0, 1, 2, 3, 4, 5, 6)))
>   .toDF("key", "col1", "col2", "col3", "col4", "col5", "col6")
> val df2: DataFrame = sparkSession.sparkContext
>   .makeRDD(Seq((0, "val1", "val2")))
>   .toDF("key", "dummy1", "dummy2")
> val agg = df1
>   .join(df2, df1("key") === df2("key"), "leftouter")
>   .groupBy(df1("key"))
>   .agg(
> avg("col2").as("avg2"),
> avg("col3").as("avg3"),
> avg("col4").as("avg4"),
> avg("col1").as("avg1"),
> avg("col5").as("avg5"),
> avg("col6").as("avg6")
>   )
> val head = agg.take(1)
> {code}
> This logs the following exception:
> {code:java}
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 467, Column 28: Redefinition of parameter "agg_expr_11"
> {code}
> I am not a Spark expert, but after investigation I realized that the 
> generated {{doConsume}} method is responsible for the exception.
> Indeed, {{avg}} calls 
> {{org.apache.spark.sql.execution.CodegenSupport.constructDoConsumeFunction}} 
> several times: the first time with the 'avg' Expr and a second time for the 
> base aggregation Exprs (count and sum).
> The problem comes from the generation of parameters in CodeGenerator:
> {code:java}
>   /**
>    * Returns a term name that is unique within this instance of a
>    * `CodegenContext`.
>    */
>   def freshName(name: String): String = synchronized {
> val fullName = if (freshNamePrefix == "") {
>   name
> } else {
>   s"${freshNamePrefix}_$name"
> }
> if (freshNameIds.contains(fullName)) {
>   val id = freshNameIds(fullName)
>   freshNameIds(fullName) = id + 1
>   s"$fullName$id"
> } else {
>   freshNameIds += fullName -> 1
>   fullName
> }
>   }
> {code}
> The {{freshNameIds}} map already contains {{agg_expr_[1..6]}} from the 1st 
> call. The second call is made with {{agg_expr_[1..12]}} and generates the 
> following names:
> {{agg_expr_[11|21|31|41|51|61|11|12]}}. We then have a parameter name 
> conflict in the generated code: {{agg_expr_11}}.
> Appending the 'id' in s"$fullName$id" to generate a unique term name is the 
> source of the conflict. Maybe simply using an underscore can solve this 
> issue: s"${fullName}_$id"
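
To make the collision concrete, a tiny standalone illustration (not Spark code) of how appending the numeric id can reproduce a name that freshName later hands out verbatim:
{code:java}
// First call: "agg_expr_1" is new, so it is returned as-is and its id becomes 1.
// Second call: "agg_expr_1" is already known, so it becomes "agg_expr_1" + 1 =
// "agg_expr_11", which collides with the brand-new name "agg_expr_11" requested
// in the same batch.
val renamed = "agg_expr_1" + 1
val fresh = "agg_expr_11"
assert(renamed == fresh)  // same parameter name, hence the compile error
{code}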



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


