[jira] [Commented] (SPARK-26998) spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor processes in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-26998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16785277#comment-16785277 ] Jungtaek Lim commented on SPARK-26998: -- [~toopt4] Yeah, I tend to agree that hiding more credentials is better, so I'm supportive of the change. Maybe I was thinking of the description of the JIRA issue where your patch originally landed. Btw, are there any existing tests or manual tests to verify that the keystore password and key password are not used? Just curious, I honestly don't know about it. > spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor > processes in Standalone mode > --- > > Key: SPARK-26998 > URL: https://issues.apache.org/jira/browse/SPARK-26998 > Project: Spark > Issue Type: Bug > Components: Scheduler, Security, Spark Core >Affects Versions: 2.3.3, 2.4.0 >Reporter: t oo >Priority: Major > Labels: SECURITY, Security, secur, security, security-issue > > Run Spark in standalone mode, then start a spark-submit requiring at least 1 > executor. Do a 'ps -ef' on linux (ie putty terminal) and you will be able to > see the spark.ssl.keyStorePassword value in plaintext! > > spark.ssl.keyStorePassword and spark.ssl.keyPassword don't need to be passed > to CoarseGrainedExecutorBackend. Only spark.ssl.trustStorePassword is used. > > Can be resolved if below PR is merged: > [[Github] Pull Request #21514 > (tooptoop4)|https://github.com/apache/spark/pull/21514] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
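The leak described above happens because the password is handed to CoarseGrainedExecutorBackend as a plain command-line argument, and everything in a process's argv is world-readable via /proc, which is exactly what `ps -ef` prints. As a minimal illustration of the mitigation direction (mask credential-like config keys before they ever reach a command line or a log), here is a hedged sketch; the `redact` helper and the key pattern are illustrative, not Spark's actual redaction API:

```scala
// Illustrative sketch: mask values of credential-like config keys.
// The pattern is similar in spirit to Spark's spark.redaction.regex default,
// but this object is NOT Spark code.
object RedactionSketch {
  // Keys that look like credentials (case-insensitive).
  private val SecretKeyPattern = "(?i)secret|password|token|key".r

  // Replace the value of any credential-like key with a placeholder.
  def redact(conf: Seq[(String, String)]): Seq[(String, String)] =
    conf.map { case (k, v) =>
      if (SecretKeyPattern.findFirstIn(k).isDefined) (k, "(redacted)") else (k, v)
    }

  def main(args: Array[String]): Unit = {
    val conf = Seq(
      "spark.ssl.keyStorePassword" -> "hunter2",
      "spark.app.name"             -> "demo")
    redact(conf).foreach { case (k, v) => println(s"$k=$v") }
  }
}
```

Spark exposes a related idea via `spark.redaction.regex` for UI and log output; the point of the linked PR is stronger still: if the executor never needs the keystore/key passwords, the cleanest fix is to not pass them at all.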
[jira] [Created] (SPARK-27069) Spark(2.3.1) LDA transformation memory error (java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123))
TAESUK KIM created SPARK-27069: -- Summary: Spark(2.3.1) LDA transformation memory error (java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)) Key: SPARK-27069 URL: https://issues.apache.org/jira/browse/SPARK-27069 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.3.2 Environment: Below is my environment. DataSet # Documents: about 100,000,000 --> 10,000,000 --> 1,000,000 (all fail) # Words: about 3,553,918 (can't change) Spark environment # executor-memory, driver-memory: 18G --> 32G --> 64G --> 128G (all fail) # executor-core, driver-core: 3 # spark.serializer: default and org.apache.spark.serializer.KryoSerializer (both fail) # spark.executor.memoryOverhead: 18G --> 36G (fail) Java version: 1.8.0_191 (Oracle Corporation) Reporter: TAESUK KIM I trained LDA (feature dimension: 100, iterations: 100 or 50, distributed version, ml) using Spark 2.3.2 (emr-5.18.0). After that I want to transform a new DataSet using that model. But when I transform new data, I always get a memory-related error. I changed the data size from x 0.1 to x 0.01, but I always get the memory error (java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)). That hugeCapacity error (overflow) happens when the size of an array exceeds Integer.MAX_VALUE - 8. But I changed the data size to a small size, and I can't find why this error happens. And I want to change the serializer to KryoSerializer, but I found that org.apache.spark.util.ClosureCleaner$.ensureSerializable always calls org.apache.spark.serializer.JavaSerializationStream even though I register Kryo classes. Is there anything I can do?
Below is the code: {noformat}
val countvModel = CountVectorizerModel.load("s3://~/")
val ldaModel = DistributedLDAModel.load("s3://~/")
val transformeddata = countvModel.transform(inputData).select("productid", "itemid", "ptkString", "features")
var featureldaDF = ldaModel.transform(transformeddata).select("productid", "itemid", "topicDistribution", "ptkString").toDF("productid", "itemid", "features", "ptkString")
featureldaDF = featureldaDF.persist // this is line 328
{noformat} Other testing # Java option: UseParallelGC, UseG1GC (all fail) Below is the log {{19/03/05 20:59:03 ERROR ApplicationMaster: User class threw exception: java.lang.OutOfMemoryError java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153) at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:342) at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:335) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159) at org.apache.spark.SparkContext.clean(SparkContext.scala:2299) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:850) at
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:849) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:849) at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:608) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at org.apache.spark.sql.execution.columnar.InMemoryRelation.buildBuffers(InMemoryRelation.scala:107) at
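Two details in the report above are worth making explicit. First, the failing check is a JVM limit on a single array, not an executor-memory limit: once one serialized buffer needs more than Integer.MAX_VALUE - 8 bytes (~2 GB), raising executor memory from 18 GB to 128 GB cannot help. Second, the closure cleaner serializes with Java serialization regardless of spark.serializer, which is why JavaSerializationStream still appears in the trace after registering Kryo classes. A self-contained model of the capacity check that is overflowing (patterned after java.io.ByteArrayOutputStream.hugeCapacity; this is a sketch, not the JDK source):

```scala
// Model of the JDK's ByteArrayOutputStream growth limit.
object ArrayLimitSketch {
  // JVM arrays reserve a few header words, so the practical cap for a
  // byte[] is slightly under Integer.MAX_VALUE elements (~2 GB).
  val MaxArraySize: Int = Integer.MAX_VALUE - 8

  // Mirrors hugeCapacity(int): a negative request means int arithmetic
  // already overflowed, which surfaces as the OutOfMemoryError above.
  def hugeCapacity(minCapacity: Int): Int = {
    if (minCapacity < 0) throw new OutOfMemoryError("int overflow")
    if (minCapacity > MaxArraySize) Integer.MAX_VALUE else MaxArraySize
  }
}
```

The practical workarounds are therefore structural: shrink whatever single object is being serialized (for example, avoid capturing large model state in a closure, or repartition so no one buffer approaches 2 GB), rather than adding memory or switching serializers.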
[jira] [Created] (SPARK-27056) Remove `start-shuffle-service.sh`
liuxian created SPARK-27056: --- Summary: Remove `start-shuffle-service.sh` Key: SPARK-27056 URL: https://issues.apache.org/jira/browse/SPARK-27056 Project: Spark Issue Type: Improvement Components: Mesos Affects Versions: 3.0.0 Reporter: liuxian _start-shuffle-service.sh_ was only used by Mesos before _start-mesos-shuffle-service.sh_ existed. _start-mesos-shuffle-service.sh_ solves some problems and is better than _start-shuffle-service.sh_, so we should now delete _start-shuffle-service.sh_ so that users do not use it by mistake.
[jira] [Commented] (SPARK-23339) Spark UI not loading *.js/*.css files, only raw HTML
[ https://issues.apache.org/jira/browse/SPARK-23339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784204#comment-16784204 ] gary commented on SPARK-23339: -- Exclude servlet-api-2.5.jar and use servlet-api-3.1.0.jar. It works for me. > Spark UI not loading *.js/*.css files, only raw HTML > > > Key: SPARK-23339 > URL: https://issues.apache.org/jira/browse/SPARK-23339 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.2.0 > Environment: Spark 2.2.0, YARN, 2 Ubuntu 16.04 nodes, openjdk > 1.8.0_151 >Reporter: Erik Baumert >Priority: Major > Attachments: 3LCeC.png > > > I have never reported anything before, and hope this is the right place as I > think I have come across a bug. If I missed the solution, please feel free to > correct me. > I set up Spark 2.2.0 on a 2-node Ubuntu cluster. I use a Jupyter notebook to > access the pyspark shell. However, the UI via > [http://IP:4040/|http://ip:4040/] is broken. Has anyone ever seen something > like this? > When I inspect the page in Chrome, it says "Failed to load resource: > net::ERR_EMPTY_RESPONSE" for various .js and .css files. > I did a fresh install and added my configurations until the problem occurred > again. Everything works fine until I edit spark-defaults.conf to > contain the following lines: > spark.driver.extraClassPath > /usr/local/phoenix/phoenix-4.13.0-HBase-1.3-client.jar > spark.executor.extraClassPath > /usr/local/phoenix/phoenix-4.13.0-HBase-1.3-client.jar > How can I add these jars to my class path without breaking the UI? If I just > supply them using the --jars parameter in the Terminal it works fine. 
But I'd > like to have them configured, as explained in the manual: > [https://phoenix.apache.org/phoenix_spark.html] > > I posted the question on Stackoverflow some time ago > [here|https://stackoverflow.com/questions/47291547/spark-ui-fails-to-load-js-displays-bare-html] > and apparently I'm not the only one > ([here|https://stackoverflow.com/questions/47875064/spark-ui-appears-with-wrong-format]).
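For context on why a single extra jar breaks the UI: the phoenix client jar is a fat jar that bundles an old servlet-api 2.5, and spark.driver.extraClassPath places those classes on the driver's classpath ahead of the servlet 3.x classes Spark's embedded Jetty expects, so the static-resource handlers for .js/.css fail (hence gary's fix of excluding servlet-api-2.5.jar). Below is a hedged, illustrative sketch (not a Spark tool) of spotting this kind of duplicate-artifact conflict from jar file names:

```scala
// Illustrative sketch: flag artifacts that appear on a classpath in more
// than one version, e.g. a shaded servlet-api next to Spark's own.
object ClasspathConflictSketch {
  // Matches names like "servlet-api-2.5.jar" -> ("servlet-api", "2.5").
  private val JarName = """(.+)-(\d[\d.]*)\.jar""".r

  // Map from artifact name to its versions, keeping only conflicts.
  def conflicts(jars: Seq[String]): Map[String, Seq[String]] =
    jars.flatMap {
      case JarName(artifact, version) => Some(artifact -> version)
      case _                          => None
    }.groupBy(_._1).collect {
      case (artifact, vs) if vs.map(_._2).distinct.size > 1 =>
        artifact -> vs.map(_._2)
    }
}
```

Running such a check over the driver classpath plus a fat jar's contents would surface "servlet-api" in versions 2.5 and 3.1.0, which is exactly the conflict reported here.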
[jira] [Created] (SPARK-27055) Update Structured Streaming documentation because of DSv2 changes
Gabor Somogyi created SPARK-27055: - Summary: Update Structured Streaming documentation because of DSv2 changes Key: SPARK-27055 URL: https://issues.apache.org/jira/browse/SPARK-27055 Project: Spark Issue Type: Task Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Gabor Somogyi Since SPARK-26956 has been merged the Structured Streaming documentation has to be updated also to reflect the changes.
[jira] [Assigned] (SPARK-27056) Remove `start-shuffle-service.sh`
[ https://issues.apache.org/jira/browse/SPARK-27056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27056: Assignee: (was: Apache Spark) > Remove `start-shuffle-service.sh` > -- > > Key: SPARK-27056 > URL: https://issues.apache.org/jira/browse/SPARK-27056 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Minor > > _start-shuffle-service.sh_ was only used by Mesos before > _start-mesos-shuffle-service.sh_. > Obviously, _start-mesos-shuffle-service.sh_ solves some problems, it is > better than _start-shuffle-service.sh_. > So now we should delete _start-shuffle-service.sh_ in case users use it.
[jira] [Assigned] (SPARK-27056) Remove `start-shuffle-service.sh`
[ https://issues.apache.org/jira/browse/SPARK-27056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27056: Assignee: Apache Spark > Remove `start-shuffle-service.sh` > -- > > Key: SPARK-27056 > URL: https://issues.apache.org/jira/browse/SPARK-27056 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 3.0.0 >Reporter: liuxian >Assignee: Apache Spark >Priority: Minor > > _start-shuffle-service.sh_ was only used by Mesos before > _start-mesos-shuffle-service.sh_. > Obviously, _start-mesos-shuffle-service.sh_ solves some problems, it is > better than _start-shuffle-service.sh_. > So now we should delete _start-shuffle-service.sh_ in case users use it.
[jira] [Comment Edited] (SPARK-27019) Spark UI's SQL tab shows inconsistent values
[ https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784194#comment-16784194 ] Shahid K I edited comment on SPARK-27019 at 3/5/19 7:59 AM: Please upload the screenshot of the sql page of the second scenario. I don't think in that case it will display like that. The issue happens only when the new live execution data is overwritten by the existing one was (Author: shahid): Please show me the screenshot of the sql page of the second scenario. I don't think in that case it will display like that. The issue happens only when the new live execution data is overwritten by the existing one > Spark UI's SQL tab shows inconsistent values > > > Key: SPARK-27019 > URL: https://issues.apache.org/jira/browse/SPARK-27019 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.4.0 >Reporter: peay >Priority: Major > Attachments: Screenshot from 2019-03-01 21-31-48.png, > application_1550040445209_4748, query-1-details.png, query-1-list.png, > query-job-1.png, screenshot-spark-ui-details.png, screenshot-spark-ui-list.png > > > Since 2.4.0, I am frequently seeing broken outputs in the SQL tab of the > Spark UI, where submitted/duration make no sense, description has the ID > instead of the actual description. > Clicking on the link to open a query, the SQL plan is missing as well. > I have tried to increase `spark.scheduler.listenerbus.eventqueue.capacity` to > very large values like 30k out of paranoia that we may have too many events, > but to no avail. I have not identified anything particular that leads to > that: it doesn't occur in all my jobs, but it does occur in a lot of them > still.
[jira] [Commented] (SPARK-20415) SPARK job hangs while writing DataFrame to HDFS
[ https://issues.apache.org/jira/browse/SPARK-20415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784200#comment-16784200 ] Martin Studer commented on SPARK-20415: --- I'm observing a similar issue where all executor tasks would hang in the following state: {noformat} org.apache.spark.unsafe.Platform.copyMemory(Platform.java:210) org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.writeToMemory(UnsafeArrayData.java:363) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_0$(Unknown Source) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$apply$5.apply(GenerateExec.scala:120) org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$apply$5.apply(GenerateExec.scala:118) scala.collection.Iterator$$anon$11.next(Iterator.scala:409) org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:167) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) org.apache.spark.scheduler.Task.run(Task.scala:108) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) java.lang.Thread.run(Thread.java:745) {noformat} This is with Spark 2.2.0. 
> SPARK job hangs while writing DataFrame to HDFS > --- > > Key: SPARK-20415 > URL: https://issues.apache.org/jira/browse/SPARK-20415 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 2.1.0 > Environment: EMR 5.4.0 >Reporter: P K >Priority: Major > > We are in POC phase with Spark. One of the Steps is reading compressed json > files that come from sources, "explode" them into tabular format and then > write them to HDFS. This worked for about three weeks until a few days ago, > for a particular dataset, the writer just hangs. I logged in to the worker > machines and see this stack trace: > "Executor task launch worker-0" #39 daemon prio=5 os_prio=0 > tid=0x7f6210352800 nid=0x4542 runnable [0x7f61f52b3000] >java.lang.Thread.State: RUNNABLE > at org.apache.spark.unsafe.Platform.copyMemory(Platform.java:210) > at > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.writeToMemory(UnsafeArrayData.java:311) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply6_2$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:111) > at > org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:109) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) > at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211) > at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190) > at >
[jira] [Commented] (SPARK-27019) Spark UI's SQL tab shows inconsistent values
[ https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784260#comment-16784260 ] peay commented on SPARK-27019: -- Yes, I had edited my message above shortly after posting - cannot reproduce in the second scenario. Thanks! > Spark UI's SQL tab shows inconsistent values > > > Key: SPARK-27019 > URL: https://issues.apache.org/jira/browse/SPARK-27019 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.4.0 >Reporter: peay >Priority: Major > Attachments: Screenshot from 2019-03-01 21-31-48.png, > application_1550040445209_4748, query-1-details.png, query-1-list.png, > query-job-1.png, screenshot-spark-ui-details.png, screenshot-spark-ui-list.png > > > Since 2.4.0, I am frequently seeing broken outputs in the SQL tab of the > Spark UI, where submitted/duration make no sense, description has the ID > instead of the actual description. > Clicking on the link to open a query, the SQL plan is missing as well. > I have tried to increase `spark.scheduler.listenerbus.eventqueue.capacity` to > very large values like 30k out of paranoia that we may have too many events, > but to no avail. I have not identified anything particular that leads to > that: it doesn't occur in all my jobs, but it does occur in a lot of them > still.
[jira] [Resolved] (SPARK-26850) Make EventLoggingListener LOG_FILE_PERMISSIONS configurable
[ https://issues.apache.org/jira/browse/SPARK-26850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Katta resolved SPARK-26850. --- Resolution: Duplicate > Make EventLoggingListener LOG_FILE_PERMISSIONS configurable > --- > > Key: SPARK-26850 > URL: https://issues.apache.org/jira/browse/SPARK-26850 > Project: Spark > Issue Type: Wish > Components: Scheduler >Affects Versions: 2.2.3, 2.3.2, 2.4.0 >Reporter: Hua Zhang >Priority: Minor > > private[spark] object EventLoggingListener extends Logging { > ... > private val LOG_FILE_PERMISSIONS = new FsPermission(Integer.parseInt("770", > 8).toShort) > ... > } > > Currently the event log files are hard-coded with permission 770. > It would be fine if this permission is +configurable+. > User case: The spark application is submitted by user A but the spark history > server is started by user B. Currently user B cannot access the history event > files created by user A. When permission is set to 775, this will be possible. >
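The hard-coded value in the snippet above is Integer.parseInt("770", 8), i.e. octal 770 (rwxrwx---). A configurable variant would simply read the octal string from a setting before building the FsPermission; in the sketch below the key `spark.eventLog.filePermissions` is hypothetical, not an existing Spark option:

```scala
// Sketch only: resolve an event-log permission from configuration instead
// of a constant. "spark.eventLog.filePermissions" is an assumed key.
object LogPermissionSketch {
  val DefaultPermOctal = "770" // current hard-coded value: rwxrwx---

  // Returns the short value an FsPermission constructor expects,
  // e.g. "770" -> 504, "775" -> 509.
  def resolvePerm(conf: Map[String, String]): Short =
    Integer.parseInt(
      conf.getOrElse("spark.eventLog.filePermissions", DefaultPermOctal),
      8).toShort
}
```

With "775", the group and others gain read access, which matches the reporter's use case of a history server running as a different user than the submitting one.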
[jira] [Updated] (SPARK-27056) Remove `start-shuffle-service.sh`
[ https://issues.apache.org/jira/browse/SPARK-27056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-27056: Description: _start-shuffle-service.sh_ was only used by Mesos before _start-mesos-shuffle-service.sh_. Obviously, _start-mesos-shuffle-service.sh_ solves some problems, it is better than _start-shuffle-service.sh_. So now we should delete _start-shuffle-service.sh_ in case users use it. was: _start-shuffle-service.sh_ was only used by Mesos before _start-mesos-shuffle-service.sh_. Obviously, _start-mesos-shuffle-service.sh_ solves some problems, it is better than start-shuffle-service.sh. So now we should delete _start-shuffle-service.sh_ in case users use it. > Remove `start-shuffle-service.sh` > -- > > Key: SPARK-27056 > URL: https://issues.apache.org/jira/browse/SPARK-27056 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Minor > > _start-shuffle-service.sh_ was only used by Mesos before > _start-mesos-shuffle-service.sh_. > Obviously, _start-mesos-shuffle-service.sh_ solves some problems, it is > better than _start-shuffle-service.sh_. > So now we should delete _start-shuffle-service.sh_ in case users use it.
[jira] [Updated] (SPARK-26602) Insert into table fails after querying the UDF which is loaded with wrong hdfs path
[ https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chakravarthi updated SPARK-26602: - Attachment: beforeFixUdf.txt > Insert into table fails after querying the UDF which is loaded with wrong > hdfs path > --- > > Key: SPARK-26602 > URL: https://issues.apache.org/jira/browse/SPARK-26602 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Haripriya >Priority: Major > Attachments: beforeFixUdf.txt > > > In sql, > 1.Query the existing udf(say myFunc1) > 2. create and select the udf registered with incorrect path (say myFunc2) > 3.Now again query the existing udf in the same session - Wil throw exception > stating that couldn't read resource of myFunc2's path > 4.Even the basic operations like insert and select will fail giving the same > error > Result: > java.lang.RuntimeException: Failed to read external resource > hdfs:///tmp/hari_notexists1/two_udfs.jar > at > org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288) > at > org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149) > at > org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212) > at > 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696) > at > org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841) > at > org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112)
[jira] [Updated] (SPARK-27055) Update Structured Streaming documentation because of DSv2 changes
[ https://issues.apache.org/jira/browse/SPARK-27055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Somogyi updated SPARK-27055: -- Description: Since SPARK-26956 has been merged the Structured Streaming documentation has to be updated also to reflect the changes. https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes was:Since SPARK-26956 has been merged the Structured Streaming documentation has to be updated also to reflect the changes. > Update Structured Streaming documentation because of DSv2 changes > - > > Key: SPARK-27055 > URL: https://issues.apache.org/jira/browse/SPARK-27055 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Minor > > Since SPARK-26956 has been merged the Structured Streaming documentation has > to be updated also to reflect the changes. > https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes
[jira] [Created] (SPARK-27057) Common trait for limit exec operators
Maxim Gekk created SPARK-27057: -- Summary: Common trait for limit exec operators Key: SPARK-27057 URL: https://issues.apache.org/jira/browse/SPARK-27057 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Maxim Gekk Currently, CollectLimitExec, LocalLimitExec and GlobalLimitExec have only the UnaryExecNode trait in common, which makes it slightly inconvenient to distinguish those operators from others. This ticket aims to introduce a new common trait for all three operators.
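To make the proposal concrete, here is a hedged, self-contained sketch of what such a trait buys. The names below (BaseLimitExec and the simplified case classes) are illustrative stand-ins, not Spark's actual physical operators, which extend SparkPlan/UnaryExecNode:

```scala
// Illustrative model: a common trait for the three limit operators lets
// callers match on "any limit node" with a single case.
trait BaseLimitExec { def limit: Int }

// Simplified stand-ins for CollectLimitExec, LocalLimitExec, GlobalLimitExec.
case class CollectLimit(limit: Int) extends BaseLimitExec
case class LocalLimit(limit: Int) extends BaseLimitExec
case class GlobalLimit(limit: Int) extends BaseLimitExec

object LimitTraitSketch {
  // Without the trait, this match would need three nearly identical cases.
  def limitOf(plan: Any): Option[Int] = plan match {
    case l: BaseLimitExec => Some(l.limit)
    case _                => None
  }
}
```

The design choice is the usual one for marker traits in a plan tree: optimizer and UI code that only cares about "is this a limit, and what is its value" stops depending on the concrete operator list.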
[jira] [Created] (SPARK-27059) spark-submit on kubernetes cluster does not recognise k8s --master property
Andreas Adamides created SPARK-27059: Summary: spark-submit on kubernetes cluster does not recognise k8s --master property Key: SPARK-27059 URL: https://issues.apache.org/jira/browse/SPARK-27059 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 2.4.0, 2.3.3 Reporter: Andreas Adamides I have successfully installed a Kubernetes cluster and can verify this by: {{C:\windows\system32>kubectl cluster-info Kubernetes master is running at https://: KubeDNS is running at https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}} Then I try to run the SparkPi example with the Spark distribution I downloaded from [https://spark.apache.org/downloads.html]. (I tried versions 2.4.0 and 2.3.3) {{spark-submit --master k8s://https://: --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=2 --conf spark.kubernetes.container.image=gettyimages/spark c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}} I am getting this error: {{Error: Master must either be yarn or start with spark, mesos, local Run with --help for usage help or --verbose for debug output}} I also tried {{spark-submit --help}} to see what I can get regarding the *--master* property. This is what I get: {{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.}} According to the documentation [[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on running Spark workloads in Kubernetes, spark-submit does not even seem to recognise the k8s value for master. [ k8s is included in the possible Spark masters: [https://spark.apache.org/docs/latest/submitting-applications.html#master-urls] ]
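One plausible explanation for the symptom above (hedged, since the report alone cannot confirm it): both the error text and the --help output list only spark, mesos, yarn and local, which matches a pre-2.3 spark-submit; if an older Spark installation is first on PATH (or SPARK_HOME points at one), its launcher rejects k8s:// even though the downloaded 2.3.3/2.4.0 binaries accept it. A minimal, illustrative sketch (not Spark's actual launcher code) of the scheme check involved:

```scala
// Illustrative sketch of master-URL validation in a spark-submit launcher.
// A scheme list that predates Kubernetes support rejects "k8s://" with the
// "Master must either be yarn or start with spark, mesos, local" style error.
object MasterUrlSketch {
  val Spark22Schemes = Seq("spark://", "mesos://", "yarn", "local")
  val Spark24Schemes = Seq("spark://", "mesos://", "k8s://", "yarn", "local")

  // A master is accepted if it equals or starts with a known scheme,
  // e.g. "local[2]" or "k8s://https://host:6443".
  def accepts(schemes: Seq[String], master: String): Boolean =
    schemes.exists(s => master == s || master.startsWith(s))
}
```

Checking `spark-submit --version` (and which spark-submit the shell resolves) against the freshly downloaded 2.4.0 directory would confirm or rule this explanation out.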
[jira] [Comment Edited] (SPARK-26602) Insert into table fails after querying the UDF which is loaded with wrong hdfs path
[ https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784134#comment-16784134 ] Chakravarthi edited comment on SPARK-26602 at 3/5/19 2:31 PM: -- Hi [~srowen], this issue is not a duplicate of SPARK-26560. Here the issue is: insert into table fails after querying a UDF which is loaded with a wrong hdfs path. Below are the steps to reproduce this issue: 1) Create a table. sql("create table table1(I int)"); 2) Create a UDF using an invalid hdfs path. sql("CREATE FUNCTION before_fix AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFLastDayTest' USING JAR 'hdfs:///tmp/notexist.jar'") 3) Do a select on the UDF and you will get an exception, "Failed to read external resource". sql("select before_fix('2018-03-09')"). 4) Perform an insert into the table, or a select on any table. It will fail. sql("insert into table1 values(1)").show sql("select * from table1").show Here, insert should work, but it fails. was (Author: chakravarthi): Hi [~srowen] , this issue is not duplicate of SPARK-26560. Here the issue is,Insert into table fails after querying the UDF which is loaded with wrong hdfs path. Below are the steps to reproduce this issue: 1) create a table. sql("create table table1(I int)"); 2) create udf using invalid hdfs path. sql("CREATE FUNCTION before_fix AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFLastDayTest' USING JAR 'hdfs:///tmp/notexist.jar'") 3) Do select on the UDF and you will get exception as "Failed to read external resource". sql(" select before_fix('2018-03-09')"). 4) perform insert table. sql("insert into table1 values(1)").show Here ,insert should work.but is fails. 
> Insert into table fails after querying the UDF which is loaded with wrong > hdfs path > --- > > Key: SPARK-26602 > URL: https://issues.apache.org/jira/browse/SPARK-26602 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Haripriya >Priority: Major > Attachments: beforeFixUdf.txt > > > In sql, > 1.Query the existing udf(say myFunc1) > 2. create and select the udf registered with incorrect path (say myFunc2) > 3.Now again query the existing udf in the same session - Wil throw exception > stating that couldn't read resource of myFunc2's path > 4.Even the basic operations like insert and select will fail giving the same > error > Result: > java.lang.RuntimeException: Failed to read external resource > hdfs:///tmp/hari_notexists1/two_udfs.jar > at > org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288) > at > org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149) > at > org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706) > at > 
org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696) > at > org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841) > at > org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27060) DDL Commands are accepting Keywords like create, drop as tableName
[ https://issues.apache.org/jira/browse/SPARK-27060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784503#comment-16784503 ] Sujith Chacko commented on SPARK-27060: --- This looks like a compatibility issue with other engines. Will try to handle these cases. cc [~sro...@scient.com] [cloud-fan|https://github.com/apache/spark/issues?q=is%3Apr+is%3Aopen+author%3Acloud-fan] let us know for any suggestions. Thanks > DDL Commands are accepting Keywords like create, drop as tableName > -- > > Key: SPARK-27060 > URL: https://issues.apache.org/jira/browse/SPARK-27060 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.2, 2.4.0 >Reporter: Sachin Ramachandra Setty >Priority: Major > Fix For: 2.3.2, 2.4.0 > > > Seems to be a compatibility issue compared to other components such as hive > and mySql. > DDL commands are successful even though the tableName is same as keyword. > Tested with columnNames as well and issue exists. > Whereas, Hive-Beeline is throwing ParseException and not accepting keywords > as tableName or columnName and mySql is accepting keywords only as columnName. 
> Spark-Behaviour : > Connected to: Spark SQL (version 2.3.2.0101) > CLI_DBMS_APPID > Beeline version 1.2.1.spark_2.3.2.0101 by Apache Hive > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table create(id int); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.255 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table drop(int int); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.257 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table drop; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.236 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table create; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.168 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table tab1(float float); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.111 seconds) > 0: jdbc:hive2://10.18.XXX:23040/default> create table double(double float); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.093 seconds) > Hive-Behaviour : > Connected to: Apache Hive (version 3.1.0) > Driver: Hive JDBC (version 3.1.0) > Transaction isolation: TRANSACTION_REPEATABLE_READ > Beeline version 3.1.0 by Apache Hive > 0: jdbc:hive2://10.18.XXX:21066/> create table create(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:13 > cannot recognize input near 'create' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18.XXX:21066/> create table drop(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:13 > cannot recognize input near 'drop' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18XXX:21066/> create table tab1(float float); > Error: Error while compiling statement: FAILED: ParseException line 1:18 > cannot recognize input near 'float' 'float' ')' in column name or constraint > (state=42000,code=4) > 0: jdbc:hive2://10.18XXX:21066/> drop table create(id int); > Error: Error while compiling 
statement: FAILED: ParseException line 1:11 > cannot recognize input near 'create' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18.XXX:21066/> drop table drop(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:11 > cannot recognize input near 'drop' '(' 'id' in table name > (state=42000,code=4) > mySql : > CREATE TABLE CREATE(ID integer); > Error: near "CREATE": syntax error > CREATE TABLE DROP(ID integer); > Error: near "DROP": syntax error > CREATE TABLE TAB1(FLOAT FLOAT); > Success -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
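The reserved-keyword behaviour compared above is easy to reproduce outside Spark. A minimal sketch using Python's built-in sqlite3 module (the engine choice here is ours, purely for illustration; it is not one of the engines tested in the report): an unquoted reserved keyword is rejected as a table name, a non-reserved type name is accepted as a column name, and quoting makes even reserved keywords legal identifiers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Unquoted reserved keyword as a table name: rejected by the parser.
try:
    conn.execute("CREATE TABLE create(id int)")
    print("accepted")
except sqlite3.OperationalError as e:
    print("rejected:", e)

# FLOAT is not a reserved word in SQLite, so it works as a column name.
conn.execute("CREATE TABLE tab1(float float)")
print("tab1 created")

# Quoting turns a reserved keyword into an ordinary identifier.
conn.execute('CREATE TABLE "create"(id int)')
print('"create" created via quoting')
```

This mirrors the behaviour quoted above: the question for Spark is whether its parser should likewise reject bare keywords unless they are quoted/escaped.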
[jira] [Commented] (SPARK-27058) Support mounting host dirs for K8s tests
[ https://issues.apache.org/jira/browse/SPARK-27058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784396#comment-16784396 ] Stavros Kontopoulos commented on SPARK-27058: - [~shaneknapp] added this to keep track of things, as you requested. > Support mounting host dirs for K8s tests > > > Key: SPARK-27058 > URL: https://issues.apache.org/jira/browse/SPARK-27058 > Project: Spark > Issue Type: Improvement > Components: jenkins >Affects Versions: 3.0.0 >Reporter: Stavros Kontopoulos >Priority: Major > > According to the discussion > [here|https://github.com/apache/spark/pull/23514], supporting PVs tests > requires mounting a host dir. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27058) Support mounting host dirs for K8s tests
Stavros Kontopoulos created SPARK-27058: --- Summary: Support mounting host dirs for K8s tests Key: SPARK-27058 URL: https://issues.apache.org/jira/browse/SPARK-27058 Project: Spark Issue Type: Improvement Components: jenkins Affects Versions: 3.0.0 Reporter: Stavros Kontopoulos According to the discussion [here|https://github.com/apache/spark/pull/23514], supporting PVs tests requires mounting a host dir. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27059) spark-submit on kubernetes cluster does not recognise k8s --master property
[ https://issues.apache.org/jira/browse/SPARK-27059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andreas Adamides updated SPARK-27059: - Description: I have successfully installed a Kubernetes cluster and can verify this by: {{C:\windows\system32>kubectl cluster-info }} {{*Kubernetes master is running at https://:* }} {{ *{{KubeDNS is running at https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}*}} Trying to run the SparkPi with the Spark I downloaded from [https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3) *{{spark-submit --master k8s://https://: --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=2 --conf spark.kubernetes.container.image=gettyimages/spark c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}* I am getting this error: *{{Error: Master must either be yarn or start with spark, mesos, local Run with --help for usage help or --verbose for debug output}}* I also tried: *{{spark-submit --help}}* to see what I can get regarding the *--master* property. This is what I get: *{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.}}* According to the documentation [[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on running Spark workloads in Kubernetes, spark-submit does not even seem to recognise the k8s value for master. 
[ included in possible Spark masters: [https://spark.apache.org/docs/latest/submitting-applications.html#master-urls] ] was: I have successfully installed a Kubernetes cluster and can verify this by: {{C:\windows\system32>kubectl cluster-info }} {{Kubernetes master is running at https://: }} {{KubeDNS is running at https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}} Trying to run the SparkPi with the Spark I downloaded from [https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3) {{spark-submit --master k8s://https://: --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=2 --conf spark.kubernetes.container.image=gettyimages/spark c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}} I am getting this error: {{Error: Master must either be yarn or start with spark, mesos, local Run with --help for usage help or --verbose for debug output}} I also tried: {{spark-submit --help}} to see what I can get regarding the *--master* property. This is what I get: {{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.}} According to the documentation [[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on running Spark workloads in Kubernetes, spark-submit does not even seem to recognise the k8s value for master. 
[ included in possible Spark masters: [https://spark.apache.org/docs/latest/submitting-applications.html#master-urls] ] > spark-submit on kubernetes cluster does not recognise k8s --master property > --- > > Key: SPARK-27059 > URL: https://issues.apache.org/jira/browse/SPARK-27059 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.3, 2.4.0 >Reporter: Andreas Adamides >Priority: Blocker > > I have successfully installed a Kubernetes cluster and can verify this by: > {{C:\windows\system32>kubectl cluster-info }} > {{*Kubernetes master is running at https://:* }} > {{ *{{KubeDNS is running at > https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}*}} > Trying to run the SparkPi with the Spark I downloaded from > [https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3) > *{{spark-submit --master k8s://https://: --deploy-mode cluster > --name spark-pi --class org.apache.spark.examples.SparkPi --conf > spark.executor.instances=2 --conf > spark.kubernetes.container.image=gettyimages/spark > c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}* > > I am getting this error: > *{{Error: Master must either be yarn or start with spark, mesos, local Run > with --help for usage help or --verbose for debug output}}* > I also tried: > *{{spark-submit --help}}* > to see what I can get regarding the *--master* property. This is what I get: > *{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or > local.}}* > > According to the documentation > [[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on > running Spark workloads in Kubernetes, spark-submit does not even seem to > recognise the k8s value for master. [ included in possible Spark masters: >
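For reference, the documented master-URL form for Kubernetes (supported only in Spark builds with Kubernetes support, i.e. official 2.3+ distributions) is sketched below; the host, port, and jar path are placeholders, and the container image is the one from the reporter's own command:

```shell
# Hypothetical API-server host/port; the scheme is k8s:// followed by the
# full https URL of the Kubernetes API server.
K8S_MASTER="k8s://https://kubernetes.example.com:6443"

./bin/spark-submit \
  --master "$K8S_MASTER" \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=gettyimages/spark \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar
```

If spark-submit still reports "Master must either be yarn or start with spark, mesos, local", the binary in use likely predates 2.3 or was built without Kubernetes support, which would explain the symptom reported here.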
[jira] [Commented] (SPARK-26998) spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor processes in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-26998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784434#comment-16784434 ] Jungtaek Lim commented on SPARK-26998: -- If I understand correctly, the PR would mitigate the issue (removing some of the unnecessary password parameters being passed) but not completely solve it, since the truststore password parameter will still be passed as before. To handle the issue properly we would need secured storage for sharing the security information. > spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor > processes in Standalone mode > --- > > Key: SPARK-26998 > URL: https://issues.apache.org/jira/browse/SPARK-26998 > Project: Spark > Issue Type: Bug > Components: Scheduler, Security, Spark Core >Affects Versions: 2.3.3, 2.4.0 >Reporter: t oo >Priority: Major > Labels: SECURITY, Security, secur, security, security-issue > > Run spark standalone mode, then start a spark-submit requiring at least 1 > executor. Do a 'ps -ef' on linux (ie putty terminal) and you will be able to > see spark.ssl.keyStorePassword value in plaintext! > > spark.ssl.keyStorePassword and spark.ssl.keyPassword don't need to be passed > to CoarseGrainedExecutorBackend. Only spark.ssl.trustStorePassword is used. > > Can be resolved if below PR is merged: > [[Github] Pull Request #21514 > (tooptoop4)|https://github.com/apache/spark/pull/21514]
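The exposure described in this issue is a general property of process arguments on Linux: anything placed on a command line is readable by other local users via /proc/&lt;pid&gt;/cmdline, which is exactly what `ps -ef` prints. A minimal sketch (the flag mimics Spark's property syntax, but the child is a plain Python sleeper, not an executor):

```python
import subprocess
import sys
import time

# Spawn a child whose argv carries a fake secret, analogous to a Standalone
# worker launching CoarseGrainedExecutorBackend with SSL passwords in argv.
child = subprocess.Popen([
    sys.executable, "-c", "import time; time.sleep(30)",
    "--conf", "spark.ssl.keyStorePassword=hunter2",  # fake secret
])
try:
    time.sleep(0.3)  # give the child a moment to start
    # /proc/<pid>/cmdline is world-readable, just like the `ps -ef` output.
    with open(f"/proc/{child.pid}/cmdline", "rb") as f:
        argv = f.read().split(b"\0")
    print([a.decode() for a in argv if a])
    assert any(b"hunter2" in a for a in argv)
finally:
    child.terminate()
```

This is why removing the passwords from the executor command line (rather than merely obfuscating them) is the only real fix short of a secured credential store.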
[jira] [Commented] (SPARK-26602) Insert into table fails after querying the UDF which is loaded with wrong hdfs path
[ https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784446#comment-16784446 ] Sean Owen commented on SPARK-26602: --- That sounds like user error. I'd close this as NotAProblem. It will cause a less-clear error later anyway > Insert into table fails after querying the UDF which is loaded with wrong > hdfs path > --- > > Key: SPARK-26602 > URL: https://issues.apache.org/jira/browse/SPARK-26602 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Haripriya >Priority: Major > Attachments: beforeFixUdf.txt > > > In sql, > 1.Query the existing udf(say myFunc1) > 2. create and select the udf registered with incorrect path (say myFunc2) > 3.Now again query the existing udf in the same session - Wil throw exception > stating that couldn't read resource of myFunc2's path > 4.Even the basic operations like insert and select will fail giving the same > error > Result: > java.lang.RuntimeException: Failed to read external resource > hdfs:///tmp/hari_notexists1/two_udfs.jar > at > org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288) > at > org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149) > at > org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213) > at > 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696) > at > org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841) > at > org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23986) CompileException when using too many avg aggregation after joining
[ https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784482#comment-16784482 ] Pedro Fernandes commented on SPARK-23986: - Guys, Is there a workaround for the folks that can't upgrade Spark version? Thanks. > CompileException when using too many avg aggregation after joining > -- > > Key: SPARK-23986 > URL: https://issues.apache.org/jira/browse/SPARK-23986 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Michel Davit >Assignee: Marco Gaido >Priority: Major > Fix For: 2.3.1, 2.4.0 > > Attachments: spark-generated.java > > > Considering the following code: > {code:java} > val df1: DataFrame = sparkSession.sparkContext > .makeRDD(Seq((0, 1, 2, 3, 4, 5, 6))) > .toDF("key", "col1", "col2", "col3", "col4", "col5", "col6") > val df2: DataFrame = sparkSession.sparkContext > .makeRDD(Seq((0, "val1", "val2"))) > .toDF("key", "dummy1", "dummy2") > val agg = df1 > .join(df2, df1("key") === df2("key"), "leftouter") > .groupBy(df1("key")) > .agg( > avg("col2").as("avg2"), > avg("col3").as("avg3"), > avg("col4").as("avg4"), > avg("col1").as("avg1"), > avg("col5").as("avg5"), > avg("col6").as("avg6") > ) > val head = agg.take(1) > {code} > This logs the following exception: > {code:java} > ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 467, Column 28: Redefinition of parameter "agg_expr_11" > {code} > I am not a spark expert but after investigation, I realized that the > generated {{doConsume}} method is responsible of the exception. > Indeed, {{avg}} calls several times > {{org.apache.spark.sql.execution.CodegenSupport.constructDoConsumeFunction}}. > The 1st time with the 'avg' Expr and a second time for the base aggregation > Expr (count and sum). 
> The problem comes from the generation of parameters in CodeGenerator: > {code:java} > /** >* Returns a term name that is unique within this instance of a > `CodegenContext`. >*/ > def freshName(name: String): String = synchronized { > val fullName = if (freshNamePrefix == "") { > name > } else { > s"${freshNamePrefix}_$name" > } > if (freshNameIds.contains(fullName)) { > val id = freshNameIds(fullName) > freshNameIds(fullName) = id + 1 > s"$fullName$id" > } else { > freshNameIds += fullName -> 1 > fullName > } > } > {code} > The {{freshNameIds}} already contains {{agg_expr_[1..6]}} from the 1st call. > The second call is made with {{agg_expr_[1..12]}} and generates the > following names: > {{agg_expr_[11|21|31|41|51|61|11|12]}}. We then have a parameter name > conflict in the generated code: {{agg_expr_11}}. > Appending the 'id' in s"$fullName$id" to generate a unique term name is the > source of the conflict. Maybe simply using an underscore can solve this issue: > s"${fullName}_$id"
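The collision the reporter describes can be reproduced with a toy re-implementation of the suffixing scheme quoted above (a Python sketch of the logic, not Spark's actual Scala code): appending a bare counter lets a suffixed "agg_expr_1" collide with a brand-new "agg_expr_11".

```python
def fresh_name(name, ids):
    # Toy version of CodegenContext.freshName: suffix a per-name counter.
    if name in ids:
        i = ids[name]
        ids[name] = i + 1
        return f"{name}{i}"  # bare suffix: the source of the clash
    ids[name] = 1
    return name

ids = {}
a = fresh_name("agg_expr_1", ids)   # "agg_expr_1"
b = fresh_name("agg_expr_1", ids)   # "agg_expr_11"
c = fresh_name("agg_expr_11", ids)  # "agg_expr_11" -- collides with b
print(a, b, c)
assert b == c  # demonstrates the reported conflict

def fresh_name_fixed(name, ids):
    # With an underscore separator (the fix suggested above), a suffixed
    # name can no longer collide with a plain name ending in digits.
    if name in ids:
        i = ids[name]
        ids[name] = i + 1
        return f"{name}_{i}"
    ids[name] = 1
    return name

ids2 = {}
assert fresh_name_fixed("agg_expr_1", ids2) == "agg_expr_1"
assert fresh_name_fixed("agg_expr_1", ids2) == "agg_expr_1_1"
assert fresh_name_fixed("agg_expr_11", ids2) == "agg_expr_11"  # distinct now
```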
[jira] [Commented] (SPARK-26602) Insert into table fails after querying the UDF which is loaded with wrong hdfs path
[ https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784506#comment-16784506 ] Ajith S commented on SPARK-26602: - [~chakravarthi] Hi, thanks for reporting the issue, From your example it looks like a missing jar will cause any subsequent sqls (*sqls which do not refer to UDF*) also to fail in this session. Right.? cc [~srowen] > Insert into table fails after querying the UDF which is loaded with wrong > hdfs path > --- > > Key: SPARK-26602 > URL: https://issues.apache.org/jira/browse/SPARK-26602 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Haripriya >Priority: Major > Attachments: beforeFixUdf.txt > > > In sql, > 1.Query the existing udf(say myFunc1) > 2. create and select the udf registered with incorrect path (say myFunc2) > 3.Now again query the existing udf in the same session - Wil throw exception > stating that couldn't read resource of myFunc2's path > 4.Even the basic operations like insert and select will fail giving the same > error > Result: > java.lang.RuntimeException: Failed to read external resource > hdfs:///tmp/hari_notexists1/two_udfs.jar > at > org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288) > at > org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149) > at > org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275) > at > 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696) > at > org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841) > at > org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26923) Refactor ArrowRRunner and RRunner to share the same base
[ https://issues.apache.org/jira/browse/SPARK-26923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26923: - Summary: Refactor ArrowRRunner and RRunner to share the same base (was: Refactor ArrowRRunner and RRunner to deduplicate codes) > Refactor ArrowRRunner and RRunner to share the same base > > > Key: SPARK-26923 > URL: https://issues.apache.org/jira/browse/SPARK-26923 > Project: Spark > Issue Type: Sub-task > Components: SparkR, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > ArrowRRunner and RRunner already have duplicated code. We should refactor and > deduplicate them. Also, ArrowRRunner happens to have some rather hacky code > (see > https://github.com/apache/spark/pull/23787/files#diff-a0b6a11cc2e2299455c795fe3c96b823R61 > ). > We might even be able to deduplicate some code with the PythonRunners.
[jira] [Comment Edited] (SPARK-26998) spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor processes in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-26998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784391#comment-16784391 ] Gabor Somogyi edited comment on SPARK-26998 at 3/5/19 12:51 PM: [~toopt4] thanks for the info. Are you working on this? If not I'm happy to push the solution forward. was (Author: gsomogyi): [~toopt4] thanks for the info. Are you working on this? If not happy to pushing the solution forward. > spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor > processes in Standalone mode > --- > > Key: SPARK-26998 > URL: https://issues.apache.org/jira/browse/SPARK-26998 > Project: Spark > Issue Type: Bug > Components: Scheduler, Security, Spark Core >Affects Versions: 2.3.3, 2.4.0 >Reporter: t oo >Priority: Major > Labels: SECURITY, Security, secur, security, security-issue > > Run spark standalone mode, then start a spark-submit requiring at least 1 > executor. Do a 'ps -ef' on linux (ie putty terminal) and you will be able to > see spark.ssl.keyStorePassword value in plaintext! > > spark.ssl.keyStorePassword and spark.ssl.keyPassword don't need to be passed > to CoarseGrainedExecutorBackend. Only spark.ssl.trustStorePassword is used. > > Can be resolved if below PR is merged: > [[Github] Pull Request #21514 > (tooptoop4)|https://github.com/apache/spark/pull/21514] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27057) Common trait for limit exec operators
[ https://issues.apache.org/jira/browse/SPARK-27057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27057: Assignee: Apache Spark > Common trait for limit exec operators > - > > Key: SPARK-27057 > URL: https://issues.apache.org/jira/browse/SPARK-27057 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Trivial > > Currently, CollectLimitExec, LocalLimitExec and GlobalLimitExec have the > UnaryExecNode trait as the common trait. It is slightly inconvenient to > distinguish those operators from others. The ticket aims to introduce new > trait for all 3 operators. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27057) Common trait for limit exec operators
[ https://issues.apache.org/jira/browse/SPARK-27057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27057: Assignee: (was: Apache Spark) > Common trait for limit exec operators > - > > Key: SPARK-27057 > URL: https://issues.apache.org/jira/browse/SPARK-27057 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Trivial > > Currently, CollectLimitExec, LocalLimitExec and GlobalLimitExec have the > UnaryExecNode trait as the common trait. It is slightly inconvenient to > distinguish those operators from others. The ticket aims to introduce new > trait for all 3 operators. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26923) Refactor ArrowRRunner and RRunner to share the same base
[ https://issues.apache.org/jira/browse/SPARK-26923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26923: Assignee: Apache Spark > Refactor ArrowRRunner and RRunner to share the same base > > > Key: SPARK-26923 > URL: https://issues.apache.org/jira/browse/SPARK-26923 > Project: Spark > Issue Type: Sub-task > Components: SparkR, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > ArrowRRunner and RRunner has already duplicated codes. We should refactor and > deduplicate them. Also, ArrowRRunner happened to have a rather hacky code > (see > https://github.com/apache/spark/pull/23787/files#diff-a0b6a11cc2e2299455c795fe3c96b823R61 > ). > We might even be able to deduplicate some codes with PythonRunners. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20415) SPARK job hangs while writing DataFrame to HDFS
[ https://issues.apache.org/jira/browse/SPARK-20415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784475#comment-16784475 ] Martin Studer commented on SPARK-20415: --- In fact, I believe the underlying issue is actually SPARK-21657. To be confirmed. > SPARK job hangs while writing DataFrame to HDFS > --- > > Key: SPARK-20415 > URL: https://issues.apache.org/jira/browse/SPARK-20415 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 2.1.0 > Environment: EMR 5.4.0 >Reporter: P K >Priority: Major > > We are in POC phase with Spark. One of the Steps is reading compressed json > files that come from sources, "explode" them into tabular format and then > write them to HDFS. This worked for about three weeks until a few days ago, > for a particular dataset, the writer just hangs. I logged in to the worker > machines and see this stack trace: > "Executor task launch worker-0" #39 daemon prio=5 os_prio=0 > tid=0x7f6210352800 nid=0x4542 runnable [0x7f61f52b3000] >java.lang.Thread.State: RUNNABLE > at org.apache.spark.unsafe.Platform.copyMemory(Platform.java:210) > at > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.writeToMemory(UnsafeArrayData.java:311) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply6_2$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:111) > at > org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:109) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) > at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > The last messages ever printed in stderr before the hang are: > 17/04/18 01:41:14 INFO DAGScheduler: Final stage: ResultStage 4 (save at > NativeMethodAccessorImpl.java:0) > 17/04/18 01:41:14 INFO DAGScheduler: Parents of final stage: List() > 17/04/18 01:41:14 INFO DAGScheduler: Missing parents: List() > 17/04/18 01:41:14 INFO DAGScheduler: Submitting ResultStage 4 > (MapPartitionsRDD[31] at save at NativeMethodAccessorImpl.java:0), which has > no missing parents > 17/04/18 01:41:14 INFO MemoryStore: Block broadcast_9 stored as values in > memory (estimated size 170.5 KB,
[jira] [Comment Edited] (SPARK-27060) DDL Commands are accepting Keywords like create, drop as tableName
[ https://issues.apache.org/jira/browse/SPARK-27060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784503#comment-16784503 ] Sujith Chacko edited comment on SPARK-27060 at 3/5/19 3:11 PM: --- This looks like a compatibility issue with other engines. Will try to handle these cases. cc [~sro...@scient.com] [cloud-fan|https://github.com/apache/spark/issues?q=is%3Apr+is%3Aopen+author%3Acloud-fan] [~sro...@scient.com] [~sro...@yahoo.com] let us know for any suggestions. Thanks was (Author: s71955): This looks like a compatibility issue with other engines. Will try to handle this cases. cc [~sro...@scient.com] [cloud-fan|https://github.com/apache/spark/issues?q=is%3Apr+is%3Aopen+author%3Acloud-fan] let us know for any suggestions. Thanks > DDL Commands are accepting Keywords like create, drop as tableName > -- > > Key: SPARK-27060 > URL: https://issues.apache.org/jira/browse/SPARK-27060 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.2, 2.4.0 >Reporter: Sachin Ramachandra Setty >Priority: Major > Fix For: 2.3.2, 2.4.0 > > > Seems to be a compatibility issue compared to other components such as hive > and mySql. > DDL commands are successful even though the tableName is same as keyword. > Tested with columnNames as well and issue exists. > Whereas, Hive-Beeline is throwing ParseException and not accepting keywords > as tableName or columnName and mySql is accepting keywords only as columnName. 
> Spark-Behaviour : > Connected to: Spark SQL (version 2.3.2.0101) > CLI_DBMS_APPID > Beeline version 1.2.1.spark_2.3.2.0101 by Apache Hive > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table create(id int); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.255 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table drop(int int); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.257 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table drop; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.236 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table create; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.168 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table tab1(float float); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.111 seconds) > 0: jdbc:hive2://10.18.XXX:23040/default> create table double(double float); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.093 seconds) > Hive-Behaviour : > Connected to: Apache Hive (version 3.1.0) > Driver: Hive JDBC (version 3.1.0) > Transaction isolation: TRANSACTION_REPEATABLE_READ > Beeline version 3.1.0 by Apache Hive > 0: jdbc:hive2://10.18.XXX:21066/> create table create(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:13 > cannot recognize input near 'create' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18.XXX:21066/> create table drop(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:13 > cannot recognize input near 'drop' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18XXX:21066/> create table tab1(float float); > Error: Error while compiling statement: FAILED: ParseException line 1:18 > cannot recognize input near 'float' 'float' ')' in column name or constraint > (state=42000,code=4) > 0: jdbc:hive2://10.18XXX:21066/> drop table create(id int); > Error: Error while compiling 
statement: FAILED: ParseException line 1:11 > cannot recognize input near 'create' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18.XXX:21066/> drop table drop(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:11 > cannot recognize input near 'drop' '(' 'id' in table name > (state=42000,code=4) > mySql : > CREATE TABLE CREATE(ID integer); > Error: near "CREATE": syntax error > CREATE TABLE DROP(ID integer); > Error: near "DROP": syntax error > CREATE TABLE TAB1(FLOAT FLOAT); > Success -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
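The "mySql" transcript above actually prints SQLite-style error text (`Error: near "CREATE": syntax error`), and the contrast is easy to reproduce outside Spark. A minimal illustrative sketch using Python's built-in sqlite3 module (not the Spark SQL parser) shows a reserved keyword rejected as a table name but a keyword-like token accepted as a column name, matching the quoted behavior:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A reserved keyword as a bare table name is rejected at parse time.
try:
    conn.execute("CREATE TABLE CREATE(ID integer)")
    keyword_table_accepted = True
except sqlite3.OperationalError:  # near "CREATE": syntax error
    keyword_table_accepted = False

# A keyword-like token as a column name is accepted (FLOAT is not a
# reserved word here), mirroring the 'Success' case quoted above.
conn.execute("CREATE TABLE TAB1(FLOAT FLOAT)")
```

Spark's parser, by contrast, treats most keywords as non-reserved identifiers, which is why `create table create(id int)` succeeds in the Spark-Beeline session above.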
[jira] [Updated] (SPARK-27059) spark-submit on kubernetes cluster does not recognise k8s --master property
[ https://issues.apache.org/jira/browse/SPARK-27059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andreas Adamides updated SPARK-27059: - Description: I have successfully installed a Kubernetes cluster and can verify this by: {{C:\windows\system32>kubectl cluster-info }} {{Kubernetes master is running at https://: }} {{KubeDNS is running at https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}} Trying to run the SparkPi with the Spark I downloaded from [https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3) {{spark-submit --master k8s://https://: --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=2 --conf spark.kubernetes.container.image=gettyimages/spark c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}} I am getting this error: {{Error: Master must either be yarn or start with spark, mesos, local Run with --help for usage help or --verbose for debug output}} I also tried: {{spark-submit --help}} to see what I can get regarding the *--master* property. This is what I get: {{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.}} According to the documentation [[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on running Spark workloads in Kubernetes, spark-submit does not even seem to recognise the k8s value for master. 
[ included in possible Spark masters: [https://spark.apache.org/docs/latest/submitting-applications.html#master-urls] ] was: I have successfully installed a Kubernetes cluster and can verify this by: {{C:\windows\system32>kubectl cluster-info Kubernetes master is running at https://: KubeDNS is running at https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}} Then I am trying to run the SparkPi with the Spark I downloaded from [https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3) {{spark-submit --master k8s://https://: --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=2 --conf spark.kubernetes.container.image=gettyimages/spark c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}} I am getting this error: {{Error: Master must either be yarn or start with spark, mesos, local Run with --help for usage help or --verbose for debug output}} I also tried: {{spark-submit --help}} to see what I can get regarding the *--master* property. This is what I get: {{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.}} According to the documentation [[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on running Spark workloads in Kubernetes, spark-submit does not even seem to recognise the k8s value for master. 
[ included in possible Spark masters: [https://spark.apache.org/docs/latest/submitting-applications.html#master-urls] ] > spark-submit on kubernetes cluster does not recognise k8s --master property > --- > > Key: SPARK-27059 > URL: https://issues.apache.org/jira/browse/SPARK-27059 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.3, 2.4.0 >Reporter: Andreas Adamides >Priority: Blocker > > I have successfully installed a Kubernetes cluster and can verify this by: > {{C:\windows\system32>kubectl cluster-info }} > {{Kubernetes master is running at https://: }} > {{KubeDNS is running at > https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}} > > Trying to run the SparkPi with the Spark I downloaded from > [https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3) > {{spark-submit --master k8s://https://: --deploy-mode cluster > --name spark-pi --class org.apache.spark.examples.SparkPi --conf > spark.executor.instances=2 --conf > spark.kubernetes.container.image=gettyimages/spark > c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}} > > I am getting this error: > > {{Error: Master must either be yarn or start with spark, mesos, local Run > with --help for usage help or --verbose for debug output}} > > I also tried: > > {{spark-submit --help}} > > to see what I can get regarding the *--master* property. This is what I get: > > {{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.}} > > According to the documentation > [[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on > running Spark workloads in Kubernetes, spark-submit does not even seem to > recognise the k8s value for master. [ included in possible Spark masters: >
[jira] [Updated] (SPARK-27059) spark-submit on kubernetes cluster does not recognise k8s --master property
[ https://issues.apache.org/jira/browse/SPARK-27059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andreas Adamides updated SPARK-27059: - Description: I have successfully installed a Kubernetes cluster and can verify this by: {{C:\windows\system32>kubectl cluster-info }} {{*Kubernetes master is running at https://:* }} *{{KubeDNS is running at https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}* Trying to run the SparkPi with the Spark I downloaded from [https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3) *{{spark-submit --master k8s://https://: --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=2 --conf spark.kubernetes.container.image=gettyimages/spark c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}* I am getting this error: *{{Error: Master must either be yarn or start with spark, mesos, local Run with --help for usage help or --verbose for debug output}}* I also tried: *{{spark-submit --help}}* to see what I can get regarding the *--master* property. This is what I get: *{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.}}* According to the documentation [[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on running Spark workloads in Kubernetes, spark-submit does not even seem to recognise the k8s value for master. 
[ included in possible Spark masters: [https://spark.apache.org/docs/latest/submitting-applications.html#master-urls] ] was: I have successfully installed a Kubernetes cluster and can verify this by: {{C:\windows\system32>kubectl cluster-info }} {{*Kubernetes master is running at https://:* }} *{{KubeDNS is running at https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}* Trying to run the SparkPi with the Spark I downloaded from [https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3) *{{spark-submit --master k8s://https://: --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=2 --conf spark.kubernetes.container.image=gettyimages/spark c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}* I am getting this error: *{{Error: Master must either be yarn or start with spark, mesos, local Run with --help for usage help or --verbose for debug output}}* I also tried: *{{spark-submit --help}}* to see what I can get regarding the *--master* property. This is what I get: *{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.}}* According to the documentation [[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on running Spark workloads in Kubernetes, spark-submit does not even seem to recognise the k8s value for master. 
[ included in possible Spark masters: [https://spark.apache.org/docs/latest/submitting-applications.html#master-urls] ] > spark-submit on kubernetes cluster does not recognise k8s --master property > --- > > Key: SPARK-27059 > URL: https://issues.apache.org/jira/browse/SPARK-27059 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.3, 2.4.0 >Reporter: Andreas Adamides >Priority: Blocker > > I have successfully installed a Kubernetes cluster and can verify this by: > {{C:\windows\system32>kubectl cluster-info }} > {{*Kubernetes master is running at https://:* }} > *{{KubeDNS is running at > https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}* > Trying to run the SparkPi with the Spark I downloaded from > [https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3) > *{{spark-submit --master k8s://https://: --deploy-mode cluster > --name spark-pi --class org.apache.spark.examples.SparkPi --conf > spark.executor.instances=2 --conf > spark.kubernetes.container.image=gettyimages/spark > c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}* > I am getting this error: > *{{Error: Master must either be yarn or start with spark, mesos, local Run > with --help for usage help or --verbose for debug output}}* > I also tried: > *{{spark-submit --help}}* > to see what I can get regarding the *--master* property. This is what I get: > *{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or > local.}}* > > According to the documentation > [[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on > running Spark workloads in Kubernetes, spark-submit does not even seem to > recognise the k8s value for master. [ included in possible Spark masters: >
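The error text listing only yarn, spark, mesos and local is what launchers built before Kubernetes support (added in Spark 2.3.0) print, so it is worth checking with `spark-submit --version` whether an older spark-submit happens to be first on the PATH. Purely as an illustration (this is not Spark's actual code; its real validation lives in SparkSubmit), the scheme check amounts to a whitelist like:

```python
# Hypothetical sketch of a --master scheme whitelist; names here are
# illustrative, not Spark's internals.
NEW_SCHEMES = ("spark://", "mesos://", "k8s://", "yarn", "local")
OLD_SCHEMES = ("spark://", "mesos://", "yarn", "local")  # pre-2.3 launchers

def master_recognised(url, schemes=NEW_SCHEMES):
    # A launcher only accepts masters whose scheme it knows about.
    return any(url.startswith(s) for s in schemes)

master = "k8s://https://192.168.0.1:6443"
assert master_recognised(master)                    # 2.3+ launcher accepts k8s://
assert not master_recognised(master, OLD_SCHEMES)   # older launcher rejects it
```

If an older launcher is being picked up, the reported error is expected regardless of which Spark release was downloaded.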
[jira] [Commented] (SPARK-26972) Issue with CSV import and inferSchema set to true
[ https://issues.apache.org/jira/browse/SPARK-26972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784453#comment-16784453 ] Sean Owen commented on SPARK-26972: --- I haven't checked when it's case-insensitive, but to be clear: you should test vs master, with a correct setting of multiLine and lineSep to correctly parse this. So far these seem to explain all the behavior you see. > Issue with CSV import and inferSchema set to true > - > > Key: SPARK-26972 > URL: https://issues.apache.org/jira/browse/SPARK-26972 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.1.3, 2.3.3, 2.4.0 > Environment: Java 8/Scala 2.11/MacOs >Reporter: Jean Georges Perrin >Priority: Major > Attachments: ComplexCsvToDataframeApp.java, > ComplexCsvToDataframeWithSchemaApp.java, books.csv, issue.txt, pom.xml > > > > I found a few discrepencies while working with inferSchema set to true in CSV > ingestion. > Given the following CSV in the attached books.csv: > {noformat} > id;authorId;title;releaseDate;link > 1;1;Fantastic Beasts and Where to Find Them: The Original > Screenplay;11/18/16;http://amzn.to/2kup94P > 2;1;*Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry > Potter; Book 1)*;10/6/15;http://amzn.to/2l2lSwP > 3;1;*The Tales of Beedle the Bard, Standard Edition (Harry > Potter)*;12/4/08;http://amzn.to/2kYezqr > 4;1;*Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry > Potter; Book 2)*;10/4/16;http://amzn.to/2kYhL5n > 5;2;*Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the > Apple; the Coffee; and a Great Database*;4/23/17;http://amzn.to/2i3mthT > 6;2;*Development Tools in 2006: any Room for a 4GL-style Language? 
> An independent study by Jean Georges Perrin, IIUG Board > Member*;12/28/16;http://amzn.to/2vBxOe1 > 7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav > 8;3;A Connecticut Yankee in King Arthur's Court;6/17/17;http://amzn.to/2x1NuoD > 10;4;Jacques le Fataliste;3/1/00;http://amzn.to/2uZj2KA > 11;4;Diderot Encyclopedia: The Complete Illustrations > 1762-1777;;http://amzn.to/2i2zo3I > 12;;A Woman in Berlin;7/11/06;http://amzn.to/2i472WZ > 13;6;Spring Boot in Action;1/3/16;http://amzn.to/2hCPktW > 14;6;Spring in Action: Covers Spring 4;11/28/14;http://amzn.to/2yJLyCk > 15;7;Soft Skills: The software developer's life > manual;12/29/14;http://amzn.to/2zNnSyn > 16;8;Of Mice and Men;;http://amzn.to/2zJjXoc > 17;9;*Java 8 in Action: Lambdas; Streams; and functional-style > programming*;8/28/14;http://amzn.to/2isdqoL > 18;12;Hamlet;6/8/12;http://amzn.to/2yRbewY > 19;13;Pensées;12/31/1670;http://amzn.to/2jweHOG > 20;14;*Fables choisies; mises en vers par M. de La > Fontaine*;9/1/1999;http://amzn.to/2yRH10W > 21;15;Discourse on Method and Meditations on First > Philosophy;6/15/1999;http://amzn.to/2hwB8zc > 22;12;Twelfth Night;7/1/4;http://amzn.to/2zPYnwo > 23;12;Macbeth;7/1/3;http://amzn.to/2zPYnwo{noformat} > And this Java code: > {code:java} > Dataset df = spark.read().format("csv") > .option("header", "true") > .option("multiline", true) > .option("sep", ";") > .option("quote", "*") > .option("dateFormat", "M/d/y") > .option("inferSchema", true) > .load("data/books.csv"); > df.show(7); > df.printSchema(); > {code} > h1. 
In Spark v2.0.1 > Output: > {noformat} > +---+++---++ > | id|authorId| title|releaseDate|link| > +---+++---++ > | 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...| > | 2| 1|Harry Potter and ...|10/6/15|http://amzn.to/2l...| > | 3| 1|The Tales of Beed...|12/4/08|http://amzn.to/2k...| > | 4| 1|Harry Potter and ...|10/4/16|http://amzn.to/2k...| > | 5| 2|Informix 12.10 on...|4/23/17|http://amzn.to/2i...| > | 6| 2|Development Tools...| 12/28/16|http://amzn.to/2v...| > | 7| 3|Adventures of Huc...|. 5/26/94|http://amzn.to/2w...| > +---+++---++ > only showing top 7 rows > Dataframe's schema: > root > |-- id: integer (nullable = true) > |-- authorId: integer (nullable = true) > |-- title: string (nullable = true) > |-- releaseDate: string (nullable = true) > |-- link: string (nullable = true) > {noformat} > *This is fine and the expected output*. > h1. Using Apache Spark v2.1.3 > Excerpt of the dataframe content: > {noformat} > ++++---++ > | id|authorId| title|releaseDate| link| > ++++---++ > | 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...| >
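Sean Owen's advice about multiLine/lineSep (and the custom quote setting) matters because the quoted titles embed the `;` separator. A pure-Python sketch with the standard csv module, not Spark, shows how the quote character changes the field split for one record from books.csv:

```python
import csv
import io

# One record from books.csv: the title is wrapped in '*' quotes and
# contains a ';' of its own.
line = ("2;1;*Harry Potter and the Sorcerer's Stone: The Illustrated "
        "Edition (Harry Potter; Book 1)*;10/6/15;http://amzn.to/2l2lSwP")

# With the default quote character, the ';' inside the title splits the
# record into six fields instead of five.
naive = next(csv.reader(io.StringIO(line), delimiter=";"))

# Declaring '*' as the quote character keeps the title as one field,
# which is what Spark's .option("quote", "*") is meant to achieve.
quoted = next(csv.reader(io.StringIO(line), delimiter=";", quotechar="*"))
```

The same reasoning applies to records whose quoted fields span physical lines, which is where Spark's multiLine option comes in.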
[jira] [Commented] (SPARK-27060) DDL Commands are accepting Keywords like create, drop as tableName
[ https://issues.apache.org/jira/browse/SPARK-27060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784560#comment-16784560 ] Sachin Ramachandra Setty commented on SPARK-27060: -- cc [~srowen] > DDL Commands are accepting Keywords like create, drop as tableName > -- > > Key: SPARK-27060 > URL: https://issues.apache.org/jira/browse/SPARK-27060 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.2, 2.4.0 >Reporter: Sachin Ramachandra Setty >Priority: Major > Fix For: 2.3.2, 2.4.0 > > > Seems to be a compatibility issue compared to other components such as hive > and mySql. > DDL commands are successful even though the tableName is same as keyword. > Tested with columnNames as well and issue exists. > Whereas, Hive-Beeline is throwing ParseException and not accepting keywords > as tableName or columnName and mySql is accepting keywords only as columnName. > Spark-Behaviour : > Connected to: Spark SQL (version 2.3.2.0101) > CLI_DBMS_APPID > Beeline version 1.2.1.spark_2.3.2.0101 by Apache Hive > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table create(id int); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.255 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table drop(int int); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.257 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table drop; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.236 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table create; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.168 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table tab1(float float); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.111 seconds) > 0: jdbc:hive2://10.18.XXX:23040/default> create table double(double float); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.093 seconds) > Hive-Behaviour : > Connected to: Apache Hive (version 
3.1.0) > Driver: Hive JDBC (version 3.1.0) > Transaction isolation: TRANSACTION_REPEATABLE_READ > Beeline version 3.1.0 by Apache Hive > 0: jdbc:hive2://10.18.XXX:21066/> create table create(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:13 > cannot recognize input near 'create' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18.XXX:21066/> create table drop(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:13 > cannot recognize input near 'drop' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18XXX:21066/> create table tab1(float float); > Error: Error while compiling statement: FAILED: ParseException line 1:18 > cannot recognize input near 'float' 'float' ')' in column name or constraint > (state=42000,code=4) > 0: jdbc:hive2://10.18XXX:21066/> drop table create(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:11 > cannot recognize input near 'create' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18.XXX:21066/> drop table drop(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:11 > cannot recognize input near 'drop' '(' 'id' in table name > (state=42000,code=4) > mySql : > CREATE TABLE CREATE(ID integer); > Error: near "CREATE": syntax error > CREATE TABLE DROP(ID integer); > Error: near "DROP": syntax error > CREATE TABLE TAB1(FLOAT FLOAT); > Success
[jira] [Assigned] (SPARK-26923) Refactor ArrowRRunner and RRunner to share the same base
[ https://issues.apache.org/jira/browse/SPARK-26923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26923: Assignee: (was: Apache Spark) > Refactor ArrowRRunner and RRunner to share the same base > > > Key: SPARK-26923 > URL: https://issues.apache.org/jira/browse/SPARK-26923 > Project: Spark > Issue Type: Sub-task > Components: SparkR, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > ArrowRRunner and RRunner already have duplicated code. We should refactor and > deduplicate them. Also, ArrowRRunner happens to have rather hacky code > (see > https://github.com/apache/spark/pull/23787/files#diff-a0b6a11cc2e2299455c795fe3c96b823R61 > ). > We might even be able to deduplicate some code with the PythonRunners.
[jira] [Updated] (SPARK-27059) spark-submit on kubernetes cluster does not recognise k8s --master property
[ https://issues.apache.org/jira/browse/SPARK-27059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andreas Adamides updated SPARK-27059: - Description: I have successfully installed a Kubernetes cluster and can verify this by: {{C:\windows\system32>kubectl cluster-info }} {{*Kubernetes master is running at https://:* }} *{{KubeDNS is running at https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}* Trying to run the SparkPi with the Spark release I downloaded from [https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3) *{{spark-submit --master k8s://https://: --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=2 --conf spark.kubernetes.container.image=gettyimages/spark c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}* I am getting this error: *{{Error: Master must either be yarn or start with spark, mesos, local Run with --help for usage help or --verbose for debug output}}* I also tried: *{{spark-submit --help}}* to see what I can get regarding the *--master* property. This is what I get: *{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.}}* According to the documentation [[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on running Spark workloads in Kubernetes, spark-submit does not even seem to recognise the k8s value for master. 
[ included in possible Spark masters: [https://spark.apache.org/docs/latest/submitting-applications.html#master-urls] ] was: I have successfully installed a Kubernetes cluster and can verify this by: {{C:\windows\system32>kubectl cluster-info }} {{*Kubernetes master is running at https://:* }} *{{KubeDNS is running at https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}* Trying to run the SparkPi with the Spark I downloaded from [https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3) *{{spark-submit --master k8s://https://: --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=2 --conf spark.kubernetes.container.image=gettyimages/spark c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}* I am getting this error: *{{Error: Master must either be yarn or start with spark, mesos, local Run with --help for usage help or --verbose for debug output}}* I also tried: *{{spark-submit --help}}* to see what I can get regarding the *--master* property. This is what I get: *{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.}}* According to the documentation [[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on running Spark workloads in Kubernetes, spark-submit does not even seem to recognise the k8s value for master. 
[ included in possible Spark masters: [https://spark.apache.org/docs/latest/submitting-applications.html#master-urls] ] > spark-submit on kubernetes cluster does not recognise k8s --master property > --- > > Key: SPARK-27059 > URL: https://issues.apache.org/jira/browse/SPARK-27059 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.3, 2.4.0 >Reporter: Andreas Adamides >Priority: Blocker > > I have successfully installed a Kubernetes cluster and can verify this by: > {{C:\windows\system32>kubectl cluster-info }} > {{*Kubernetes master is running at https://:* }} > *{{KubeDNS is running at > https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}* > Trying to run the SparkPi with the Spark release I downloaded from > [https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3) > *{{spark-submit --master k8s://https://: --deploy-mode cluster > --name spark-pi --class org.apache.spark.examples.SparkPi --conf > spark.executor.instances=2 --conf > spark.kubernetes.container.image=gettyimages/spark > c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}* > I am getting this error: > *{{Error: Master must either be yarn or start with spark, mesos, local Run > with --help for usage help or --verbose for debug output}}* > I also tried: > *{{spark-submit --help}}* > to see what I can get regarding the *--master* property. This is what I get: > *{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or > local.}}* > > According to the documentation > [[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on > running Spark workloads in Kubernetes, spark-submit does not even seem to > recognise the k8s value for master. [ included in possible Spark masters: >
[jira] [Commented] (SPARK-26998) spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor processes in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-26998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784391#comment-16784391 ] Gabor Somogyi commented on SPARK-26998: --- [~toopt4] thanks for the info. Are you working on this? If not, I'm happy to push the solution forward. > spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor > processes in Standalone mode > --- > > Key: SPARK-26998 > URL: https://issues.apache.org/jira/browse/SPARK-26998 > Project: Spark > Issue Type: Bug > Components: Scheduler, Security, Spark Core >Affects Versions: 2.3.3, 2.4.0 >Reporter: t oo >Priority: Major > Labels: SECURITY, Security, secur, security, security-issue > > Run spark standalone mode, then start a spark-submit requiring at least 1 > executor. Do a 'ps -ef' on linux (ie putty terminal) and you will be able to > see spark.ssl.keyStorePassword value in plaintext! > > spark.ssl.keyStorePassword and spark.ssl.keyPassword don't need to be passed > to CoarseGrainedExecutorBackend. Only spark.ssl.trustStorePassword is used. > > Can be resolved if below PR is merged: > [[Github] Pull Request #21514 > (tooptoop4)|https://github.com/apache/spark/pull/21514]
[jira] [Commented] (SPARK-26998) spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor processes in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-26998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784406#comment-16784406 ] Gabor Somogyi commented on SPARK-26998: --- {quote} Can be resolved if below PR is merged: [[Github] Pull Request #21514 (tooptoop4)|https://github.com/apache/spark/pull/21514] {quote} I think that's just not true. #21514 solves a UI problem where application 'name' URLs point to http instead of https (even when SSL is enabled). Have I missed something? > spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor > processes in Standalone mode > --- > > Key: SPARK-26998 > URL: https://issues.apache.org/jira/browse/SPARK-26998 > Project: Spark > Issue Type: Bug > Components: Scheduler, Security, Spark Core >Affects Versions: 2.3.3, 2.4.0 >Reporter: t oo >Priority: Major > Labels: SECURITY, Security, secur, security, security-issue > > Run spark standalone mode, then start a spark-submit requiring at least 1 > executor. Do a 'ps -ef' on linux (ie putty terminal) and you will be able to > see spark.ssl.keyStorePassword value in plaintext! > > spark.ssl.keyStorePassword and spark.ssl.keyPassword don't need to be passed > to CoarseGrainedExecutorBackend. Only spark.ssl.trustStorePassword is used. > > Can be resolved if below PR is merged: > [[Github] Pull Request #21514 > (tooptoop4)|https://github.com/apache/spark/pull/21514]
[jira] [Commented] (SPARK-26998) spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor processes in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-26998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784410#comment-16784410 ] Gabor Somogyi commented on SPARK-26998: --- Ahaaa, I see now. Two problems were being solved in one PR. > spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor > processes in Standalone mode > --- > > Key: SPARK-26998 > URL: https://issues.apache.org/jira/browse/SPARK-26998 > Project: Spark > Issue Type: Bug > Components: Scheduler, Security, Spark Core >Affects Versions: 2.3.3, 2.4.0 >Reporter: t oo >Priority: Major > Labels: SECURITY, Security, secur, security, security-issue > > Run spark standalone mode, then start a spark-submit requiring at least 1 > executor. Do a 'ps -ef' on linux (ie putty terminal) and you will be able to > see spark.ssl.keyStorePassword value in plaintext! > > spark.ssl.keyStorePassword and spark.ssl.keyPassword don't need to be passed > to CoarseGrainedExecutorBackend. Only spark.ssl.trustStorePassword is used. > > Can be resolved if below PR is merged: > [[Github] Pull Request #21514 > (tooptoop4)|https://github.com/apache/spark/pull/21514]
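The exposure being discussed here is generic to command-line arguments: anything in a process's argv is readable by any local user via /proc, which is exactly what `ps -ef` prints. A Linux-only sketch with a fake stand-in secret (no Spark involved; the `--conf` string below is only a mock of what standalone mode passes to CoarseGrainedExecutorBackend):

```python
import subprocess
import sys
import time

# Spawn a child whose argv carries a fake secret, standing in for
# -Dspark.ssl.keyStorePassword=... on an executor's command line.
child = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(10)",
     "--conf", "spark.ssl.keyStorePassword=not-a-real-secret"]
)
time.sleep(0.5)  # give the child a moment to exec

# /proc/<pid>/cmdline is world-readable; 'ps -ef' shows the same data.
with open(f"/proc/{child.pid}/cmdline", "rb") as f:
    cmdline = f.read().replace(b"\x00", b" ").decode()

child.kill()
child.wait()
```

The `cmdline` string contains the fake password in plaintext, which is why keeping such settings out of executor argv (e.g. passing only what the executor actually needs) closes the hole.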
[jira] [Updated] (SPARK-27059) spark-submit on kubernetes cluster does not recognise k8s --master property
[ https://issues.apache.org/jira/browse/SPARK-27059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andreas Adamides updated SPARK-27059: - Description: I have successfully installed a Kubernetes cluster and can verify this by: {{C:\windows\system32>kubectl cluster-info }} {{*Kubernetes master is running at https://:* }} *{{KubeDNS is running at https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}* Trying to run the SparkPi with the Spark I downloaded from [https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3) *{{spark-submit --master k8s://https://: --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=2 --conf spark.kubernetes.container.image=gettyimages/spark c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}* I am getting this error: *{{Error: Master must either be yarn or start with spark, mesos, local Run with --help for usage help or --verbose for debug output}}* I also tried: *{{spark-submit --help}}* to see what I can get regarding the *--master* property. This is what I get: *{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.}}* According to the documentation [[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on running Spark workloads in Kubernetes, spark-submit does not even seem to recognise the k8s value for master. 
[ included in possible Spark masters: [https://spark.apache.org/docs/latest/submitting-applications.html#master-urls] ] was: I have successfully installed a Kubernetes cluster and can verify this by: {{C:\windows\system32>kubectl cluster-info }} {{*Kubernetes master is running at https://:* }} {{ *{{KubeDNS is running at https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}*}} Trying to run the SparkPi with the Spark I downloaded from [https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3) *{{spark-submit --master k8s://https://: --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=2 --conf spark.kubernetes.container.image=gettyimages/spark c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}* I am getting this error: *{{Error: Master must either be yarn or start with spark, mesos, local Run with --help for usage help or --verbose for debug output}}* I also tried: *{{spark-submit --help}}* to see what I can get regarding the *--master* property. This is what I get: *{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.}}* According to the documentation [[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on running Spark workloads in Kubernetes, spark-submit does not even seem to recognise the k8s value for master. 
[ included in possible Spark masters: [https://spark.apache.org/docs/latest/submitting-applications.html#master-urls] ] > spark-submit on kubernetes cluster does not recognise k8s --master property > --- > > Key: SPARK-27059 > URL: https://issues.apache.org/jira/browse/SPARK-27059 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.3, 2.4.0 >Reporter: Andreas Adamides >Priority: Blocker > > I have successfully installed a Kubernetes cluster and can verify this by: > {{C:\windows\system32>kubectl cluster-info }} > {{*Kubernetes master is running at https://:* }} > *{{KubeDNS is running at > https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}* > Trying to run the SparkPi with the Spark I downloaded from > [https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3) > *{{spark-submit --master k8s://https://: --deploy-mode cluster > --name spark-pi --class org.apache.spark.examples.SparkPi --conf > spark.executor.instances=2 --conf > spark.kubernetes.container.image=gettyimages/spark > c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}* > > I am getting this error: > *{{Error: Master must either be yarn or start with spark, mesos, local Run > with --help for usage help or --verbose for debug output}}* > I also tried: > *{{spark-submit --help}}* > to see what I can get regarding the *--master* property. This is what I get: > *{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or > local.}}* > > According to the documentation > [[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on > running Spark workloads in Kubernetes, spark-submit does not even seem to > recognise the k8s value for master. [ included in possible Spark masters: >
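For context, the quoted error comes from spark-submit's up-front master-URL validation. The sketch below is illustrative Python only (not Spark's actual Scala code, and `validate_master` is a hypothetical name): Spark 2.3+ recognizes the k8s:// prefix, so a client rejecting it with exactly this message usually means an older spark-submit binary is being picked up first on the PATH.

```python
# Illustrative sketch only: a master-URL check modeled on the error message
# quoted above. Spark 2.3+ accepts the k8s:// prefix; a pre-2.3 client's
# prefix list lacks it and rejects the URL with the reported message.
KNOWN_PREFIXES = ("spark://", "mesos://", "k8s://")

def validate_master(master: str) -> str:
    """Accept yarn, local[...], and known cluster-manager URL prefixes."""
    if master == "yarn" or master.startswith("local"):
        return master
    if any(master.startswith(p) for p in KNOWN_PREFIXES):
        return master
    raise ValueError(
        "Master must either be yarn or start with spark, mesos, local")

print(validate_master("k8s://https://192.168.0.10:6443"))
```

Running `spark-submit --version` and `which spark-submit` confirms which binary is actually being invoked.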
[jira] [Comment Edited] (SPARK-26602) Insert into table fails after querying the UDF which is loaded with wrong hdfs path
[ https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784134#comment-16784134 ] Chakravarthi edited comment on SPARK-26602 at 3/5/19 1:52 PM: -- Hi [~srowen], this issue is not a duplicate of SPARK-26560. The issue here is that an insert into a table fails after querying a UDF which was loaded with a wrong HDFS path. Below are the steps to reproduce this issue: 1) Create a table: sql("create table table1(I int)"); 2) Create a UDF using an invalid HDFS path: sql("CREATE FUNCTION before_fix AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFLastDayTest' USING JAR 'hdfs:///tmp/notexist.jar'") 3) Do a select on the UDF and you will get an exception: "Failed to read external resource". sql(" select before_fix('2018-03-09')"). 4) Perform an insert into the table: sql("insert into table1 values(1)").show The insert should work here, but it fails. was (Author: chakravarthi): Hi [~srowen] , this issue is not duplicate of SPARK-26560. Here the issue is,Insert into table fails after querying the UDF which is loaded with wrong hdfs path. Below are the steps to reproduce this issue: 1) create a table. sql("create table check_udf(I int)"); 2) create udf using invalid hdfs path. sql("CREATE FUNCTION before_fix AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFLastDayTest' USING JAR 'hdfs:///tmp/notexist.jar'") 3) Do select on the UDF and you will get exception as "Failed to read external resource". sql(" select before_fix('2018-03-09')"). 4) perform insert table. sql("insert into check_udf values(1)").show Here ,insert should work.but is fails. > Insert into table fails after querying the UDF which is loaded with wrong > hdfs path > --- > > Key: SPARK-26602 > URL: https://issues.apache.org/jira/browse/SPARK-26602 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Haripriya >Priority: Major > Attachments: beforeFixUdf.txt > > > In sql, > 1.Query the existing udf(say myFunc1) > 2.
create and select the udf registered with incorrect path (say myFunc2) > 3.Now again query the existing udf in the same session - Wil throw exception > stating that couldn't read resource of myFunc2's path > 4.Even the basic operations like insert and select will fail giving the same > error > Result: > java.lang.RuntimeException: Failed to read external resource > hdfs:///tmp/hari_notexists1/two_udfs.jar > at > org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288) > at > org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149) > at > org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696) > at > org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841) > at > org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional 
commands, e-mail: issues-h...@spark.apache.org
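The failure mode in the steps above can be modeled without Spark at all: session state keeps a list of registered resources and re-resolves every entry before each subsequent command, so a single unreachable jar poisons the whole session. A toy sketch (hypothetical `ToySession` class, not Hive/Spark code):

```python
# Toy model (not Spark/Hive code) of session-level resource poisoning:
# every command first re-resolves all resources registered in the session,
# so one bad ADD JAR makes even unrelated queries fail.
class ToySession:
    def __init__(self):
        self.resources = []

    def add_jar(self, path):
        self.resources.append(path)

    def run(self, statement):
        for r in self.resources:  # re-resolve resources before each command
            if "notexist" in r:   # stand-in for an HDFS read failure
                raise RuntimeError(f"Failed to read external resource {r}")
        return f"OK: {statement}"

s = ToySession()
s.add_jar("hdfs:///tmp/notexist.jar")
try:
    s.run("insert into table1 values(1)")  # does not reference the bad UDF
except RuntimeError as e:
    print(e)  # Failed to read external resource hdfs:///tmp/notexist.jar
```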
[jira] [Created] (SPARK-27060) DDL Commands are accepting Keywords like create, drop as tableName
Sachin Ramachandra Setty created SPARK-27060: Summary: DDL Commands are accepting Keywords like create, drop as tableName Key: SPARK-27060 URL: https://issues.apache.org/jira/browse/SPARK-27060 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0, 2.3.2 Reporter: Sachin Ramachandra Setty Fix For: 2.4.0, 2.3.2 Seems to be a compatibility issue compared to other components such as hive and mySql. DDL commands are successful even though the tableName is same as keyword. Tested with columnNames as well and issue exists. Whereas, Hive-Beeline is throwing ParseException and not accepting keywords as tableName or columnName and mySql is accepting keywords only as columnName. Spark-Behaviour : Connected to: Spark SQL (version 2.3.2.0101) CLI_DBMS_APPID Beeline version 1.2.1.spark_2.3.2.0101 by Apache Hive 0: jdbc:hive2://10.18.3.XXX:23040/default> create table create(id int); +-+--+ | Result | +-+--+ +-+--+ No rows selected (0.255 seconds) 0: jdbc:hive2://10.18.3.XXX:23040/default> create table drop(int int); +-+--+ | Result | +-+--+ +-+--+ No rows selected (0.257 seconds) 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table drop; +-+--+ | Result | +-+--+ +-+--+ No rows selected (0.236 seconds) 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table create; +-+--+ | Result | +-+--+ +-+--+ No rows selected (0.168 seconds) 0: jdbc:hive2://10.18.3.XXX:23040/default> create table tab1(float float); +-+--+ | Result | +-+--+ +-+--+ No rows selected (0.111 seconds) 0: jdbc:hive2://10.18.XXX:23040/default> create table double(double float); +-+--+ | Result | +-+--+ +-+--+ No rows selected (0.093 seconds) Hive-Behaviour : Connected to: Apache Hive (version 3.1.0) Driver: Hive JDBC (version 3.1.0) Transaction isolation: TRANSACTION_REPEATABLE_READ Beeline version 3.1.0 by Apache Hive 0: jdbc:hive2://10.18.XXX:21066/> create table create(id int); Error: Error while compiling statement: FAILED: ParseException line 1:13 cannot recognize input near 'create' '(' 
'id' in table name (state=42000,code=4) 0: jdbc:hive2://10.18.XXX:21066/> create table drop(id int); Error: Error while compiling statement: FAILED: ParseException line 1:13 cannot recognize input near 'drop' '(' 'id' in table name (state=42000,code=4) 0: jdbc:hive2://10.18XXX:21066/> create table tab1(float float); Error: Error while compiling statement: FAILED: ParseException line 1:18 cannot recognize input near 'float' 'float' ')' in column name or constraint (state=42000,code=4) 0: jdbc:hive2://10.18XXX:21066/> drop table create(id int); Error: Error while compiling statement: FAILED: ParseException line 1:11 cannot recognize input near 'create' '(' 'id' in table name (state=42000,code=4) 0: jdbc:hive2://10.18.XXX:21066/> drop table drop(id int); Error: Error while compiling statement: FAILED: ParseException line 1:11 cannot recognize input near 'drop' '(' 'id' in table name (state=42000,code=4) mySql : CREATE TABLE CREATE(ID integer); Error: near "CREATE": syntax error CREATE TABLE DROP(ID integer); Error: near "DROP": syntax error CREATE TABLE TAB1(FLOAT FLOAT); Success -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26602) Insert into table fails after querying the UDF which is loaded with wrong hdfs path
[ https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784526#comment-16784526 ] Chakravarthi commented on SPARK-26602: -- [~srowen] Agreed, but it should not make other subsequent queries (at least queries that do not refer to that UDF) fail, right? Any insert or select on the existing table itself is failing. [~ajithshetty] Yes, it makes all subsequent queries fail, not only the query which refers to that UDF. > Insert into table fails after querying the UDF which is loaded with wrong > hdfs path > --- > > Key: SPARK-26602 > URL: https://issues.apache.org/jira/browse/SPARK-26602 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Haripriya >Priority: Major > Attachments: beforeFixUdf.txt > > > In sql, > 1.Query the existing udf(say myFunc1) > 2. create and select the udf registered with incorrect path (say myFunc2) > 3.Now again query the existing udf in the same session - Wil throw exception > stating that couldn't read resource of myFunc2's path > 4.Even the basic operations like insert and select will fail giving the same > error > Result: > java.lang.RuntimeException: Failed to read external resource > hdfs:///tmp/hari_notexists1/two_udfs.jar > at > org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288) > at > org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149) > at > org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706) > at >
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696) > at > org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841) > at > org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26602) Insert into table fails after querying the UDF which is loaded with wrong hdfs path
[ https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784528#comment-16784528 ] Sean Owen commented on SPARK-26602: --- If a user adds something to the classpath, it matters to the whole classpath. If it's missing, I think it's surprising to ignore that fact. Something else will fail eventually. I understand you're saying, what if it doesn't affect some other UDFs? but I'm not sure we can know that. I would not make this change. > Insert into table fails after querying the UDF which is loaded with wrong > hdfs path > --- > > Key: SPARK-26602 > URL: https://issues.apache.org/jira/browse/SPARK-26602 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Haripriya >Priority: Major > Attachments: beforeFixUdf.txt > > > In sql, > 1.Query the existing udf(say myFunc1) > 2. create and select the udf registered with incorrect path (say myFunc2) > 3.Now again query the existing udf in the same session - Wil throw exception > stating that couldn't read resource of myFunc2's path > 4.Even the basic operations like insert and select will fail giving the same > error > Result: > java.lang.RuntimeException: Failed to read external resource > hdfs:///tmp/hari_notexists1/two_udfs.jar > at > org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288) > at > org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149) > at > org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706) > at > 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696) > at > org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841) > at > org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27005) Design sketch: Accelerator-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784555#comment-16784555 ] Thomas Graves commented on SPARK-27005: --- So we have both a Google design doc and the comment above; can you consolidate them into one place? The Google doc might be easier to comment on. > Design sketch: Accelerator-aware scheduling > --- > > Key: SPARK-27005 > URL: https://issues.apache.org/jira/browse/SPARK-27005 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Priority: Major > > This task is to outline a design sketch for the accelerator-aware scheduling > SPIP discussion.
[jira] [Updated] (SPARK-26602) Subsequent queries are failing after querying the UDF which is loaded with wrong hdfs path
[ https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chakravarthi updated SPARK-26602: - Summary: Subsequent queries are failing after querying the UDF which is loaded with wrong hdfs path (was: Insert into table fails after querying the UDF which is loaded with wrong hdfs path) > Subsequent queries are failing after querying the UDF which is loaded with > wrong hdfs path > -- > > Key: SPARK-26602 > URL: https://issues.apache.org/jira/browse/SPARK-26602 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Haripriya >Priority: Major > Attachments: beforeFixUdf.txt > > > In sql, > 1.Query the existing udf(say myFunc1) > 2. create and select the udf registered with incorrect path (say myFunc2) > 3.Now again query the existing udf in the same session - Wil throw exception > stating that couldn't read resource of myFunc2's path > 4.Even the basic operations like insert and select will fail giving the same > error > Result: > java.lang.RuntimeException: Failed to read external resource > hdfs:///tmp/hari_notexists1/two_udfs.jar > at > org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288) > at > org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149) > at > org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275) > at > 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696) > at > org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841) > at > org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23521) SPIP: Standardize SQL logical plans with DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-23521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-23521: -- Attachment: SPIP_ Standardize logical plans.pdf > SPIP: Standardize SQL logical plans with DataSourceV2 > - > > Key: SPARK-23521 > URL: https://issues.apache.org/jira/browse/SPARK-23521 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > Labels: SPIP > Attachments: SPIP_ Standardize logical plans.pdf > > > Executive Summary: This SPIP is based on [discussion about the DataSourceV2 > implementation|https://lists.apache.org/thread.html/55676ec1f5039d3deaf347d391cf82fe8574b8fa4eeab70110ed5b2b@%3Cdev.spark.apache.org%3E] > on the dev list. The proposal is to standardize the logical plans used for > write operations to make the planner more maintainable and to make Spark's > write behavior predictable and reliable. It proposes the following principles: > # Use well-defined logical plan nodes for all high-level operations: insert, > create, CTAS, overwrite table, etc. > # Use planner rules that match on these high-level nodes, so that it isn’t > necessary to create rules to match each eventual code path individually. > # Clearly define Spark’s behavior for these logical plan nodes. Physical > nodes should implement that behavior so that all code paths eventually make > the same guarantees. > # Specialize implementation when creating a physical plan, not logical > plans. This will avoid behavior drift and ensure planner code is shared > across physical implementations. > The SPIP doc presents a small but complete set of those high-level logical > operations, most of which are already defined in SQL or implemented by some > write path in Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23521) SPIP: Standardize SQL logical plans with DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-23521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784736#comment-16784736 ] Ryan Blue commented on SPARK-23521: --- I've turned off commenting on the google doc to preserve its state, with the existing comments. I'm also adding a PDF of the final proposal to this issue. > SPIP: Standardize SQL logical plans with DataSourceV2 > - > > Key: SPARK-23521 > URL: https://issues.apache.org/jira/browse/SPARK-23521 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > Labels: SPIP > Attachments: SPIP_ Standardize logical plans.pdf > > > Executive Summary: This SPIP is based on [discussion about the DataSourceV2 > implementation|https://lists.apache.org/thread.html/55676ec1f5039d3deaf347d391cf82fe8574b8fa4eeab70110ed5b2b@%3Cdev.spark.apache.org%3E] > on the dev list. The proposal is to standardize the logical plans used for > write operations to make the planner more maintainable and to make Spark's > write behavior predictable and reliable. It proposes the following principles: > # Use well-defined logical plan nodes for all high-level operations: insert, > create, CTAS, overwrite table, etc. > # Use planner rules that match on these high-level nodes, so that it isn’t > necessary to create rules to match each eventual code path individually. > # Clearly define Spark’s behavior for these logical plan nodes. Physical > nodes should implement that behavior so that all code paths eventually make > the same guarantees. > # Specialize implementation when creating a physical plan, not logical > plans. This will avoid behavior drift and ensure planner code is shared > across physical implementations. > The SPIP doc presents a small but complete set of those high-level logical > operations, most of which are already defined in SQL or implemented by some > write path in Spark. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
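The SPIP's first two principles can be sketched in a few lines. This is illustrative Python only (node and rule names are hypothetical, loosely echoing the proposal; it is not Spark's planner): each high-level write operation gets its own well-defined logical node, and one planner rule matches on those nodes, specializing behavior only when producing the physical plan:

```python
from dataclasses import dataclass

# One well-defined logical node per high-level write operation.
class LogicalPlan:
    pass

@dataclass
class AppendData(LogicalPlan):
    table: str

@dataclass
class OverwriteByExpression(LogicalPlan):
    table: str
    delete_expr: str

def plan_physically(node: LogicalPlan) -> str:
    """Planner rule: match on the high-level node; specialize behavior here
    (physical planning), not in the logical plan itself."""
    if isinstance(node, AppendData):
        return f"AppendDataExec({node.table})"
    if isinstance(node, OverwriteByExpression):
        return f"OverwriteByExpressionExec({node.table}, {node.delete_expr})"
    raise NotImplementedError(type(node).__name__)

print(plan_physically(AppendData("db.t")))  # AppendDataExec(db.t)
```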
[jira] [Resolved] (SPARK-27067) SPIP: Catalog API for table metadata
[ https://issues.apache.org/jira/browse/SPARK-27067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved SPARK-27067. --- Resolution: Fixed I'm resolving this issue because the vote to adopt the proposal passed. I've added links to the google doc proposal (now view-only) and vote thread, and uploaded a copy of the proposal as a PDF. > SPIP: Catalog API for table metadata > > > Key: SPARK-27067 > URL: https://issues.apache.org/jira/browse/SPARK-27067 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > Labels: SPIP > Attachments: SPIP_ Spark API for Table Metadata.pdf > > > Goal: Define a catalog API to create, alter, load, and drop tables -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27012) Storage tab shows rdd details even after executor ended
[ https://issues.apache.org/jira/browse/SPARK-27012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-27012. Resolution: Fixed Assignee: Ajith S Fix Version/s: 3.0.0 > Storage tab shows rdd details even after executor ended > --- > > Key: SPARK-27012 > URL: https://issues.apache.org/jira/browse/SPARK-27012 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.3.3, 3.0.0 >Reporter: Ajith S >Assignee: Ajith S >Priority: Major > Fix For: 3.0.0 > > > > After we cache a table, we can see its details in Storage Tab of spark UI. If > the executor has shutdown ( graceful shutdown/ Dynamic executor scenario) UI > still shows the rdd as cached and when we click the link it throws error. > This is because on executor remove event, we fail to adjust rdd partition > details. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13091) Rewrite/Propagate constraints for Aliases
[ https://issues.apache.org/jira/browse/SPARK-13091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784776#comment-16784776 ] Ajith S commented on SPARK-13091: - Can this document be made accessible? [https://docs.google.com/document/d/1WQRgDurUBV9Y6CWOBS75PQIqJwT-6WftVa18xzm7nCo/edit#heading=h.6hjcndo36qze] > Rewrite/Propagate constraints for Aliases > - > > Key: SPARK-13091 > URL: https://issues.apache.org/jira/browse/SPARK-13091 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Sameer Agarwal >Assignee: Sameer Agarwal >Priority: Major > Fix For: 2.0.0 > > > We'd want to duplicate constraints when there is an alias (i.e. for "SELECT > a, a AS b", any constraints on a now apply to b) > This is a follow up task based on [~marmbrus]'s suggestion in > https://docs.google.com/document/d/1WQRgDurUBV9Y6CWOBS75PQIqJwT-6WftVa18xzm7nCo/edit#heading=h.6hjcndo36qze
[jira] [Commented] (SPARK-26998) spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor processes in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-26998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784791#comment-16784791 ] t oo commented on SPARK-26998: -- [~gsomogyi] please take it forward. [~kabhwan] truststore password being shown is not much of a problem since truststore is often distributed to users anyway. But keystore password still being shown is the big no-no. > spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor > processes in Standalone mode > --- > > Key: SPARK-26998 > URL: https://issues.apache.org/jira/browse/SPARK-26998 > Project: Spark > Issue Type: Bug > Components: Scheduler, Security, Spark Core >Affects Versions: 2.3.3, 2.4.0 >Reporter: t oo >Priority: Major > Labels: SECURITY, Security, secur, security, security-issue > > Run spark standalone mode, then start a spark-submit requiring at least 1 > executor. Do a 'ps -ef' on linux (ie putty terminal) and you will be able to > see spark.ssl.keyStorePassword value in plaintext! > > spark.ssl.keyStorePassword and spark.ssl.keyPassword don't need to be passed > to CoarseGrainedExecutorBackend. Only spark.ssl.trustStorePassword is used. > > Can be resolved if below PR is merged: > [[Github] Pull Request #21514 > (tooptoop4)|https://github.com/apache/spark/pull/21514] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
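One way to realize what the reporter is asking for is to drop the password-bearing `--conf` pairs from the executor launch command before the process is forked, since per the issue description only spark.ssl.trustStorePassword is actually needed by CoarseGrainedExecutorBackend. A hypothetical sketch (not Spark's actual code; only the conf key names are taken from the issue):

```python
# Hypothetical sketch: filter keystore/key password conf entries out of the
# argument vector handed to an executor, so they never appear in `ps -ef`.
SENSITIVE_KEYS = {"spark.ssl.keyStorePassword", "spark.ssl.keyPassword"}

def strip_sensitive_conf(args):
    """Drop `--conf key=value` pairs whose key holds a password the
    executor does not need."""
    kept, i = [], 0
    while i < len(args):
        if args[i] == "--conf" and i + 1 < len(args):
            key = args[i + 1].split("=", 1)[0]
            if key in SENSITIVE_KEYS:
                i += 2  # skip both the flag and its value
                continue
        kept.append(args[i])
        i += 1
    return kept

cmd = ["--conf", "spark.ssl.keyStorePassword=secret",
       "--conf", "spark.ssl.trustStorePassword=tsp",
       "--class", "org.apache.spark.executor.CoarseGrainedExecutorBackend"]
print(strip_sensitive_conf(cmd))
```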
[jira] [Comment Edited] (SPARK-27063) Spark on K8S Integration Tests timeouts are too short for some test clusters
[ https://issues.apache.org/jira/browse/SPARK-27063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784714#comment-16784714 ] Stavros Kontopoulos edited comment on SPARK-27063 at 3/5/19 5:52 PM: - Yes, another thing I noticed is that pulling the images may take time and the tests will time out. Also, in this [PR|https://github.com/apache/spark/pull/23514] I set the patience differently because some tests may run too fast, for good or bad. was (Author: skonto): Yes some other things that I noticed is when the images are pulled this may take time and tests will expire. Also in this [PR|https://github.com/apache/spark/pull/23514] I set patience differently because some tests may run too fast for good or bad. > Spark on K8S Integration Tests timeouts are too short for some test clusters > > > Key: SPARK-27063 > URL: https://issues.apache.org/jira/browse/SPARK-27063 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Rob Vesse >Priority: Minor > > As noted during development for SPARK-26729 there are a couple of integration > test timeouts that are too short when running on slower clusters e.g. > developers laptops, small CI clusters etc > [~skonto] confirmed that he has also experienced this behaviour in the > discussion on PR [PR > 23846|https://github.com/apache/spark/pull/23846#discussion_r262564938] > We should up the defaults of this timeouts as an initial step and longer term > consider making the timeouts themselves configurable
[jira] [Commented] (SPARK-27063) Spark on K8S Integration Tests timeouts are too short for some test clusters
[ https://issues.apache.org/jira/browse/SPARK-27063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784714#comment-16784714 ] Stavros Kontopoulos commented on SPARK-27063: - Yes some other things that I noticed is when the images are pulled this may take time and tests will expire. Also in this [PR|https://github.com/apache/spark/pull/23514] I set patience differently because some tests may run too fast for good or bad. > Spark on K8S Integration Tests timeouts are too short for some test clusters > > > Key: SPARK-27063 > URL: https://issues.apache.org/jira/browse/SPARK-27063 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Rob Vesse >Priority: Minor > > As noted during development for SPARK-26729 there are a couple of integration > test timeouts that are too short when running on slower clusters e.g. > developers laptops, small CI clusters etc > [~skonto] confirmed that he has also experienced this behaviour in the > discussion on PR [PR > 23846|https://github.com/apache/spark/pull/23846#discussion_r262564938] > We should up the defaults of this timeouts as an initial step and longer term > consider making the timeouts themselves configurable -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27067) SPIP: Catalog API for table metadata
[ https://issues.apache.org/jira/browse/SPARK-27067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-27067: -- Attachment: SPIP_ Spark API for Table Metadata.pdf > SPIP: Catalog API for table metadata > > > Key: SPARK-27067 > URL: https://issues.apache.org/jira/browse/SPARK-27067 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > Labels: SPIP > Attachments: SPIP_ Spark API for Table Metadata.pdf > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27059) spark-submit on kubernetes cluster does not recognise k8s --master property
[ https://issues.apache.org/jira/browse/SPARK-27059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-27059. Resolution: Invalid > spark-submit on kubernetes cluster does not recognise k8s --master property > --- > > Key: SPARK-27059 > URL: https://issues.apache.org/jira/browse/SPARK-27059 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.3, 2.4.0 >Reporter: Andreas Adamides >Priority: Blocker > > I have successfully installed a Kubernetes cluster and can verify this by: > {{C:\windows\system32>kubectl cluster-info }} > {{*Kubernetes master is running at https://:* }} > *{{KubeDNS is running at > https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}* > Trying to run the SparkPi with the Spark release I downloaded from > [https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3) > *{{spark-submit --master k8s://https://: --deploy-mode cluster > --name spark-pi --class org.apache.spark.examples.SparkPi --conf > spark.executor.instances=2 --conf > spark.kubernetes.container.image=gettyimages/spark > c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}* > I am getting this error: > *{{Error: Master must either be yarn or start with spark, mesos, local Run > with --help for usage help or --verbose for debug output}}* > I also tried: > *{{spark-submit --help}}* > to see what I can get regarding the *--master* property. This is what I get: > *{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or > local.}}* > > According to the documentation > [[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on > running Spark workloads in Kubernetes, spark-submit does not even seem to > recognise the k8s value for master. 
[ included in possible Spark masters: > [https://spark.apache.org/docs/latest/submitting-applications.html#master-urls] > ] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27059) spark-submit on kubernetes cluster does not recognise k8s --master property
[ https://issues.apache.org/jira/browse/SPARK-27059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784762#comment-16784762 ] Marcelo Vanzin commented on SPARK-27059: Sounds like a problem with your system. Maybe your PATH has the wrong {{spark-submit}} in it. > spark-submit on kubernetes cluster does not recognise k8s --master property > --- > > Key: SPARK-27059 > URL: https://issues.apache.org/jira/browse/SPARK-27059 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.3, 2.4.0 >Reporter: Andreas Adamides >Priority: Blocker > > I have successfully installed a Kubernetes cluster and can verify this by: > {{C:\windows\system32>kubectl cluster-info }} > {{*Kubernetes master is running at https://:* }} > *{{KubeDNS is running at > https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}* > Trying to run the SparkPi with the Spark release I downloaded from > [https://spark.apache.org/downloads.html] .(I tried versions 2.4.0 and 2.3.3) > *{{spark-submit --master k8s://https://: --deploy-mode cluster > --name spark-pi --class org.apache.spark.examples.SparkPi --conf > spark.executor.instances=2 --conf > spark.kubernetes.container.image=gettyimages/spark > c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}* > I am getting this error: > *{{Error: Master must either be yarn or start with spark, mesos, local Run > with --help for usage help or --verbose for debug output}}* > I also tried: > *{{spark-submit --help}}* > to see what I can get regarding the *--master* property. This is what I get: > *{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or > local.}}* > > According to the documentation > [[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on > running Spark workloads in Kubernetes, spark-submit does not even seem to > recognise the k8s value for master. 
[ included in possible Spark masters: > [https://spark.apache.org/docs/latest/submitting-applications.html#master-urls] > ] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27059) spark-submit on kubernetes cluster does not recognise k8s --master property
[ https://issues.apache.org/jira/browse/SPARK-27059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784758#comment-16784758 ] Andreas Adamides commented on SPARK-27059: -- Indeed, in Spark 2.4.0 and 2.3.3, running *spark-submit --version* returns "version 2.2.1" (as does spark-shell). So if not from the official Spark download page, where would I download the latest advertised Spark version that supports Kubernetes? > spark-submit on kubernetes cluster does not recognise k8s --master property > --- > > Key: SPARK-27059 > URL: https://issues.apache.org/jira/browse/SPARK-27059 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.3, 2.4.0 >Reporter: Andreas Adamides >Priority: Blocker > > I have successfully installed a Kubernetes cluster and can verify this by: > {{C:\windows\system32>kubectl cluster-info }} > {{*Kubernetes master is running at https://:* }} > *{{KubeDNS is running at > https://:/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy}}* > Trying to run SparkPi with the Spark release I downloaded from > [https://spark.apache.org/downloads.html] (I tried versions 2.4.0 and 2.3.3) > *{{spark-submit --master k8s://https://: --deploy-mode cluster > --name spark-pi --class org.apache.spark.examples.SparkPi --conf > spark.executor.instances=2 --conf > spark.kubernetes.container.image=gettyimages/spark > c:\users\\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar}}* > I am getting this error: > *{{Error: Master must either be yarn or start with spark, mesos, local Run > with --help for usage help or --verbose for debug output}}* > I also tried: > *{{spark-submit --help}}* > to see what I can get regarding the *--master* property. 
This is what I get: > *{{--master MASTER_URL spark://host:port, mesos://host:port, yarn, or > local.}}* > > According to the documentation > [[https://spark.apache.org/docs/latest/running-on-kubernetes.html]] on > running Spark workloads in Kubernetes, spark-submit does not even seem to > recognise the k8s value for master. [ included in possible Spark masters: > [https://spark.apache.org/docs/latest/submitting-applications.html#master-urls] > ] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
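Marcelo Vanzin's diagnosis above (a stale `spark-submit` on the PATH) fits the reported error: the `k8s://` scheme was only added to master-URL validation in Spark 2.3 (SPARK-18278), so a 2.2.1 binary rejects it with exactly the "Master must either be yarn or start with spark, mesos, local" message. The following is a hypothetical Java sketch, not Spark's actual validation code, and `acceptsMaster` is an invented helper, illustrating the pre-/post-2.3 behaviour:

```java
// Hypothetical sketch of master-URL validation: Spark <= 2.2 only knew
// yarn/spark/mesos/local, so a stale 2.2.1 spark-submit rejects "k8s://...".
public class MasterCheck {
    static boolean acceptsMaster(String master, boolean k8sSupported) {
        if (master.equals("yarn")) return true;
        if (master.startsWith("spark://") || master.startsWith("mesos://")
                || master.startsWith("local")) return true;
        // The k8s:// scheme was only recognised from Spark 2.3 onwards.
        return k8sSupported && master.startsWith("k8s://");
    }

    public static void main(String[] args) {
        System.out.println(acceptsMaster("k8s://https://1.2.3.4:6443", false)); // prints "false"
        System.out.println(acceptsMaster("k8s://https://1.2.3.4:6443", true));  // prints "true"
    }
}
```

Running `spark-submit --version` first, as the reporter later did, is the quickest way to confirm which binary the shell actually resolved.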
[jira] [Resolved] (SPARK-26928) Add driver CPU Time to the metrics system
[ https://issues.apache.org/jira/browse/SPARK-26928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-26928. Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23838 [https://github.com/apache/spark/pull/23838] > Add driver CPU Time to the metrics system > - > > Key: SPARK-26928 > URL: https://issues.apache.org/jira/browse/SPARK-26928 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Luca Canali >Assignee: Luca Canali >Priority: Minor > Fix For: 3.0.0 > > > This proposes to add instrumentation for the driver's JVM CPU time via the > Spark Dropwizard/Codahale metrics system. It follows directly from previous > work SPARK-25228 and shares similar motivations: it is intended as an > improvement to be used for Spark performance dashboards and monitoring > tools/instrumentation. > Additionally this proposes a new configuration parameter > `spark.metrics.cpu.time.driver.enabled` (default: false) that can be used to > turn on the new feature. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
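The metric added by this issue samples the driver JVM's CPU time. As a minimal illustration of the kind of value such a gauge exposes, here is a Java sketch using only the JDK's MXBeans; this is not the actual Spark patch (which wires the value into the Dropwizard/Codahale metrics registry), and `processCpuTimeNanos` is an invented helper:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class DriverCpuTime {
    // Returns the JVM process CPU time in nanoseconds, or -1 if the
    // platform MXBean does not expose it.
    public static long processCpuTimeNanos() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.OperatingSystemMXBean) {
            return ((com.sun.management.OperatingSystemMXBean) os).getProcessCpuTime();
        }
        return -1L;
    }

    public static void main(String[] args) {
        // On HotSpot-based JVMs this prints a positive nanosecond count.
        System.out.println(processCpuTimeNanos());
    }
}
```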
[jira] [Assigned] (SPARK-26928) Add driver CPU Time to the metrics system
[ https://issues.apache.org/jira/browse/SPARK-26928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-26928: -- Assignee: Luca Canali > Add driver CPU Time to the metrics system > - > > Key: SPARK-26928 > URL: https://issues.apache.org/jira/browse/SPARK-26928 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Luca Canali >Assignee: Luca Canali >Priority: Minor > > This proposes to add instrumentation for the driver's JVM CPU time via the > Spark Dropwizard/Codahale metrics system. It follows directly from previous > work SPARK-25228 and shares similar motivations: it is intended as an > improvement to be used for Spark performance dashboards and monitoring > tools/instrumentation. > Additionally this proposes a new configuration parameter > `spark.metrics.cpu.time.driver.enabled` (default: false) that can be used to > turn on the new feature. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27043) Nested schema pruning benchmark for ORC
[ https://issues.apache.org/jira/browse/SPARK-27043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27043. --- Resolution: Fixed Assignee: Liang-Chi Hsieh Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/23955 > Nested schema pruning benchmark for ORC > --- > > Key: SPARK-27043 > URL: https://issues.apache.org/jira/browse/SPARK-27043 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 3.0.0 > > > We have a benchmark of nested schema pruning, but only for Parquet. This adds > a similar benchmark for ORC, to be used with ORC's nested schema pruning. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27063) Spark on K8S Integration Tests timeouts are too short for some test clusters
[ https://issues.apache.org/jira/browse/SPARK-27063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784714#comment-16784714 ] Stavros Kontopoulos edited comment on SPARK-27063 at 3/5/19 5:53 PM: - Yes, another thing I noticed is that when the images are pulled this may take time and the tests will time out (if you don't use the local daemon to build stuff for whatever reason). Also in this [PR|https://github.com/apache/spark/pull/23514] I set the patience differently because some tests may run too fast, for better or worse. was (Author: skonto): Yes some other thing that I noticed is when the images are pulled this may take time and tests will expire. Also in this [PR|https://github.com/apache/spark/pull/23514] I set patience differently because some tests may run too fast for good or bad. > Spark on K8S Integration Tests timeouts are too short for some test clusters > > > Key: SPARK-27063 > URL: https://issues.apache.org/jira/browse/SPARK-27063 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Rob Vesse >Priority: Minor > > As noted during development for SPARK-26729 there are a couple of integration > test timeouts that are too short when running on slower clusters, e.g. > developers' laptops, small CI clusters, etc. > [~skonto] confirmed that he has also experienced this behaviour in the > discussion on [PR > 23846|https://github.com/apache/spark/pull/23846#discussion_r262564938] > We should raise the defaults of these timeouts as an initial step and, longer term, > consider making the timeouts themselves configurable -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
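The "patience" being tuned in the thread above is an eventually-style polling timeout. As a rough illustration in plain Java (not the ScalaTest `PatienceConfig` the integration tests actually use; `eventually` here is an invented helper), making the timeout a parameter rather than a hard-coded constant looks like:

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

public class Patience {
    // Polls cond until it holds or the configurable timeout elapses.
    public static boolean eventually(BooleanSupplier cond, Duration timeout, Duration interval)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (System.nanoTime() < deadline) {
            if (cond.getAsBoolean()) {
                return true;
            }
            Thread.sleep(interval.toMillis());
        }
        return cond.getAsBoolean(); // one last check at the deadline
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        // Slow condition: becomes true after ~50 ms, well within the 2 s budget.
        boolean ok = eventually(() -> System.currentTimeMillis() - start > 50,
                Duration.ofSeconds(2), Duration.ofMillis(10));
        System.out.println(ok); // prints "true"
    }
}
```

On a slow cluster (image pulls included), only the `timeout` argument needs to grow; the polling logic is unchanged, which is the argument for making the timeouts configurable rather than fixed defaults.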
[jira] [Updated] (SPARK-27066) SPIP: Identifiers for multi-catalog support
[ https://issues.apache.org/jira/browse/SPARK-27066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-27066: -- Attachment: SPIP_ Identifiers for multi-catalog Spark.pdf > SPIP: Identifiers for multi-catalog support > --- > > Key: SPARK-27066 > URL: https://issues.apache.org/jira/browse/SPARK-27066 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > Labels: SPIP > Attachments: SPIP_ Identifiers for multi-catalog Spark.pdf > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27067) SPIP: Catalog API for table metadata
Ryan Blue created SPARK-27067: - Summary: SPIP: Catalog API for table metadata Key: SPARK-27067 URL: https://issues.apache.org/jira/browse/SPARK-27067 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Ryan Blue -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27066) SPIP: Identifiers for multi-catalog support
[ https://issues.apache.org/jira/browse/SPARK-27066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved SPARK-27066. --- Resolution: Fixed I'm resolving this issue because the vote to adopt the proposal passed. I've added links to the google doc proposal (now view-only) and vote thread, and uploaded a copy of the proposal as a PDF. > SPIP: Identifiers for multi-catalog support > --- > > Key: SPARK-27066 > URL: https://issues.apache.org/jira/browse/SPARK-27066 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > Labels: SPIP > Attachments: SPIP_ Identifiers for multi-catalog Spark.pdf > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27066) SPIP: Identifiers for multi-catalog support
Ryan Blue created SPARK-27066: - Summary: SPIP: Identifiers for multi-catalog support Key: SPARK-27066 URL: https://issues.apache.org/jira/browse/SPARK-27066 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Ryan Blue -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27067) SPIP: Catalog API for table metadata
[ https://issues.apache.org/jira/browse/SPARK-27067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-27067: -- Description: Goal: Define a catalog API to create, alter, load, and drop tables > SPIP: Catalog API for table metadata > > > Key: SPARK-27067 > URL: https://issues.apache.org/jira/browse/SPARK-27067 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > Labels: SPIP > Attachments: SPIP_ Spark API for Table Metadata.pdf > > > Goal: Define a catalog API to create, alter, load, and drop tables -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27066) SPIP: Identifiers for multi-catalog support
[ https://issues.apache.org/jira/browse/SPARK-27066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-27066: -- Description: Goals: * Propose semantics for identifiers and a listing API to support multiple catalogs ** Support any namespace scheme used by an external catalog ** Avoid traversing namespaces via multiple listing calls from Spark * Outline migration from the current behavior to Spark with multiple catalogs > SPIP: Identifiers for multi-catalog support > --- > > Key: SPARK-27066 > URL: https://issues.apache.org/jira/browse/SPARK-27066 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > Labels: SPIP > Attachments: SPIP_ Identifiers for multi-catalog Spark.pdf > > > Goals: > * Propose semantics for identifiers and a listing API to support multiple > catalogs > ** Support any namespace scheme used by an external catalog > ** Avoid traversing namespaces via multiple listing calls from Spark > * Outline migration from the current behavior to Spark with multiple catalogs -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27062) Refresh Table command register table with table name only
[ https://issues.apache.org/jira/browse/SPARK-27062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] William Wong updated SPARK-27062: - Priority: Minor (was: Major) > Refresh Table command register table with table name only > - > > Key: SPARK-27062 > URL: https://issues.apache.org/jira/browse/SPARK-27062 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: William Wong >Priority: Minor > Labels: easyfix, pull-request-available > Original Estimate: 2h > Remaining Estimate: 2h > > If CatalogImpl.refreshTable() method is invoked against a cached table, this > method would first uncache corresponding query in the shared state cache > manager, and then cache it back to refresh the cache copy. > However, the table was recached with only 'table name'. The database name > will be missed. Therefore, if cached table is not on the default database, > the recreated cache may refer to a different table. For example, we may see > the cached table name in driver's storage page will be changed after table > refreshing. > > Here is related code on github for your reference. > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala] > > > {code:java} > override def refreshTable(tableName: String): Unit = { > val tableIdent = > sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) > val tableMetadata = > sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent) > val table = sparkSession.table(tableIdent) > if (tableMetadata.tableType == CatalogTableType.VIEW) { > // Temp or persistent views: refresh (or invalidate) any metadata/data > cached > // in the plan recursively. > table.queryExecution.analyzed.refresh() > } else { > // Non-temp tables: refresh the metadata cache. 
> sessionCatalog.refreshTable(tableIdent) > } > // If this table is cached as an InMemoryRelation, drop the original > // cached version and make the new version cached lazily. > if (isCached(table)) { > // Uncache the logicalPlan. > sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, > blocking = true) > // Cache it again. > sparkSession.sharedState.cacheManager.cacheQuery(table, > Some(tableIdent.table)) > } > } > {code} > > > In Spark SQL module, the database name is registered together with table name > when "CACHE TABLE" command was executed. > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/cache.scala] > > and CatalogImpl register cache with received table name. > {code:java} > override def cacheTable(tableName: String): Unit = { > sparkSession.sharedState.cacheManager.cacheQuery(sparkSession.table(tableName), > Some(tableName)) } > {code} > > Therefore, I would like to propose aligning the behavior. RefreshTable method > should reuse the received table name instead. > > {code:java} > sparkSession.sharedState.cacheManager.cacheQuery(table, > Some(tableIdent.table)) > {code} > to > {code:java} > sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableName)) > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
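The core of the bug described above is that `tableIdent.table` keeps only the last component of the parsed identifier, so re-caching under that name silently drops the database qualifier. A tiny hypothetical Java illustration of the difference (`lastComponent` is an invented helper; in Spark the parsing is done by the SQL parser, not string splitting):

```java
public class CacheNameDemo {
    // Mimics tableIdent.table: keeps only the last dot-separated component.
    static String lastComponent(String tableName) {
        String[] parts = tableName.split("\\.");
        return parts[parts.length - 1];
    }

    public static void main(String[] args) {
        String received = "db1.myTable";              // name passed to refreshTable
        System.out.println(lastComponent(received));  // prints "myTable": db1 is lost
        System.out.println(received);                 // proposed fix keeps "db1.myTable"
    }
}
```

This is why the proposal re-caches with the received `tableName` rather than `tableIdent.table`: for tables outside the default database the two strings differ.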
[jira] [Updated] (SPARK-27062) CatalogImpl.refreshTable should register query in cache with received tableName
[ https://issues.apache.org/jira/browse/SPARK-27062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] William Wong updated SPARK-27062: - Description: If CatalogImpl.refreshTable() method is invoked against a cached table, this method would first uncache corresponding query in the shared state cache manager, and then cache it back to refresh the cache copy. However, the table was recached with only 'table name'. The database name will be missed. Therefore, if cached table is not on the default database, the recreated cache may refer to a different table. For example, we may see the cached table name in driver's storage page will be changed after table refreshing. Here is related code on github for your reference. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala] {code:java} override def refreshTable(tableName: String): Unit = { val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) val tableMetadata = sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent) val table = sparkSession.table(tableIdent) if (tableMetadata.tableType == CatalogTableType.VIEW) { // Temp or persistent views: refresh (or invalidate) any metadata/data cached // in the plan recursively. table.queryExecution.analyzed.refresh() } else { // Non-temp tables: refresh the metadata cache. sessionCatalog.refreshTable(tableIdent) } // If this table is cached as an InMemoryRelation, drop the original // cached version and make the new version cached lazily. if (isCached(table)) { // Uncache the logicalPlan. sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, blocking = true) // Cache it again. sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) } } {code} Actually, CatalogImpl cache table with received table name, instead of only the table name. 
{code:java} override def cacheTable(tableName: String): Unit = { sparkSession.sharedState.cacheManager.cacheQuery(sparkSession.table(tableName), Some(tableName)) } {code} Therefore, I would like to propose aligning the behavior. RefreshTable method should reuse the received tableName. Here is the proposed changes. {code:java} sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) {code} to {code:java} sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableName)) {code} was: If CatalogImpl.refreshTable() method is invoked against a cached table, this method would first uncache corresponding query in the shared state cache manager, and then cache it back to refresh the cache copy. However, the table was recached with only 'table name'. The database name will be missed. Therefore, if cached table is not on the default database, the recreated cache may refer to a different table. For example, we may see the cached table name in driver's storage page will be changed after table refreshing. Here is related code on github for your reference. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala] {code:java} override def refreshTable(tableName: String): Unit = { val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) val tableMetadata = sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent) val table = sparkSession.table(tableIdent) if (tableMetadata.tableType == CatalogTableType.VIEW) { // Temp or persistent views: refresh (or invalidate) any metadata/data cached // in the plan recursively. table.queryExecution.analyzed.refresh() } else { // Non-temp tables: refresh the metadata cache. sessionCatalog.refreshTable(tableIdent) } // If this table is cached as an InMemoryRelation, drop the original // cached version and make the new version cached lazily. if (isCached(table)) { // Uncache the logicalPlan. 
sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, blocking = true) // Cache it again. sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) } } {code} In Spark SQL module, the database name is registered together with table name when "CACHE TABLE" command was executed. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/cache.scala] and CatalogImpl register cache with received table name. {code:java} override def cacheTable(tableName: String): Unit = { sparkSession.sharedState.cacheManager.cacheQuery(sparkSession.table(tableName), Some(tableName)) } {code} Therefore, I would like to propose aligning the behavior. RefreshTable method should reuse the received table name instead. {code:java} sparkSession.sharedState.cacheManager.cacheQuery(table,
[jira] [Assigned] (SPARK-27065) avoid more than one active task set managers for a stage
[ https://issues.apache.org/jira/browse/SPARK-27065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27065: Assignee: Apache Spark (was: Wenchen Fan) > avoid more than one active task set managers for a stage > > > Key: SPARK-27065 > URL: https://issues.apache.org/jira/browse/SPARK-27065 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.3.3, 2.4.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27065) avoid more than one active task set managers for a stage
[ https://issues.apache.org/jira/browse/SPARK-27065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27065: Assignee: Wenchen Fan (was: Apache Spark) > avoid more than one active task set managers for a stage > > > Key: SPARK-27065 > URL: https://issues.apache.org/jira/browse/SPARK-27065 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.3.3, 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27062) CatalogImpl.refreshTable should register query in cache with received tableName
[ https://issues.apache.org/jira/browse/SPARK-27062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] William Wong updated SPARK-27062: - Description: If _CatalogImpl.refreshTable()_ method is invoked against a cached table, this method would first uncache corresponding query in the shared state cache manager, and then cache it back to refresh the cache copy. However, the table was recached with only 'table name'. The database name will be missed. Therefore, if cached table is not on the default database, the recreated cache may refer to a different table. For example, we may see the cached table name in driver's storage page will be changed after table refreshing. Here is related code on github for your reference. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala] {code:java} override def refreshTable(tableName: String): Unit = { val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) val tableMetadata = sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent) val table = sparkSession.table(tableIdent) if (tableMetadata.tableType == CatalogTableType.VIEW) { // Temp or persistent views: refresh (or invalidate) any metadata/data cached // in the plan recursively. table.queryExecution.analyzed.refresh() } else { // Non-temp tables: refresh the metadata cache. sessionCatalog.refreshTable(tableIdent) } // If this table is cached as an InMemoryRelation, drop the original // cached version and make the new version cached lazily. if (isCached(table)) { // Uncache the logicalPlan. sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, blocking = true) // Cache it again. 
sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) } } {code} CatalogImpl caches the table with the received _tableName_, not with _tableIdent.table_: {code:java} override def cacheTable(tableName: String): Unit = { sparkSession.sharedState.cacheManager.cacheQuery(sparkSession.table(tableName), Some(tableName)) } {code} Therefore, I would like to propose aligning the behavior: the refreshTable method should reuse the received _tableName_. Here is the proposed change, from {code:java} sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) {code} to {code:java} sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableName)){code} was: If CatalogImpl.refreshTable() method is invoked against a cached table, this method would first uncache corresponding query in the shared state cache manager, and then cache it back to refresh the cache copy. However, the table was recached with only 'table name'. The database name will be missed. Therefore, if cached table is not on the default database, the recreated cache may refer to a different table. For example, we may see the cached table name in driver's storage page will be changed after table refreshing. Here is related code on github for your reference. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala] {code:java} override def refreshTable(tableName: String): Unit = { val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) val tableMetadata = sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent) val table = sparkSession.table(tableIdent) if (tableMetadata.tableType == CatalogTableType.VIEW) { // Temp or persistent views: refresh (or invalidate) any metadata/data cached // in the plan recursively. table.queryExecution.analyzed.refresh() } else { // Non-temp tables: refresh the metadata cache. 
sessionCatalog.refreshTable(tableIdent) } // If this table is cached as an InMemoryRelation, drop the original // cached version and make the new version cached lazily. if (isCached(table)) { // Uncache the logicalPlan. sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, blocking = true) // Cache it again. sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) } } {code} Actually, CatalogImpl cache table with received table name, instead of only the table name. {code:java} override def cacheTable(tableName: String): Unit = { sparkSession.sharedState.cacheManager.cacheQuery(sparkSession.table(tableName), Some(tableName)) } {code} Therefore, I would like to propose aligning the behavior. RefreshTable method should reuse the received tableName. Here is the proposed changes. {code:java} sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) {code} to {code:java} sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableName)) {code} > CatalogImpl.refreshTable should register query in cache with
[jira] [Commented] (SPARK-27065) avoid more than one active task set managers for a stage
[ https://issues.apache.org/jira/browse/SPARK-27065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784654#comment-16784654 ] Apache Spark commented on SPARK-27065: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/23927 > avoid more than one active task set managers for a stage > > > Key: SPARK-27065 > URL: https://issues.apache.org/jira/browse/SPARK-27065 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.3.3, 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27005) Design sketch: Accelerator-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784555#comment-16784555 ] Thomas Graves edited comment on SPARK-27005 at 3/5/19 3:40 PM: --- so we have both a google design doc and the comment above, can you consolidate into 1 place? the google doc might be easier to comment on. I added comments to the google doc was (Author: tgraves): so we have both a google design doc and the comment above, can you consolidate into 1 place? the google doc might be easier to comment on. > Design sketch: Accelerator-aware scheduling > --- > > Key: SPARK-27005 > URL: https://issues.apache.org/jira/browse/SPARK-27005 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Priority: Major > > This task is to outline a design sketch for the accelerator-aware scheduling > SPIP discussion. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27060) DDL Commands are accepting Keywords like create, drop as tableName
[ https://issues.apache.org/jira/browse/SPARK-27060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784575#comment-16784575 ] Sachin Ramachandra Setty edited comment on SPARK-27060 at 3/5/19 3:40 PM: -- I verified this issue with Spark 2.3.2 and Spark 2.4.0 versions was (Author: sachin1729): I verified this issue with 2.3.2 and 2.4.0 . > DDL Commands are accepting Keywords like create, drop as tableName > -- > > Key: SPARK-27060 > URL: https://issues.apache.org/jira/browse/SPARK-27060 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.2, 2.4.0 >Reporter: Sachin Ramachandra Setty >Priority: Minor > > Seems to be a compatibility issue compared to other components such as hive > and mySql. > DDL commands are successful even though the tableName is same as keyword. > Tested with columnNames as well and issue exists. > Whereas, Hive-Beeline is throwing ParseException and not accepting keywords > as tableName or columnName and mySql is accepting keywords only as columnName. 
> Spark-Behaviour : > Connected to: Spark SQL (version 2.3.2.0101) > CLI_DBMS_APPID > Beeline version 1.2.1.spark_2.3.2.0101 by Apache Hive > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table create(id int); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.255 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table drop(int int); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.257 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table drop; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.236 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table create; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.168 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table tab1(float float); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.111 seconds) > 0: jdbc:hive2://10.18.XXX:23040/default> create table double(double float); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.093 seconds) > Hive-Behaviour : > Connected to: Apache Hive (version 3.1.0) > Driver: Hive JDBC (version 3.1.0) > Transaction isolation: TRANSACTION_REPEATABLE_READ > Beeline version 3.1.0 by Apache Hive > 0: jdbc:hive2://10.18.XXX:21066/> create table create(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:13 > cannot recognize input near 'create' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18.XXX:21066/> create table drop(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:13 > cannot recognize input near 'drop' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18XXX:21066/> create table tab1(float float); > Error: Error while compiling statement: FAILED: ParseException line 1:18 > cannot recognize input near 'float' 'float' ')' in column name or constraint > (state=42000,code=4) > 0: jdbc:hive2://10.18XXX:21066/> drop table create(id int); > Error: Error while compiling 
statement: FAILED: ParseException line 1:11 > cannot recognize input near 'create' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18.XXX:21066/> drop table drop(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:11 > cannot recognize input near 'drop' '(' 'id' in table name > (state=42000,code=4) > mySql : > CREATE TABLE CREATE(ID integer); > Error: near "CREATE": syntax error > CREATE TABLE DROP(ID integer); > Error: near "DROP": syntax error > CREATE TABLE TAB1(FLOAT FLOAT); > Success -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
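For comparison, most SQL engines treat reserved words as invalid unquoted identifiers and require quoting to use them as table names. A quick illustration with Python's built-in sqlite3 module (not Spark's parser; shown only to demonstrate the parser behaviour the reporter expects):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# An unquoted reserved word as a table name is rejected with a parse error,
# e.g. 'near "create": syntax error'.
try:
    con.execute("CREATE TABLE create(id int)")
    unquoted_ok = True
except sqlite3.OperationalError:
    unquoted_ok = False

# The same identifier works when quoted, which is the usual escape hatch.
con.execute('CREATE TABLE "create"(id int)')
con.execute('DROP TABLE "create"')

print(unquoted_ok)  # False: reserved words need quoting
```

Accepting keywords unquoted, as Spark does here, is therefore the unusual behaviour among the engines compared in this report.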
[jira] [Comment Edited] (SPARK-27036) Even Broadcast thread is timed out, BroadCast Job is not aborted.
[ https://issues.apache.org/jira/browse/SPARK-27036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782840#comment-16782840 ] Sujith Chacko edited comment on SPARK-27036 at 3/5/19 3:49 PM: --- The problem area seems to be BroadcastExchangeExec in the driver, where a job is fired as part of a Future and the collected data is broadcast. The main problem is that the system submits the job and its stages/tasks through the DAGScheduler, whose scheduler thread schedules the respective events. In BroadcastExchangeExec, when the future times out, the corresponding exception is thrown, but the jobs/tasks scheduled by the DAGScheduler as part of the action called inside the future are not cancelled. I think we should cancel the respective job so it does not keep running in the background after the future timeout exception; this would terminate the job promptly when a TimeoutException happens, and also save the additional resources consumed after the timeout exception is thrown from the driver. I want to attempt to handle this issue; any comments or suggestions are welcome. cc [~b...@cloudera.com] [~hvanhovell] [~srowen] was (Author: s71955): It seems to be the problem area is BroadcastExchangeExec in driver where as part of Future a particular job will be fired and collected data will be broadcasted. 
The main problem is system will submit the job and its respective stage/tasks through DAGScheduler, where the scheduler thread will schedule the respective events , In BroadcastExchangeExec when future time out happens respective exception will thrown but the jobs/task which is scheduled by the DAGScheduler as part of the action called in future will not be cancelled, I think we shall cancel the respective job to avoid running the same in background even after Future time out exception, this can help to terminate the job promptly when TimeOutException happens, this will also save the additional resources getting utilized even after timeout exception thrown from driver. I want to give an attempt to handle this issue, Any comments suggestions are welcome. cc [~sro...@scient.com] [~b...@cloudera.com] [~hvanhovell] > Even Broadcast thread is timed out, BroadCast Job is not aborted. > - > > Key: SPARK-27036 > URL: https://issues.apache.org/jira/browse/SPARK-27036 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 >Reporter: Babulal >Priority: Minor > Attachments: image-2019-03-04-00-38-52-401.png, > image-2019-03-04-00-39-12-210.png, image-2019-03-04-00-39-38-779.png > > > During broadcast table job is execution if broadcast timeout > (spark.sql.broadcastTimeout) happens ,broadcast Job still continue till > completion whereas it should abort on broadcast timeout. > Exception is thrown in console but Spark Job is still continue. > > !image-2019-03-04-00-39-38-779.png! > !image-2019-03-04-00-39-12-210.png! > > wait for some time > !image-2019-03-04-00-38-52-401.png! > !image-2019-03-04-00-34-47-884.png! 
> > How to Reproduce Issue > Option1 using SQL:- > create Table t1(Big Table,1M Records) > val rdd1=spark.sparkContext.parallelize(1 to 100,100).map(x=> > ("name_"+x,x%3,x)) > val df=rdd1.toDF.selectExpr("_1 as name","_2 as age","_3 as sal","_1 as > c1","_1 as c2","_1 as c3","_1 as c4","_1 as c5","_1 as c6","_1 as c7","_1 as > c8","_1 as c9","_1 as c10","_1 as c11","_1 as c12","_1 as c13","_1 as > c14","_1 as c15","_1 as c16","_1 as c17","_1 as c18","_1 as c19","_1 as > c20","_1 as c21","_1 as c22","_1 as c23","_1 as c24","_1 as c25","_1 as > c26","_1 as c27","_1 as c28","_1 as c29","_1 as c30") > df.write.csv("D:/data/par1/t4"); > spark.sql("create table csv_2 using csv options('path'='D:/data/par1/t4')"); > create Table t2(Small Table,100K records) > val rdd1=spark.sparkContext.parallelize(1 to 10,100).map(x=> > ("name_"+x,x%3,x)) > val df=rdd1.toDF.selectExpr("_1 as name","_2 as age","_3 as sal","_1 as > c1","_1 as c2","_1 as c3","_1 as c4","_1 as c5","_1 as c6","_1 as c7","_1 as > c8","_1 as c9","_1 as c10","_1 as c11","_1 as c12","_1 as c13","_1 as > c14","_1 as c15","_1 as c16","_1 as c17","_1 as c18","_1 as c19","_1 as > c20","_1 as c21","_1 as c22","_1 as c23","_1 as c24","_1 as c25","_1 as > c26","_1 as c27","_1 as c28","_1 as c29","_1 as c30") > df.write.csv("D:/data/par1/t4"); > spark.sql("create table csv_2 using csv options('path'='D:/data/par1/t5')"); > spark.sql("set spark.sql.autoBroadcastJoinThreshold=73400320").show(false) > spark.sql("set spark.sql.broadcastTimeout=2").show(false) > Run Below Query > spark.sql("create table s using parquet as select t1.* from csv_2 as > t1,csv_1 as t2 where
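The behaviour reported here reflects a general property of futures: waiting on a result with a timeout raises an exception in the waiter, but does not stop work that is already running; the underlying job must be cancelled explicitly, which is what the comment above proposes. A minimal Python sketch of the same pattern (illustrative only; Spark uses Scala futures and DAGScheduler-level job cancellation, and the names below are not Spark's API):

```python
import concurrent.futures
import threading
import time

cancelled = threading.Event()

def long_job():
    # Simulates the broadcast-build job: runs to completion unless
    # explicitly told to stop.
    for _ in range(100):
        if cancelled.is_set():
            return "aborted"
        time.sleep(0.05)
    return "done"

with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
    fut = pool.submit(long_job)
    try:
        # Analogous to awaiting the broadcast with spark.sql.broadcastTimeout.
        fut.result(timeout=0.2)
    except concurrent.futures.TimeoutError:
        # The timeout fires here, but long_job is still running in the
        # background -- exactly the reported symptom.
        still_running = fut.running()
        # The proposed fix: explicitly cancel the underlying work.
        cancelled.set()
    result = fut.result()  # now finishes promptly

print(still_running, result)
```

Without the explicit `cancelled.set()`, the worker would keep consuming resources for the full five seconds even though the caller already gave up, which mirrors the job continuing after the broadcast timeout.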
[jira] [Updated] (SPARK-27062) Refresh Table command register table with table name only
[ https://issues.apache.org/jira/browse/SPARK-27062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] William Wong updated SPARK-27062: - Labels: easyfix pull-request-available (was: easyfix) > Refresh Table command register table with table name only > - > > Key: SPARK-27062 > URL: https://issues.apache.org/jira/browse/SPARK-27062 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: William Wong >Priority: Major > Labels: easyfix, pull-request-available > Original Estimate: 2h > Remaining Estimate: 2h > > If CatalogImpl.refreshTable() method is invoked against a cached table, this > method would first uncache corresponding query in the shared state cache > manager, and then cache it back to refresh the cache copy. > However, the table was recached with only 'table name'. The database name > will be missed. Therefore, if cached table is not on the default database, > the recreated cache may refer to a different table. For example, we may see > the cached table name in driver's storage page will be changed after table > refreshing. > > Here is related code on github for your reference. > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala] > > > {code:java} > override def refreshTable(tableName: String): Unit = { > val tableIdent = > sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) > val tableMetadata = > sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent) > val table = sparkSession.table(tableIdent) > if (tableMetadata.tableType == CatalogTableType.VIEW) { > // Temp or persistent views: refresh (or invalidate) any metadata/data > cached > // in the plan recursively. > table.queryExecution.analyzed.refresh() > } else { > // Non-temp tables: refresh the metadata cache. 
> sessionCatalog.refreshTable(tableIdent) > } > // If this table is cached as an InMemoryRelation, drop the original > // cached version and make the new version cached lazily. > if (isCached(table)) { > // Uncache the logicalPlan. > sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, > blocking = true) > // Cache it again. > sparkSession.sharedState.cacheManager.cacheQuery(table, > Some(tableIdent.table)) > } > } > {code} > > > In Spark SQL module, the database name is registered together with table name > when "CACHE TABLE" command was executed. > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/cache.scala] > > and CatalogImpl register cache with received table name. > {code:java} > override def cacheTable(tableName: String): Unit = { > sparkSession.sharedState.cacheManager.cacheQuery(sparkSession.table(tableName), > Some(tableName)) } > {code} > > Therefore, I would like to propose aligning the behavior. RefreshTable method > should reuse the received table name instead. > > {code:java} > sparkSession.sharedState.cacheManager.cacheQuery(table, > Some(tableIdent.table)) > {code} > to > {code:java} > sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableName)) > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
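The naming mismatch in the description is easy to model: if the refresh path re-registers the cache entry under only the unqualified table name, a table outside the default database comes back under the wrong key. A toy Python sketch (hypothetical names; the real CacheManager keys entries on the logical plan and keeps the name for display, so this only models the naming behaviour):

```python
# Toy model of cacheQuery's tableName argument. Not Spark's API.
cache = {}

def cache_query(plan, name):
    cache[name] = plan

def refresh_table_buggy(table_name):
    # parseTableIdentifier("db1.t") yields an identifier whose .table is "t".
    unqualified = table_name.split(".")[-1]
    cache.pop(table_name, None)                       # uncacheQuery
    cache_query(f"plan({table_name})", unqualified)   # Some(tableIdent.table)

def refresh_table_fixed(table_name):
    cache.pop(table_name, None)
    cache_query(f"plan({table_name})", table_name)    # Some(tableName)

cache_query("plan(db1.t)", "db1.t")  # CACHE TABLE registered a qualified name
refresh_table_buggy("db1.t")
buggy_keys = sorted(cache)
print(buggy_keys)   # database prefix lost, as seen on the storage page

cache.clear()
cache_query("plan(db1.t)", "db1.t")
refresh_table_fixed("db1.t")
fixed_keys = sorted(cache)
print(fixed_keys)   # qualified name preserved
```

The one-argument change proposed in the description corresponds to switching from `refresh_table_buggy` to `refresh_table_fixed` in this model.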
[jira] [Assigned] (SPARK-27062) Refresh Table command register table with table name only
[ https://issues.apache.org/jira/browse/SPARK-27062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27062: Assignee: (was: Apache Spark) > Refresh Table command register table with table name only > - > > Key: SPARK-27062 > URL: https://issues.apache.org/jira/browse/SPARK-27062 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: William Wong >Priority: Major > Labels: easyfix > Original Estimate: 2h > Remaining Estimate: 2h > > If CatalogImpl.refreshTable() method is invoked against a cached table, this > method would first uncache corresponding query in the shared state cache > manager, and then cache it back to refresh the cache copy. > However, the table was recached with only 'table name'. The database name > will be missed. Therefore, if cached table is not on the default database, > the recreated cache may refer to a different table. For example, we may see > the cached table name in driver's storage page will be changed after table > refreshing. > > Here is related code on github for your reference. > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala] > > > {code:java} > override def refreshTable(tableName: String): Unit = { > val tableIdent = > sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) > val tableMetadata = > sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent) > val table = sparkSession.table(tableIdent) > if (tableMetadata.tableType == CatalogTableType.VIEW) { > // Temp or persistent views: refresh (or invalidate) any metadata/data > cached > // in the plan recursively. > table.queryExecution.analyzed.refresh() > } else { > // Non-temp tables: refresh the metadata cache. > sessionCatalog.refreshTable(tableIdent) > } > // If this table is cached as an InMemoryRelation, drop the original > // cached version and make the new version cached lazily. 
> if (isCached(table)) { > // Uncache the logicalPlan. > sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, > blocking = true) > // Cache it again. > sparkSession.sharedState.cacheManager.cacheQuery(table, > Some(tableIdent.table)) > } > } > {code} > > > In Spark SQL module, the database name is registered together with table name > when "CACHE TABLE" command was executed. > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/cache.scala] > > and CatalogImpl register cache with received table name. > {code:java} > override def cacheTable(tableName: String): Unit = { > sparkSession.sharedState.cacheManager.cacheQuery(sparkSession.table(tableName), > Some(tableName)) } > {code} > > Therefore, I would like to propose aligning the behavior. RefreshTable method > should reuse the received table name instead. > > {code:java} > sparkSession.sharedState.cacheManager.cacheQuery(table, > Some(tableIdent.table)) > {code} > to > {code:java} > sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableName)) > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27062) Refresh Table command register table with table name only
[ https://issues.apache.org/jira/browse/SPARK-27062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27062: Assignee: Apache Spark > Refresh Table command register table with table name only > - > > Key: SPARK-27062 > URL: https://issues.apache.org/jira/browse/SPARK-27062 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: William Wong >Assignee: Apache Spark >Priority: Major > Labels: easyfix > Original Estimate: 2h > Remaining Estimate: 2h > > If CatalogImpl.refreshTable() method is invoked against a cached table, this > method would first uncache corresponding query in the shared state cache > manager, and then cache it back to refresh the cache copy. > However, the table was recached with only 'table name'. The database name > will be missed. Therefore, if cached table is not on the default database, > the recreated cache may refer to a different table. For example, we may see > the cached table name in driver's storage page will be changed after table > refreshing. > > Here is related code on github for your reference. > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala] > > > {code:java} > override def refreshTable(tableName: String): Unit = { > val tableIdent = > sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) > val tableMetadata = > sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent) > val table = sparkSession.table(tableIdent) > if (tableMetadata.tableType == CatalogTableType.VIEW) { > // Temp or persistent views: refresh (or invalidate) any metadata/data > cached > // in the plan recursively. > table.queryExecution.analyzed.refresh() > } else { > // Non-temp tables: refresh the metadata cache. > sessionCatalog.refreshTable(tableIdent) > } > // If this table is cached as an InMemoryRelation, drop the original > // cached version and make the new version cached lazily. 
> if (isCached(table)) { > // Uncache the logicalPlan. > sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, > blocking = true) > // Cache it again. > sparkSession.sharedState.cacheManager.cacheQuery(table, > Some(tableIdent.table)) > } > } > {code} > > > In Spark SQL module, the database name is registered together with table name > when "CACHE TABLE" command was executed. > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/cache.scala] > > and CatalogImpl register cache with received table name. > {code:java} > override def cacheTable(tableName: String): Unit = { > sparkSession.sharedState.cacheManager.cacheQuery(sparkSession.table(tableName), > Some(tableName)) } > {code} > > Therefore, I would like to propose aligning the behavior. RefreshTable method > should reuse the received table name instead. > > {code:java} > sparkSession.sharedState.cacheManager.cacheQuery(table, > Some(tableIdent.table)) > {code} > to > {code:java} > sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableName)) > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26602) Insert into table fails after querying the UDF which is loaded with wrong hdfs path
[ https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784611#comment-16784611 ] Ajith S commented on SPARK-26602: - # I have a question about this issue in the thrift-server case. If an admin does an add jar with a non-existing jar (perhaps a human error), it will cause all ongoing beeline sessions to fail (even a query where the jar is not needed at all), and the only way to recover is a restart of the thrift-server. # As you said, "If a user adds something to the classpath, it matters to the whole classpath. If it's missing, I think it's surprising to ignore that fact" - but unless the user refers to the jar, is it OK to fail all of their operations (just like JVM behaviour)? Please correct me if I am wrong. cc [~srowen] > Insert into table fails after querying the UDF which is loaded with wrong > hdfs path > --- > > Key: SPARK-26602 > URL: https://issues.apache.org/jira/browse/SPARK-26602 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Haripriya >Priority: Major > Attachments: beforeFixUdf.txt > > > In sql, > 1.Query the existing udf(say myFunc1) > 2. 
create and select the udf registered with incorrect path (say myFunc2) > 3.Now again query the existing udf in the same session - Wil throw exception > stating that couldn't read resource of myFunc2's path > 4.Even the basic operations like insert and select will fail giving the same > error > Result: > java.lang.RuntimeException: Failed to read external resource > hdfs:///tmp/hari_notexists1/two_udfs.jar > at > org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288) > at > org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149) > at > org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696) > at > org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841) > at > org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional 
commands, e-mail: issues-h...@spark.apache.org
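One way to avoid the failure mode described in the report is to validate a resource before it is recorded in session state, so a bad ADD JAR fails only that one command instead of poisoning every later operation in the session. A minimal sketch of that idea (hypothetical class and method names, not Spark's HiveSessionResourceLoader API):

```python
import os
import tempfile

class SessionResources:
    """Toy session resource loader. Illustrative only, not Spark's API."""

    def __init__(self):
        self.jars = []

    def add_jar(self, path):
        # Validate before mutating session state: a missing jar fails fast
        # here and is never recorded, so later unrelated queries still work.
        if not os.path.exists(path):
            raise FileNotFoundError(f"resource not found: {path}")
        self.jars.append(path)

session = SessionResources()

try:
    session.add_jar("/tmp/does_not_exist_hypothetical.jar")  # hypothetical path
except FileNotFoundError:
    pass
jars_after_failure = list(session.jars)
print(jars_after_failure)  # the bad path was never recorded

with tempfile.NamedTemporaryFile(suffix=".jar") as f:
    session.add_jar(f.name)
print(len(session.jars))
```

Whether failing only the offending command (as above) or failing everything (the current JVM-like behaviour) is preferable is exactly the question raised in the comment.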
[jira] [Assigned] (SPARK-27063) Spark on K8S Integration Tests timeouts are too short for some test clusters
[ https://issues.apache.org/jira/browse/SPARK-27063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27063: Assignee: (was: Apache Spark) > Spark on K8S Integration Tests timeouts are too short for some test clusters > > > Key: SPARK-27063 > URL: https://issues.apache.org/jira/browse/SPARK-27063 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Rob Vesse >Priority: Minor > > As noted during development for SPARK-26729 there are a couple of integration > test timeouts that are too short when running on slower clusters e.g. > developers' laptops, small CI clusters, etc. > [~skonto] confirmed that he has also experienced this behaviour in the > discussion on PR [PR > 23846|https://github.com/apache/spark/pull/23846#discussion_r262564938] > We should raise the defaults of these timeouts as an initial step and, longer term, > consider making the timeouts themselves configurable -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27063) Spark on K8S Integration Tests timeouts are too short for some test clusters
[ https://issues.apache.org/jira/browse/SPARK-27063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27063: Assignee: Apache Spark > Spark on K8S Integration Tests timeouts are too short for some test clusters > > > Key: SPARK-27063 > URL: https://issues.apache.org/jira/browse/SPARK-27063 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Rob Vesse >Assignee: Apache Spark >Priority: Minor > > As noted during development for SPARK-26729 there are a couple of integration > test timeouts that are too short when running on slower clusters e.g. > developers' laptops, small CI clusters, etc. > [~skonto] confirmed that he has also experienced this behaviour in the > discussion on PR [PR > 23846|https://github.com/apache/spark/pull/23846#discussion_r262564938] > We should raise the defaults of these timeouts as an initial step and, longer term, > consider making the timeouts themselves configurable -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27064) create StreamingWrite at the begining of streaming execution
[ https://issues.apache.org/jira/browse/SPARK-27064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27064: Assignee: Apache Spark (was: Wenchen Fan) > create StreamingWrite at the begining of streaming execution > > > Key: SPARK-27064 > URL: https://issues.apache.org/jira/browse/SPARK-27064 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27064) create StreamingWrite at the begining of streaming execution
[ https://issues.apache.org/jira/browse/SPARK-27064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27064: Assignee: Wenchen Fan (was: Apache Spark) > create StreamingWrite at the begining of streaming execution > > > Key: SPARK-27064 > URL: https://issues.apache.org/jira/browse/SPARK-27064 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27060) DDL Commands are accepting Keywords like create, drop as tableName
[ https://issues.apache.org/jira/browse/SPARK-27060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784575#comment-16784575 ] Sachin Ramachandra Setty commented on SPARK-27060: -- I verified this issue with 2.3.2 and 2.4.0 . > DDL Commands are accepting Keywords like create, drop as tableName > -- > > Key: SPARK-27060 > URL: https://issues.apache.org/jira/browse/SPARK-27060 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.2, 2.4.0 >Reporter: Sachin Ramachandra Setty >Priority: Minor > > Seems to be a compatibility issue compared to other components such as hive > and mySql. > DDL commands are successful even though the tableName is same as keyword. > Tested with columnNames as well and issue exists. > Whereas, Hive-Beeline is throwing ParseException and not accepting keywords > as tableName or columnName and mySql is accepting keywords only as columnName. > Spark-Behaviour : > Connected to: Spark SQL (version 2.3.2.0101) > CLI_DBMS_APPID > Beeline version 1.2.1.spark_2.3.2.0101 by Apache Hive > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table create(id int); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.255 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table drop(int int); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.257 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table drop; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.236 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table create; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.168 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table tab1(float float); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.111 seconds) > 0: jdbc:hive2://10.18.XXX:23040/default> create table double(double float); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.093 seconds) > Hive-Behaviour : > Connected to: Apache Hive 
(version 3.1.0) > Driver: Hive JDBC (version 3.1.0) > Transaction isolation: TRANSACTION_REPEATABLE_READ > Beeline version 3.1.0 by Apache Hive > 0: jdbc:hive2://10.18.XXX:21066/> create table create(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:13 > cannot recognize input near 'create' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18.XXX:21066/> create table drop(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:13 > cannot recognize input near 'drop' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18XXX:21066/> create table tab1(float float); > Error: Error while compiling statement: FAILED: ParseException line 1:18 > cannot recognize input near 'float' 'float' ')' in column name or constraint > (state=42000,code=4) > 0: jdbc:hive2://10.18XXX:21066/> drop table create(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:11 > cannot recognize input near 'create' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18.XXX:21066/> drop table drop(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:11 > cannot recognize input near 'drop' '(' 'id' in table name > (state=42000,code=4) > mySql : > CREATE TABLE CREATE(ID integer); > Error: near "CREATE": syntax error > CREATE TABLE DROP(ID integer); > Error: near "DROP": syntax error > CREATE TABLE TAB1(FLOAT FLOAT); > Success -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23986) CompileException when using too many avg aggregation after joining
[ https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784482#comment-16784482 ] Pedro Fernandes edited comment on SPARK-23986 at 3/5/19 3:38 PM: - -Guys, Is there a workaround for the folks that can't upgrade Spark version? Thanks.- Here's my workaround for, say, 10 aggregation operations: # dataframe1 = aggregations 1 to 5 # dataframe2 = aggregations 6 to 10 # dataframe1.join(dataframe2) was (Author: pedromorfeu): Guys, Is there a workaround for the folks that can't upgrade Spark version? Thanks. > CompileException when using too many avg aggregation after joining > -- > > Key: SPARK-23986 > URL: https://issues.apache.org/jira/browse/SPARK-23986 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Michel Davit >Assignee: Marco Gaido >Priority: Major > Fix For: 2.3.1, 2.4.0 > > Attachments: spark-generated.java > > > Considering the following code: > {code:java} > val df1: DataFrame = sparkSession.sparkContext > .makeRDD(Seq((0, 1, 2, 3, 4, 5, 6))) > .toDF("key", "col1", "col2", "col3", "col4", "col5", "col6") > val df2: DataFrame = sparkSession.sparkContext > .makeRDD(Seq((0, "val1", "val2"))) > .toDF("key", "dummy1", "dummy2") > val agg = df1 > .join(df2, df1("key") === df2("key"), "leftouter") > .groupBy(df1("key")) > .agg( > avg("col2").as("avg2"), > avg("col3").as("avg3"), > avg("col4").as("avg4"), > avg("col1").as("avg1"), > avg("col5").as("avg5"), > avg("col6").as("avg6") > ) > val head = agg.take(1) > {code} > This logs the following exception: > {code:java} > ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 467, Column 28: Redefinition of parameter "agg_expr_11" > {code} > I am not a spark expert but after investigation, I realized that the > generated {{doConsume}} method is responsible of the exception. 
> Indeed, {{avg}} calls several times > {{org.apache.spark.sql.execution.CodegenSupport.constructDoConsumeFunction}}. > The 1st time with the 'avg' Expr and a second time for the base aggregation > Expr (count and sum). > The problem comes from the generation of parameters in CodeGenerator: > {code:java} > /** >* Returns a term name that is unique within this instance of a > `CodegenContext`. >*/ > def freshName(name: String): String = synchronized { > val fullName = if (freshNamePrefix == "") { > name > } else { > s"${freshNamePrefix}_$name" > } > if (freshNameIds.contains(fullName)) { > val id = freshNameIds(fullName) > freshNameIds(fullName) = id + 1 > s"$fullName$id" > } else { > freshNameIds += fullName -> 1 > fullName > } > } > {code} > The {{freshNameIds}} already contains {{agg_expr_[1..6]}} from the 1st call. > The second call is made with {{agg_expr_[1..12]}} and generates the > following names: > {{agg_expr_[11|21|31|41|51|61|11|12]}}. We then have a parameter name > conflicts in the generated code: {{agg_expr_11.}} > Appending the 'id' in s"$fullName$id" to generate unique term name is source > of conflict. Maybe simply using undersoce can solve this issue : > $fullName_$id" -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
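The name collision Michel describes can be reproduced with a minimal Python model of the quoted {{freshName}} counter logic — a sketch of the algorithm only, not Spark's actual CodegenContext:

```python
def make_fresh_name(fresh_name_ids):
    # Mirrors the quoted freshName logic: a name seen before gets the
    # current counter appended; a new name is returned as-is.
    def fresh_name(name):
        if name in fresh_name_ids:
            i = fresh_name_ids[name]
            fresh_name_ids[name] = i + 1
            return f"{name}{i}"
        fresh_name_ids[name] = 1
        return name
    return fresh_name

fresh = make_fresh_name({})

# First call hands out agg_expr_1 .. agg_expr_6 unchanged.
first = [fresh(f"agg_expr_{i}") for i in range(1, 7)]

# Second call asks for agg_expr_1 .. agg_expr_12. agg_expr_1 was already
# seen, so it becomes "agg_expr_11" -- colliding with the brand-new
# agg_expr_11 later in the same batch.
second = [fresh(f"agg_expr_{i}") for i in range(1, 13)]

print(second)
assert len(set(second)) < len(second)  # duplicate => Redefinition of parameter
```

Using a separator when appending the counter (e.g. `f"{name}_{i}"`, the underscore fix suggested above) keeps the two name families disjoint and avoids the collision.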
[jira] [Created] (SPARK-27062) Refresh Table command register table with table name only
William Wong created SPARK-27062: Summary: Refresh Table command register table with table name only Key: SPARK-27062 URL: https://issues.apache.org/jira/browse/SPARK-27062 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.2 Reporter: William Wong If CatalogImpl.refreshTable() method is invoked against a cached table, this method would first uncache corresponding query in the shared state cache manager, and then cache it back to refresh the cache copy. However, the table was recached with only 'table name'. The database name will be missed. Therefore, if cached table is not on the default database, the recreated cache may refer to a different table. For example, we may see the cached table name in driver's storage page will be changed after table refreshing. Here is related code on github for your reference. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala] {code:java} override def refreshTable(tableName: String): Unit = { val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) val tableMetadata = sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent) val table = sparkSession.table(tableIdent) if (tableMetadata.tableType == CatalogTableType.VIEW) { // Temp or persistent views: refresh (or invalidate) any metadata/data cached // in the plan recursively. table.queryExecution.analyzed.refresh() } else { // Non-temp tables: refresh the metadata cache. sessionCatalog.refreshTable(tableIdent) } // If this table is cached as an InMemoryRelation, drop the original // cached version and make the new version cached lazily. if (isCached(table)) { // Uncache the logicalPlan. sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, blocking = true) // Cache it again. 
sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) } } {code} In Spark SQL module, the database name is registered together with table name when "CACHE TABLE" command was executed. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/cache.scala] Therefore, I would like to propose aligning the behavior. Full table name should also be used in RefreshTable case. We should change the following line in CatalogImpl.refreshTable from {code:java} sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) {code} to {code:java} sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.quotedString)) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
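A toy model (plain Python, hypothetical names — not Spark's CacheManager) of why re-caching under `tableIdent.table` drops the database qualifier, while passing a qualified name would keep it:

```python
# Stand-in for the cache manager's bookkeeping: entries are keyed by
# whatever display name the caller passes in.
cache = {}

def cache_query(plan, name):
    cache[name] = plan

def uncache_query(name):
    cache.pop(name, None)

# CACHE TABLE registers the name exactly as received: "mydb.tbl".
cache_query("plan-v1", "mydb.tbl")

# refreshTable uncaches, then re-caches using only tableIdent.table
# ("tbl"), so the database qualifier is lost from the displayed name.
uncache_query("mydb.tbl")
cache_query("plan-v2", "tbl")

print(sorted(cache))
```

With the proposed change, the re-cache step would pass the qualified name (`tableIdent.quotedString`), and the entry would stay registered as "mydb.tbl".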
[jira] [Created] (SPARK-27064) create StreamingWrite at the beginning of streaming execution
Wenchen Fan created SPARK-27064: --- Summary: create StreamingWrite at the beginning of streaming execution Key: SPARK-27064 URL: https://issues.apache.org/jira/browse/SPARK-27064 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27065) avoid more than one active task set managers for a stage
Wenchen Fan created SPARK-27065: --- Summary: avoid more than one active task set managers for a stage Key: SPARK-27065 URL: https://issues.apache.org/jira/browse/SPARK-27065 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 2.4.0, 2.3.3 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26727) CREATE OR REPLACE VIEW query fails with TableAlreadyExistsException
[ https://issues.apache.org/jira/browse/SPARK-26727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784666#comment-16784666 ] Ajith S commented on SPARK-26727: - [~rigolaszlo] i see that from stacktrace ThriftHiveMetastore$Client is used which is a sync client for metrastore. Can you explain how you find that drop command is async.? > CREATE OR REPLACE VIEW query fails with TableAlreadyExistsException > --- > > Key: SPARK-26727 > URL: https://issues.apache.org/jira/browse/SPARK-26727 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Srinivas Yarra >Priority: Major > > We experienced that sometimes the Hive query "CREATE OR REPLACE VIEW name> AS SELECT FROM " fails with the following exception: > {code:java} > // code placeholder > org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or > view '' already exists in database 'default'; at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:314) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:165) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195) at > org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195) at > org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at 
org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364) at > org.apache.spark.sql.Dataset.(Dataset.scala:195) at > org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:80) at > org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642) ... 49 elided > {code} > {code} > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res1: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res2: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res3: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res4: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res5: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res6: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res7: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res8: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res9: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res10: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res11: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") > org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or > view 'testsparkreplace' already exists in database 'default'; at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:246) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:236) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:236) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) > at > org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:236) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:319) > at >
[jira] [Updated] (SPARK-27060) DDL Commands are accepting Keywords like create, drop as tableName
[ https://issues.apache.org/jira/browse/SPARK-27060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-27060: -- Target Version/s: (was: 2.4.0) Priority: Minor (was: Major) Fix Version/s: (was: 2.3.2) (was: 2.4.0) Don't set Fix or Target Version. This isn't my area, but I agree it seems surprising if you can create a table called "CREATE". Please post your Spark reproduction and version though. > DDL Commands are accepting Keywords like create, drop as tableName > -- > > Key: SPARK-27060 > URL: https://issues.apache.org/jira/browse/SPARK-27060 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.2, 2.4.0 >Reporter: Sachin Ramachandra Setty >Priority: Minor > > Seems to be a compatibility issue compared to other components such as hive > and mySql. > DDL commands are successful even though the tableName is same as keyword. > Tested with columnNames as well and issue exists. > Whereas, Hive-Beeline is throwing ParseException and not accepting keywords > as tableName or columnName and mySql is accepting keywords only as columnName. 
> Spark-Behaviour : > Connected to: Spark SQL (version 2.3.2.0101) > CLI_DBMS_APPID > Beeline version 1.2.1.spark_2.3.2.0101 by Apache Hive > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table create(id int); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.255 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table drop(int int); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.257 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table drop; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.236 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table create; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.168 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table tab1(float float); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.111 seconds) > 0: jdbc:hive2://10.18.XXX:23040/default> create table double(double float); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.093 seconds) > Hive-Behaviour : > Connected to: Apache Hive (version 3.1.0) > Driver: Hive JDBC (version 3.1.0) > Transaction isolation: TRANSACTION_REPEATABLE_READ > Beeline version 3.1.0 by Apache Hive > 0: jdbc:hive2://10.18.XXX:21066/> create table create(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:13 > cannot recognize input near 'create' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18.XXX:21066/> create table drop(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:13 > cannot recognize input near 'drop' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18XXX:21066/> create table tab1(float float); > Error: Error while compiling statement: FAILED: ParseException line 1:18 > cannot recognize input near 'float' 'float' ')' in column name or constraint > (state=42000,code=4) > 0: jdbc:hive2://10.18XXX:21066/> drop table create(id int); > Error: Error while compiling 
statement: FAILED: ParseException line 1:11 > cannot recognize input near 'create' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18.XXX:21066/> drop table drop(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:11 > cannot recognize input near 'drop' '(' 'id' in table name > (state=42000,code=4) > mySql : > CREATE TABLE CREATE(ID integer); > Error: near "CREATE": syntax error > CREATE TABLE DROP(ID integer); > Error: near "DROP": syntax error > CREATE TABLE TAB1(FLOAT FLOAT); > Success -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23986) CompileException when using too many avg aggregation after joining
[ https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784482#comment-16784482 ] Pedro Fernandes edited comment on SPARK-23986 at 3/5/19 3:38 PM: - ~Guys, Is there a workaround for the folks that can't upgrade Spark version? Thanks.~ Here's my workaround for, say, 10 aggregation operations: # dataframe1 = aggregations 1 to 5 # dataframe2 = aggregations 6 to 10 # dataframe1.join(dataframe2) was (Author: pedromorfeu): -Guys, Is there a workaround for the folks that can't upgrade Spark version? Thanks.- Here's my workaround for, say, 10 aggregation operations: # dataframe1 = aggregations 1 to 5 # dataframe2 = aggregations 6 to 10 # dataframe1.join(dataframe2) > CompileException when using too many avg aggregation after joining > -- > > Key: SPARK-23986 > URL: https://issues.apache.org/jira/browse/SPARK-23986 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Michel Davit >Assignee: Marco Gaido >Priority: Major > Fix For: 2.3.1, 2.4.0 > > Attachments: spark-generated.java > > > Considering the following code: > {code:java} > val df1: DataFrame = sparkSession.sparkContext > .makeRDD(Seq((0, 1, 2, 3, 4, 5, 6))) > .toDF("key", "col1", "col2", "col3", "col4", "col5", "col6") > val df2: DataFrame = sparkSession.sparkContext > .makeRDD(Seq((0, "val1", "val2"))) > .toDF("key", "dummy1", "dummy2") > val agg = df1 > .join(df2, df1("key") === df2("key"), "leftouter") > .groupBy(df1("key")) > .agg( > avg("col2").as("avg2"), > avg("col3").as("avg3"), > avg("col4").as("avg4"), > avg("col1").as("avg1"), > avg("col5").as("avg5"), > avg("col6").as("avg6") > ) > val head = agg.take(1) > {code} > This logs the following exception: > {code:java} > ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 467, Column 28: Redefinition of parameter "agg_expr_11" > {code} > I am not a spark expert but after investigation, I 
realized that the > generated {{doConsume}} method is responsible of the exception. > Indeed, {{avg}} calls several times > {{org.apache.spark.sql.execution.CodegenSupport.constructDoConsumeFunction}}. > The 1st time with the 'avg' Expr and a second time for the base aggregation > Expr (count and sum). > The problem comes from the generation of parameters in CodeGenerator: > {code:java} > /** >* Returns a term name that is unique within this instance of a > `CodegenContext`. >*/ > def freshName(name: String): String = synchronized { > val fullName = if (freshNamePrefix == "") { > name > } else { > s"${freshNamePrefix}_$name" > } > if (freshNameIds.contains(fullName)) { > val id = freshNameIds(fullName) > freshNameIds(fullName) = id + 1 > s"$fullName$id" > } else { > freshNameIds += fullName -> 1 > fullName > } > } > {code} > The {{freshNameIds}} already contains {{agg_expr_[1..6]}} from the 1st call. > The second call is made with {{agg_expr_[1..12]}} and generates the > following names: > {{agg_expr_[11|21|31|41|51|61|11|12]}}. We then have a parameter name > conflicts in the generated code: {{agg_expr_11.}} > Appending the 'id' in s"$fullName$id" to generate unique term name is source > of conflict. Maybe simply using undersoce can solve this issue : > $fullName_$id" -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27061) Expose 4040 port on driver service to access logs using service
Chandu Kavar created SPARK-27061: Summary: Expose 4040 port on driver service to access logs using service Key: SPARK-27061 URL: https://issues.apache.org/jira/browse/SPARK-27061 Project: Spark Issue Type: Task Components: Kubernetes Affects Versions: 2.4.0 Reporter: Chandu Kavar Currently, we can access the driver logs using {{kubectl port-forward 4040:4040}} as mentioned in [https://spark.apache.org/docs/latest/running-on-kubernetes.html#accessing-driver-ui] We have users who submit spark jobs to Kubernetes, but they don't have access to the cluster, so they can't use the {{kubectl port-forward}} command. If we expose port 4040 on the driver service, we can easily relay these logs to the UI using the driver service and an Nginx reverse proxy. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
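A sketch of the kind of Kubernetes Service the ticket asks for — the metadata name here is an assumption, and the selector must match the labels Spark actually sets on the driver pod in your deployment:

```yaml
# Hypothetical Service exposing the driver UI port (4040) inside the
# cluster; an Nginx reverse proxy or Ingress can then route to it so
# users do not need kubectl access.
apiVersion: v1
kind: Service
metadata:
  name: spark-driver-ui        # assumed name
spec:
  selector:
    spark-role: driver         # must match the driver pod's labels
  ports:
    - name: ui
      port: 4040
      targetPort: 4040
```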
[jira] [Updated] (SPARK-27062) Refresh Table command register table with table name only
[ https://issues.apache.org/jira/browse/SPARK-27062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] William Wong updated SPARK-27062: - Description: If CatalogImpl.refreshTable() method is invoked against a cached table, this method would first uncache corresponding query in the shared state cache manager, and then cache it back to refresh the cache copy. However, the table was recached with only 'table name'. The database name will be missed. Therefore, if cached table is not on the default database, the recreated cache may refer to a different table. For example, we may see the cached table name in driver's storage page will be changed after table refreshing. Here is related code on github for your reference. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala] {code:java} override def refreshTable(tableName: String): Unit = { val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) val tableMetadata = sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent) val table = sparkSession.table(tableIdent) if (tableMetadata.tableType == CatalogTableType.VIEW) { // Temp or persistent views: refresh (or invalidate) any metadata/data cached // in the plan recursively. table.queryExecution.analyzed.refresh() } else { // Non-temp tables: refresh the metadata cache. sessionCatalog.refreshTable(tableIdent) } // If this table is cached as an InMemoryRelation, drop the original // cached version and make the new version cached lazily. if (isCached(table)) { // Uncache the logicalPlan. sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, blocking = true) // Cache it again. sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) } } {code} In Spark SQL module, the database name is registered together with table name when "CACHE TABLE" command was executed. 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/cache.scala] and CatalogImpl register cache with received table name. {code:java} override def cacheTable(tableName: String): Unit = { sparkSession.sharedState.cacheManager.cacheQuery(sparkSession.table(tableName), Some(tableName)) } {code} Therefore, I would like to propose aligning the behavior. RefreshTable method should reuse the received table name instead. {code:java} sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) {code} to {code:java} sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableName)) {code} was: If CatalogImpl.refreshTable() method is invoked against a cached table, this method would first uncache corresponding query in the shared state cache manager, and then cache it back to refresh the cache copy. However, the table was recached with only 'table name'. The database name will be missed. Therefore, if cached table is not on the default database, the recreated cache may refer to a different table. For example, we may see the cached table name in driver's storage page will be changed after table refreshing. Here is related code on github for your reference. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala] {code:java} override def refreshTable(tableName: String): Unit = { val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) val tableMetadata = sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent) val table = sparkSession.table(tableIdent) if (tableMetadata.tableType == CatalogTableType.VIEW) { // Temp or persistent views: refresh (or invalidate) any metadata/data cached // in the plan recursively. table.queryExecution.analyzed.refresh() } else { // Non-temp tables: refresh the metadata cache. 
sessionCatalog.refreshTable(tableIdent) } // If this table is cached as an InMemoryRelation, drop the original // cached version and make the new version cached lazily. if (isCached(table)) { // Uncache the logicalPlan. sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true, blocking = true) // Cache it again. sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table)) } } {code} In Spark SQL module, the database name is registered together with table name when "CACHE TABLE" command was executed. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/cache.scala] Therefore, I would like to propose aligning the behavior. Full table name should also be used in RefreshTable case. We should change the following line in CatalogImpl.refreshTable from {code:java}
[jira] [Comment Edited] (SPARK-26602) Insert into table fails after querying the UDF which is loaded with wrong hdfs path
[ https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784611#comment-16784611 ] Ajith S edited comment on SPARK-26602 at 3/5/19 4:15 PM: - # I have a question about this issue in thrift-server case. If admin does a add jar with a non-existing jar (may be a human error), it will cause all the ongoing beeline sessions to fail ( even a query where jar is not needed at all). and only way to recover is restart of thrift-server # As you said, "If a user adds something to the classpath, it matters to the whole classpath. If it's missing, I think it's surprising to ignore that fact" - but unless the user refers to the jar, is it ok to fail all of his operations.? (just like JVM behaviour, we get classnotfoundexception when the missing class is actually referred, until then JVM is happily running) Please correct me if i am wrong cc [~srowen] was (Author: ajithshetty): # I have a question about this issue in thrift-server case. If admin does a add jar with a non-existing jar (may be a human error), it will cause all the ongoing beeline sessions to fail ( even a query where jar is not needed at all). and only way to recover is restart of thrift-server # As you said, "If a user adds something to the classpath, it matters to the whole classpath. If it's missing, I think it's surprising to ignore that fact" - but unless the user refers to the jar, is it ok to fail all of his operations.? (just like JVM behaviour) Please correct me if i am wrong cc [~srowen] > Insert into table fails after querying the UDF which is loaded with wrong > hdfs path > --- > > Key: SPARK-26602 > URL: https://issues.apache.org/jira/browse/SPARK-26602 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Haripriya >Priority: Major > Attachments: beforeFixUdf.txt > > > In sql, > 1.Query the existing udf(say myFunc1) > 2. 
create and select the udf registered with incorrect path (say myFunc2) > 3.Now again query the existing udf in the same session - Wil throw exception > stating that couldn't read resource of myFunc2's path > 4.Even the basic operations like insert and select will fail giving the same > error > Result: > java.lang.RuntimeException: Failed to read external resource > hdfs:///tmp/hari_notexists1/two_udfs.jar > at > org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288) > at > org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149) > at > org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696) > at > org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841) > at > org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional 
commands, e-mail: issues-h...@spark.apache.org
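The "fail only when actually referenced" behaviour Ajith compares to the JVM can be illustrated with Python's import machinery as a rough analogy — this is not Spark's or the JVM's classloader, just the same lazy-resolution idea:

```python
import sys
import importlib

# Adding a nonexistent directory to the search path succeeds silently,
# just as the JVM tolerates a bad classpath entry until a class is needed.
sys.path.append("/tmp/does_not_exist_12345")
path_added = "/tmp/does_not_exist_12345" in sys.path

# Unrelated work keeps running; the failure surfaces only on actual lookup.
try:
    importlib.import_module("module_that_is_not_there_12345")
    resolved = True
except ModuleNotFoundError:
    resolved = False

print(path_added, resolved)
```

Under that model, a bad ADD JAR would poison only the statements that actually touch the missing jar, rather than every subsequent operation in the session.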
[jira] [Created] (SPARK-27063) Spark on K8S Integration Tests timeouts are too short for some test clusters
Rob Vesse created SPARK-27063: - Summary: Spark on K8S Integration Tests timeouts are too short for some test clusters Key: SPARK-27063 URL: https://issues.apache.org/jira/browse/SPARK-27063 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 2.4.0 Reporter: Rob Vesse As noted during development for SPARK-26729, there are a couple of integration test timeouts that are too short when running on slower clusters, e.g. developers' laptops, small CI clusters, etc. [~skonto] confirmed that he has also experienced this behaviour in the discussion on [PR 23846|https://github.com/apache/spark/pull/23846#discussion_r262564938] We should raise the defaults of these timeouts as an initial step and, longer term, consider making the timeouts themselves configurable -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23986) CompileException when using too many avg aggregation after joining
[ https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784482#comment-16784482 ] Pedro Fernandes edited comment on SPARK-23986 at 3/5/19 3:39 PM: - -Guys, is there a workaround for the folks that can't upgrade Spark version? Thanks.- Here's my workaround for, say, 10 aggregation operations: # dataframe1 = aggregations 1 to 5 # dataframe2 = aggregations 6 to 10 # dataframe1.join(dataframe2) was (Author: pedromorfeu): ~Guys, Is there a workaround for the folks that can't upgrade Spark version? Thanks.~ Here's my workaround for, say, 10 aggregation operations: # dataframe1 = aggregations 1 to 5 # dataframe2 = aggregations 6 to 10 # dataframe1.join(dataframe2) > CompileException when using too many avg aggregation after joining > -- > > Key: SPARK-23986 > URL: https://issues.apache.org/jira/browse/SPARK-23986 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Michel Davit >Assignee: Marco Gaido >Priority: Major > Fix For: 2.3.1, 2.4.0 > > Attachments: spark-generated.java > > > Considering the following code: > {code:java} > val df1: DataFrame = sparkSession.sparkContext > .makeRDD(Seq((0, 1, 2, 3, 4, 5, 6))) > .toDF("key", "col1", "col2", "col3", "col4", "col5", "col6") > val df2: DataFrame = sparkSession.sparkContext > .makeRDD(Seq((0, "val1", "val2"))) > .toDF("key", "dummy1", "dummy2") > val agg = df1 > .join(df2, df1("key") === df2("key"), "leftouter") > .groupBy(df1("key")) > .agg( > avg("col2").as("avg2"), > avg("col3").as("avg3"), > avg("col4").as("avg4"), > avg("col1").as("avg1"), > avg("col5").as("avg5"), > avg("col6").as("avg6") > ) > val head = agg.take(1) > {code} > This logs the following exception: > {code:java} > ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 467, Column 28: Redefinition of parameter "agg_expr_11" > {code} > I am not a spark expert but after investigation, I 
realized that the > generated {{doConsume}} method is responsible of the exception. > Indeed, {{avg}} calls several times > {{org.apache.spark.sql.execution.CodegenSupport.constructDoConsumeFunction}}. > The 1st time with the 'avg' Expr and a second time for the base aggregation > Expr (count and sum). > The problem comes from the generation of parameters in CodeGenerator: > {code:java} > /** >* Returns a term name that is unique within this instance of a > `CodegenContext`. >*/ > def freshName(name: String): String = synchronized { > val fullName = if (freshNamePrefix == "") { > name > } else { > s"${freshNamePrefix}_$name" > } > if (freshNameIds.contains(fullName)) { > val id = freshNameIds(fullName) > freshNameIds(fullName) = id + 1 > s"$fullName$id" > } else { > freshNameIds += fullName -> 1 > fullName > } > } > {code} > The {{freshNameIds}} already contains {{agg_expr_[1..6]}} from the 1st call. > The second call is made with {{agg_expr_[1..12]}} and generates the > following names: > {{agg_expr_[11|21|31|41|51|61|11|12]}}. We then have a parameter name > conflicts in the generated code: {{agg_expr_11.}} > Appending the 'id' in s"$fullName$id" to generate unique term name is source > of conflict. Maybe simply using undersoce can solve this issue : > $fullName_$id" -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org