[jira] [Commented] (SPARK-26330) Duplicate query execution events generated for SQL commands
[ https://issues.apache.org/jira/browse/SPARK-26330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790227#comment-16790227 ]

Sandeep Katta commented on SPARK-26330:
---------------------------------------

Looks the same as [SPARK-27114|https://issues.apache.org/jira/browse/SPARK-27114].

> Duplicate query execution events generated for SQL commands
> -----------------------------------------------------------
>
>                 Key: SPARK-26330
>                 URL: https://issues.apache.org/jira/browse/SPARK-26330
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Marcelo Vanzin
>            Priority: Minor
>
> Consider the following code:
> {code:java}
> spark.sql("create table foo (bar int)").show()
> {code}
> The command is executed eagerly (i.e. before {{show()}} is called) and
> generates a query execution event. But when you call {{show()}}, a duplicate
> event is generated, even though Spark does not execute anything at that point.
> This can be a little more misleading when you do something like a CTAS, since
> the duplicate events may cause listeners to think there were multiple inserts
> when that's not true.
> A fuller example that shows this (you can see from the output that both
> inputs to the listener are the same):
> {code:java}
> import org.apache.spark.sql.execution.QueryExecution
> import org.apache.spark.sql.util.QueryExecutionListener
>
> val lsnr = new QueryExecutionListener() {
>   override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
>     println(s"on success: $funcName -> ${qe.analyzed}")
>   }
>
>   override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = {
>     println(s"on failure: $funcName -> ${qe.analyzed}")
>   }
> }
>
> spark.sessionState.listenerManager.register(lsnr)
> spark.sql("drop table if exists test")
> val df = spark.sql("create table test(i int)")
> df.show()
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
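Building on the listener example above, one possible mitigation sketch for spark-shell — not an official recommendation, and the deduplication approach is an assumption of mine — is a listener that ignores repeated callbacks for the same {{QueryExecution}} instance (the report notes that both callbacks receive the same object):

```scala
import java.util.Collections
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

val dedupingListener = new QueryExecutionListener() {
  // Track QueryExecution instances already seen, by reference identity;
  // synchronized because listener callbacks may arrive from several threads.
  private val seen = Collections.synchronizedSet(
    Collections.newSetFromMap(
      new java.util.IdentityHashMap[QueryExecution, java.lang.Boolean]()))

  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
    // add() returns true only the first time this exact instance is seen
    if (seen.add(qe)) {
      println(s"on success: $funcName -> ${qe.analyzed}")
    }
  }

  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = {
    if (seen.add(qe)) {
      println(s"on failure: $funcName -> ${qe.analyzed}")
    }
  }
}

spark.sessionState.listenerManager.register(dedupingListener)
```

This only masks the symptom in user listeners; the duplicate execution-start events tracked in SPARK-27114 still reach the UI.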
[jira] [Resolved] (SPARK-27011) reset command fails after cache table
[ https://issues.apache.org/jira/browse/SPARK-27011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-27011.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 3.0.0

Issue resolved by pull request 23918
[https://github.com/apache/spark/pull/23918]

> reset command fails after cache table
> -------------------------------------
>
>                 Key: SPARK-27011
>                 URL: https://issues.apache.org/jira/browse/SPARK-27011
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.3.3, 2.4.0, 3.0.0
>            Reporter: Ajith S
>            Assignee: Ajith S
>            Priority: Minor
>             Fix For: 3.0.0
>
>
> h3. Commands to reproduce
> spark-sql> create table abcde ( a int);
> spark-sql> reset;             // succeeds
> spark-sql> cache table abcde;
> spark-sql> reset;             // fails with an exception
> h3. Below is the stack
> {{org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree:}}
> {{ResetCommand$}}
> {{at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)}}
> {{at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:379)}}
> {{at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:216)}}
> {{at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:211)}}
> {{at org.apache.spark.sql.catalyst.plans.QueryPlan.sameResult(QueryPlan.scala:259)}}
> {{at org.apache.spark.sql.execution.CacheManager.$anonfun$lookupCachedData$3(CacheManager.scala:236)}}
> {{at org.apache.spark.sql.execution.CacheManager.$anonfun$lookupCachedData$3$adapted(CacheManager.scala:236)}}
> {{at scala.collection.Iterator.find(Iterator.scala:993)}}
> {{at scala.collection.Iterator.find$(Iterator.scala:990)}}
> {{at scala.collection.AbstractIterator.find(Iterator.scala:1429)}}
> {{at scala.collection.IterableLike.find(IterableLike.scala:81)}}
> {{at scala.collection.IterableLike.find$(IterableLike.scala:80)}}
> {{at scala.collection.AbstractIterable.find(Iterable.scala:56)}}
> {{at org.apache.spark.sql.execution.CacheManager.$anonfun$lookupCachedData$2(CacheManager.scala:236)}}
> {{at org.apache.spark.sql.execution.CacheManager.readLock(CacheManager.scala:59)}}
> {{at org.apache.spark.sql.execution.CacheManager.lookupCachedData(CacheManager.scala:236)}}
> {{at org.apache.spark.sql.execution.CacheManager$$anonfun$1.applyOrElse(CacheManager.scala:250)}}
> {{at org.apache.spark.sql.execution.CacheManager$$anonfun$1.applyOrElse(CacheManager.scala:241)}}
> {{at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:258)}}
> {{at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)}}
> {{at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:258)}}
> {{at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)}}
> {{at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149)}}
> {{at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147)}}
> {{at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)}}
> {{at org.apache.spark.sql.execution.CacheManager.useCachedData(CacheManager.scala:241)}}
> {{at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:68)}}
> {{at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:65)}}
> {{at org.apache.spark.sql.execution.QueryExecution.$anonfun$optimizedPlan$1(QueryExecution.scala:72)}}
> {{at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)}}
> {{at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:72)}}
> {{at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:71)}}
> {{at org.apache.spark.sql.execution.QueryExecution.$anonfun$writePlans$4(QueryExecution.scala:139)}}
> {{at org.apache.spark.sql.catalyst.plans.QueryPlan$.append(QueryPlan.scala:316)}}
> {{at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$writePlans(QueryExecution.scala:139)}}
> {{at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:146)}}
> {{at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:82)}}
> {{at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:147)}}
> {{at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)}}
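The failure above comes from {{makeCopy}} being invoked on {{ResetCommand}}, a Scala case object. As a hedged illustration (my plain-Scala reconstruction of the mechanism, not the actual {{TreeNode}} code), copying a node via its primary constructor has nothing to invoke for an {{object}}, which compiles to a class with no public constructor:

```scala
// Hypothetical stand-ins: Node mimics a case-class plan node, Leaf a case
// object like ResetCommand.
case class Node(value: Int)
case object Leaf

// Simplified sketch of a reflective "copy via primary constructor" step.
def makeCopyLike(node: AnyRef, args: Array[AnyRef]): AnyRef = {
  val ctors = node.getClass.getConstructors
  // A Scala `object` exposes no public constructor, so this fails for Leaf.
  require(ctors.nonEmpty, s"No valid constructor for ${node.getClass.getName}")
  ctors.head.newInstance(args: _*).asInstanceOf[AnyRef]
}

makeCopyLike(Node(1), Array(Integer.valueOf(2)))  // succeeds: Node(2)
// makeCopyLike(Leaf, Array.empty)               // would fail the require
```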
[jira] [Assigned] (SPARK-27011) reset command fails after cache table
[ https://issues.apache.org/jira/browse/SPARK-27011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-27011:
-----------------------------------
    Assignee: Ajith S

> reset command fails after cache table
> -------------------------------------
>
>                 Key: SPARK-27011
>                 URL: https://issues.apache.org/jira/browse/SPARK-27011
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.3.3, 2.4.0, 3.0.0
>            Reporter: Ajith S
>            Assignee: Ajith S
>            Priority: Minor
>
>
> h3. Commands to reproduce
> spark-sql> create table abcde ( a int);
> spark-sql> reset;             // succeeds
> spark-sql> cache table abcde;
> spark-sql> reset;             // fails with an exception
> h3. Below is the stack
> {{org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree:}}
> {{ResetCommand$}}
> {{at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)}}
> {{at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:379)}}
> {{at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:216)}}
> {{at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:211)}}
> {{at org.apache.spark.sql.catalyst.plans.QueryPlan.sameResult(QueryPlan.scala:259)}}
> {{at org.apache.spark.sql.execution.CacheManager.$anonfun$lookupCachedData$3(CacheManager.scala:236)}}
> {{at org.apache.spark.sql.execution.CacheManager.$anonfun$lookupCachedData$3$adapted(CacheManager.scala:236)}}
> {{at scala.collection.Iterator.find(Iterator.scala:993)}}
> {{at scala.collection.Iterator.find$(Iterator.scala:990)}}
> {{at scala.collection.AbstractIterator.find(Iterator.scala:1429)}}
> {{at scala.collection.IterableLike.find(IterableLike.scala:81)}}
> {{at scala.collection.IterableLike.find$(IterableLike.scala:80)}}
> {{at scala.collection.AbstractIterable.find(Iterable.scala:56)}}
> {{at org.apache.spark.sql.execution.CacheManager.$anonfun$lookupCachedData$2(CacheManager.scala:236)}}
> {{at org.apache.spark.sql.execution.CacheManager.readLock(CacheManager.scala:59)}}
> {{at org.apache.spark.sql.execution.CacheManager.lookupCachedData(CacheManager.scala:236)}}
> {{at org.apache.spark.sql.execution.CacheManager$$anonfun$1.applyOrElse(CacheManager.scala:250)}}
> {{at org.apache.spark.sql.execution.CacheManager$$anonfun$1.applyOrElse(CacheManager.scala:241)}}
> {{at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:258)}}
> {{at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)}}
> {{at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:258)}}
> {{at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)}}
> {{at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149)}}
> {{at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147)}}
> {{at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)}}
> {{at org.apache.spark.sql.execution.CacheManager.useCachedData(CacheManager.scala:241)}}
> {{at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:68)}}
> {{at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:65)}}
> {{at org.apache.spark.sql.execution.QueryExecution.$anonfun$optimizedPlan$1(QueryExecution.scala:72)}}
> {{at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)}}
> {{at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:72)}}
> {{at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:71)}}
> {{at org.apache.spark.sql.execution.QueryExecution.$anonfun$writePlans$4(QueryExecution.scala:139)}}
> {{at org.apache.spark.sql.catalyst.plans.QueryPlan$.append(QueryPlan.scala:316)}}
> {{at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$writePlans(QueryExecution.scala:139)}}
> {{at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:146)}}
> {{at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:82)}}
> {{at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:147)}}
> {{at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)}}
> {{at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3346)}}
> {{at org.apache.spark.sql.Dataset.<init>(Dataset.scala:203)}}
[jira] [Commented] (SPARK-27011) reset command fails after cache table
[ https://issues.apache.org/jira/browse/SPARK-27011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790220#comment-16790220 ]

Ajith S commented on SPARK-27011:
---------------------------------

[~cloud_fan] As [https://github.com/apache/spark/pull/23918] is merged, can we close this?

> reset command fails after cache table
> -------------------------------------
>
>                 Key: SPARK-27011
>                 URL: https://issues.apache.org/jira/browse/SPARK-27011
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.3.3, 2.4.0, 3.0.0
>            Reporter: Ajith S
>            Priority: Minor
>
>
> h3. Commands to reproduce
> spark-sql> create table abcde ( a int);
> spark-sql> reset;             // succeeds
> spark-sql> cache table abcde;
> spark-sql> reset;             // fails with an exception
> h3. Below is the stack
> {{org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree:}}
> {{ResetCommand$}}
> {{at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)}}
> {{at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:379)}}
> {{at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:216)}}
> {{at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:211)}}
> {{at org.apache.spark.sql.catalyst.plans.QueryPlan.sameResult(QueryPlan.scala:259)}}
> {{at org.apache.spark.sql.execution.CacheManager.$anonfun$lookupCachedData$3(CacheManager.scala:236)}}
> {{at org.apache.spark.sql.execution.CacheManager.$anonfun$lookupCachedData$3$adapted(CacheManager.scala:236)}}
> {{at scala.collection.Iterator.find(Iterator.scala:993)}}
> {{at scala.collection.Iterator.find$(Iterator.scala:990)}}
> {{at scala.collection.AbstractIterator.find(Iterator.scala:1429)}}
> {{at scala.collection.IterableLike.find(IterableLike.scala:81)}}
> {{at scala.collection.IterableLike.find$(IterableLike.scala:80)}}
> {{at scala.collection.AbstractIterable.find(Iterable.scala:56)}}
> {{at org.apache.spark.sql.execution.CacheManager.$anonfun$lookupCachedData$2(CacheManager.scala:236)}}
> {{at org.apache.spark.sql.execution.CacheManager.readLock(CacheManager.scala:59)}}
> {{at org.apache.spark.sql.execution.CacheManager.lookupCachedData(CacheManager.scala:236)}}
> {{at org.apache.spark.sql.execution.CacheManager$$anonfun$1.applyOrElse(CacheManager.scala:250)}}
> {{at org.apache.spark.sql.execution.CacheManager$$anonfun$1.applyOrElse(CacheManager.scala:241)}}
> {{at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:258)}}
> {{at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)}}
> {{at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:258)}}
> {{at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)}}
> {{at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149)}}
> {{at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147)}}
> {{at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)}}
> {{at org.apache.spark.sql.execution.CacheManager.useCachedData(CacheManager.scala:241)}}
> {{at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:68)}}
> {{at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:65)}}
> {{at org.apache.spark.sql.execution.QueryExecution.$anonfun$optimizedPlan$1(QueryExecution.scala:72)}}
> {{at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)}}
> {{at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:72)}}
> {{at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:71)}}
> {{at org.apache.spark.sql.execution.QueryExecution.$anonfun$writePlans$4(QueryExecution.scala:139)}}
> {{at org.apache.spark.sql.catalyst.plans.QueryPlan$.append(QueryPlan.scala:316)}}
> {{at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$writePlans(QueryExecution.scala:139)}}
> {{at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:146)}}
> {{at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:82)}}
> {{at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:147)}}
> {{at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)}}
> {{at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3346)}}
[jira] [Commented] (SPARK-26961) Found Java-level deadlock in Spark Driver
[ https://issues.apache.org/jira/browse/SPARK-26961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790208#comment-16790208 ]

Ajith S commented on SPARK-26961:
---------------------------------

[~srowen] Yes, I am of the same opinion: fix it via registerAsParallelCapable. I will raise a PR for this.

[~xsapphire] I think these class loaders are child class loaders of LaunchAppClassLoader, which already has the classes for the jars on the classpath, so the overhead should not be of a higher magnitude.

> Found Java-level deadlock in Spark Driver
> -----------------------------------------
>
>                 Key: SPARK-26961
>                 URL: https://issues.apache.org/jira/browse/SPARK-26961
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Submit
>    Affects Versions: 2.3.0
>            Reporter: Rong Jialei
>            Priority: Major
>
> Our Spark job usually finishes in minutes; however, we recently found it
> taking days to run, and we could only kill it when this happened.
> An investigation showed that none of the worker containers could connect to
> the driver after start, and the driver was hanging. Using jstack, we found a
> Java-level deadlock.
>
> *Jstack output for the deadlock part is shown below:*
>
> Found one Java-level deadlock:
> =============================
> "SparkUI-907":
>   waiting to lock monitor 0x7f387761b398 (object 0x0005c0c1e5e0, a org.apache.hadoop.conf.Configuration),
>   which is held by "ForkJoinPool-1-worker-57"
> "ForkJoinPool-1-worker-57":
>   waiting to lock monitor 0x7f3860574298 (object 0x0005b7991168, a org.apache.spark.util.MutableURLClassLoader),
>   which is held by "ForkJoinPool-1-worker-7"
> "ForkJoinPool-1-worker-7":
>   waiting to lock monitor 0x7f387761b398 (object 0x0005c0c1e5e0, a org.apache.hadoop.conf.Configuration),
>   which is held by "ForkJoinPool-1-worker-57"
>
> Java stack information for the threads listed above:
> ===================================================
> "SparkUI-907":
>   at org.apache.hadoop.conf.Configuration.getOverlay(Configuration.java:1328)
>   - waiting to lock <0x0005c0c1e5e0> (a org.apache.hadoop.conf.Configuration)
>   at org.apache.hadoop.conf.Configuration.handleDeprecation(Configuration.java:684)
>   at org.apache.hadoop.conf.Configuration.get(Configuration.java:1088)
>   at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1145)
>   at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2363)
>   at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2840)
>   at org.apache.hadoop.fs.FsUrlStreamHandlerFactory.createURLStreamHandler(FsUrlStreamHandlerFactory.java:74)
>   at java.net.URL.getURLStreamHandler(URL.java:1142)
>   at java.net.URL.<init>(URL.java:599)
>   at java.net.URL.<init>(URL.java:490)
>   at java.net.URL.<init>(URL.java:439)
>   at org.apache.spark.ui.JettyUtils$$anon$4.doRequest(JettyUtils.scala:176)
>   at org.apache.spark.ui.JettyUtils$$anon$4.doGet(JettyUtils.scala:161)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>   at org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>   at org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>   at org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:171)
>   at org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>   at org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
>   at org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>   at org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>   at org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>   at org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
>   at org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
>   at org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>   at org.spark_project.jetty.server.Server.handle(Server.java:534)
>   at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:320)
>   at org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
>   at org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
>   at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108)
>   at org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>   at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
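For context on the registerAsParallelCapable fix discussed above, here is a hedged sketch (class name is mine, not the actual Spark patch; Java rather than Scala because registerAsParallelCapable must be called from a static initializer of the loader class itself): a parallel-capable loader locks per class name instead of on the loader instance, which breaks the loader-wide lock involved in the deadlock.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical example of a parallel-capable URL class loader.
public class ParallelMutableURLClassLoader extends URLClassLoader {
    static {
        // Registers THIS class as parallel-capable; subclasses of a
        // parallel-capable loader must register themselves explicitly.
        ClassLoader.registerAsParallelCapable();
    }

    public ParallelMutableURLClassLoader(URL[] urls, ClassLoader parent) {
        super(urls, parent);
    }

    // Exposed only to demonstrate the effect: for a parallel-capable loader,
    // getClassLoadingLock returns a per-class-name token, not the loader.
    public Object lockFor(String className) {
        return getClassLoadingLock(className);
    }
}
```

Without the static registration, `getClassLoadingLock` would return the loader itself, serializing all class loading on one monitor.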
[jira] [Resolved] (SPARK-27117) current_date/current_timestamp should not refer to columns with ansi parser mode
[ https://issues.apache.org/jira/browse/SPARK-27117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-27117.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 3.0.0

Issue resolved by pull request 24039
[https://github.com/apache/spark/pull/24039]

> current_date/current_timestamp should not refer to columns with ansi parser mode
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-27117
>                 URL: https://issues.apache.org/jira/browse/SPARK-27117
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>            Priority: Major
>             Fix For: 3.0.0
[jira] [Commented] (SPARK-27114) SQL Tab shows duplicate executions for some commands
[ https://issues.apache.org/jira/browse/SPARK-27114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790136#comment-16790136 ]

Ajith S commented on SPARK-27114:
---------------------------------

[~srowen] Since a *LocalRelation* is eagerly evaluated, the second evaluation can be skipped; I suppose it is not actually executed the second time, as it would throw an exception (in this case, that the table already exists). Currently it uses two execution IDs and fires a duplicate *SparkListenerSQLExecutionStart* event. This causes the app store to record a duplicate event, and hence it shows up in the UI twice.

> SQL Tab shows duplicate executions for some commands
> ----------------------------------------------------
>
>                 Key: SPARK-27114
>                 URL: https://issues.apache.org/jira/browse/SPARK-27114
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Ajith S
>            Priority: Minor
>         Attachments: Screenshot from 2019-03-09 14-04-07.png
>
>
> Run a simple SQL command:
> {{create table abc ( a int );}}
> Open the SQL tab in the Spark UI; we can see duplicate entries for the execution.
> Tested the behaviour in both the thriftserver and spark-sql. *Check the attachment.*
> The problem seems to be due to the eager execution of commands at
> org.apache.spark.sql.Dataset#logicalPlan. After analysis for spark-sql, the
> call stacks for the duplicate execution IDs seem to be:
> {code:java}
> $anonfun$withNewExecutionId$1:78, SQLExecution$ (org.apache.spark.sql.execution)
> apply:-1, 2057192703 (org.apache.spark.sql.execution.SQLExecution$$$Lambda$1036)
> withSQLConfPropagated:147, SQLExecution$ (org.apache.spark.sql.execution)
> withNewExecutionId:74, SQLExecution$ (org.apache.spark.sql.execution)
> withAction:3346, Dataset (org.apache.spark.sql)
> <init>:203, Dataset (org.apache.spark.sql)
> ofRows:88, Dataset$ (org.apache.spark.sql)
> sql:656, SparkSession (org.apache.spark.sql)
> sql:685, SQLContext (org.apache.spark.sql)
> run:63, SparkSQLDriver (org.apache.spark.sql.hive.thriftserver)
> processCmd:372, SparkSQLCLIDriver (org.apache.spark.sql.hive.thriftserver)
> processLine:376, CliDriver (org.apache.hadoop.hive.cli)
> main:275, SparkSQLCLIDriver$ (org.apache.spark.sql.hive.thriftserver)
> main:-1, SparkSQLCLIDriver (org.apache.spark.sql.hive.thriftserver)
> invoke0:-1, NativeMethodAccessorImpl (sun.reflect)
> invoke:62, NativeMethodAccessorImpl (sun.reflect)
> invoke:43, DelegatingMethodAccessorImpl (sun.reflect)
> invoke:498, Method (java.lang.reflect)
> start:52, JavaMainApplication (org.apache.spark.deploy)
> org$apache$spark$deploy$SparkSubmit$$runMain:855, SparkSubmit (org.apache.spark.deploy)
> doRunMain$1:162, SparkSubmit (org.apache.spark.deploy)
> submit:185, SparkSubmit (org.apache.spark.deploy)
> doSubmit:87, SparkSubmit (org.apache.spark.deploy)
> doSubmit:934, SparkSubmit$$anon$2 (org.apache.spark.deploy)
> main:943, SparkSubmit$ (org.apache.spark.deploy)
> main:-1, SparkSubmit (org.apache.spark.deploy){code}
> {code:java}
> $anonfun$withNewExecutionId$1:78, SQLExecution$ (org.apache.spark.sql.execution)
> apply:-1, 2057192703 (org.apache.spark.sql.execution.SQLExecution$$$Lambda$1036)
> withSQLConfPropagated:147, SQLExecution$ (org.apache.spark.sql.execution)
> withNewExecutionId:74, SQLExecution$ (org.apache.spark.sql.execution)
> run:65, SparkSQLDriver (org.apache.spark.sql.hive.thriftserver)
> processCmd:372, SparkSQLCLIDriver (org.apache.spark.sql.hive.thriftserver)
> processLine:376, CliDriver (org.apache.hadoop.hive.cli)
> main:275, SparkSQLCLIDriver$ (org.apache.spark.sql.hive.thriftserver)
> main:-1, SparkSQLCLIDriver (org.apache.spark.sql.hive.thriftserver)
> invoke0:-1, NativeMethodAccessorImpl (sun.reflect)
> invoke:62, NativeMethodAccessorImpl (sun.reflect)
> invoke:43, DelegatingMethodAccessorImpl (sun.reflect)
> invoke:498, Method (java.lang.reflect)
> start:52, JavaMainApplication (org.apache.spark.deploy)
> org$apache$spark$deploy$SparkSubmit$$runMain:855, SparkSubmit (org.apache.spark.deploy)
> doRunMain$1:162, SparkSubmit (org.apache.spark.deploy)
> submit:185, SparkSubmit (org.apache.spark.deploy)
> doSubmit:87, SparkSubmit (org.apache.spark.deploy)
> doSubmit:934, SparkSubmit$$anon$2 (org.apache.spark.deploy)
> main:943, SparkSubmit$ (org.apache.spark.deploy)
> main:-1, SparkSubmit (org.apache.spark.deploy){code}
[jira] [Commented] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source
[ https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790119#comment-16790119 ]

Xianyin Xin commented on SPARK-21067:
-------------------------------------

Yep, [~Moriarty279], nice analysis.

> Thrift Server - CTAS fail with Unable to move source
> ----------------------------------------------------
>
>                 Key: SPARK-21067
>                 URL: https://issues.apache.org/jira/browse/SPARK-21067
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.1, 2.2.0, 2.4.0
>        Environment: Yarn
>                     Hive MetaStore
>                     HDFS (HA)
>            Reporter: Dominic Ricard
>            Priority: Major
>         Attachments: SPARK-21067.patch
>
>
> After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS
> would fail, sometimes...
> Most of the time, the CTAS would work only once, after starting the thrift
> server. After that, dropping the table and re-issuing the same CTAS would fail
> with the following message (sometimes it fails right away, sometimes it works
> for a long period of time):
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0 to destination hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; (state=,code=0)
> {noformat}
> We have already found the following Jira
> (https://issues.apache.org/jira/browse/SPARK-11021), which states that
> {{hive.exec.stagingdir}} had to be added in order for Spark to be able to
> handle CREATE TABLE properly as of 2.0. As you can see in the error, we have
> ours set to "/tmp/hive-staging/\{user.name\}"
> Same issue with INSERT statements:
> {noformat}
> CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE dricard.test SELECT 1;
> Error: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0 to destination hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; (state=,code=0)
> {noformat}
> This worked fine in 1.6.2, which we currently run in our production
> environment, but since 2.0+ we haven't been able to CREATE TABLE consistently
> on the cluster.
> SQL to reproduce the issue:
> {noformat}
> DROP SCHEMA IF EXISTS dricard CASCADE;
> CREATE SCHEMA dricard;
> CREATE TABLE dricard.test (col1 int);
> INSERT INTO TABLE dricard.test SELECT 1;
> SELECT * from dricard.test;
> DROP TABLE dricard.test;
> CREATE TABLE dricard.test AS select 1 as `col1`;
> SELECT * from dricard.test
> {noformat}
> The Thrift server usually fails at the INSERT...
> We tried the same procedure in a Spark context using spark.sql() and didn't
> encounter the same issue.
> Full stack Trace:
> {noformat}
> 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error executing query, currentState RUNNING, org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0 to destination hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0;
>         at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
>         at org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766)
>         at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374)
>         at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221)
>         at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>         at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>         at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>         at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
>         at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
>         at org.apache.spark.sql.Dataset.<init>(Dataset.scala:185)
> {noformat}
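For reference, the staging-directory setting mentioned above (from SPARK-11021) is typically passed to the Thrift server as a Hive configuration via `--hiveconf`, which `start-thriftserver.sh` accepts. The path below is illustrative only, mirroring the reporter's per-user layout; it is not a confirmed fix for this issue:

```shell
# Hypothetical example: start the Thrift server with an explicit per-user
# Hive staging directory, as discussed in SPARK-11021.
./sbin/start-thriftserver.sh \
  --hiveconf hive.exec.stagingdir=/tmp/hive-staging/${USER}
```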
[jira] [Commented] (SPARK-27124) Expose org.apache.spark.sql.avro.SchemaConverters as developer API
[ https://issues.apache.org/jira/browse/SPARK-27124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790083#comment-16790083 ]

Hyukjin Kwon commented on SPARK-27124:
--------------------------------------

{{SchemaConverters}} isn't an API, as you said. I think it's a bit odd to document that this specific internal class can be used via Python.

> Expose org.apache.spark.sql.avro.SchemaConverters as developer API
> ------------------------------------------------------------------
>
>                 Key: SPARK-27124
>                 URL: https://issues.apache.org/jira/browse/SPARK-27124
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, SQL
>    Affects Versions: 3.0.0
>            Reporter: Gabor Somogyi
>            Priority: Minor
>
> org.apache.spark.sql.avro.SchemaConverters provides extremely useful APIs to
> convert schemas between Spark SQL and Avro. This is reachable from the Scala
> side but not from PySpark. I suggest adding this as a developer API to ease
> development for PySpark users.
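For context, a hedged sketch of how the Scala side is typically used. This assumes the external spark-avro module is on the classpath and the {{toSqlType}} signature as it appears in versions where the class is public; the Avro schema string is purely illustrative:

```scala
import org.apache.avro.Schema
import org.apache.spark.sql.avro.SchemaConverters

// Hypothetical Avro record schema, for illustration only.
val avroJson =
  """{"type":"record","name":"User","fields":[
    |  {"name":"name","type":"string"},
    |  {"name":"age","type":"int"}
    |]}""".stripMargin

val avroSchema = new Schema.Parser().parse(avroJson)

// Avro -> Spark SQL type; toSqlType returns a SchemaType(dataType, nullable).
val sqlType = SchemaConverters.toSqlType(avroSchema)
println(sqlType.dataType.catalogString)  // a struct with name and age fields
```

The PySpark gap the reporter describes is that no equivalent wrapper exists on the Python side, so this conversion requires going through the JVM gateway by hand.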
[jira] [Resolved] (SPARK-27109) Refactoring of TimestampFormatter and DateFormatter
[ https://issues.apache.org/jira/browse/SPARK-27109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27109. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24030 [https://github.com/apache/spark/pull/24030] > Refactoring of TimestampFormatter and DateFormatter > --- > > Key: SPARK-27109 > URL: https://issues.apache.org/jira/browse/SPARK-27109 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > * Date/TimestampFormatter converts parsed input to Instant before converting > it to days/micros. This is unnecessary conversion because seconds and > fraction of second can be extracted (calculated) from ZoneDateTime directly > * Avoid additional extraction of TemporalQueries.localTime from > temporalAccessor -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27109) Refactoring of TimestampFormatter and DateFormatter
[ https://issues.apache.org/jira/browse/SPARK-27109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-27109: - Assignee: Maxim Gekk > Refactoring of TimestampFormatter and DateFormatter > --- > > Key: SPARK-27109 > URL: https://issues.apache.org/jira/browse/SPARK-27109 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > > * Date/TimestampFormatter converts parsed input to Instant before converting > it to days/micros. This is unnecessary conversion because seconds and > fraction of second can be extracted (calculated) from ZoneDateTime directly > * Avoid additional extraction of TemporalQueries.localTime from > temporalAccessor -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26921) Fix CRAN hack as soon as Arrow is available on CRAN
[ https://issues.apache.org/jira/browse/SPARK-26921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26921: - Description: Arrow optimization was added but Arrow is not available on CRAN. So, it had to add some hacks to avoid CRAN check in SparkR side. For example, see https://github.com/apache/spark/search?q=requireNamespace1_q=requireNamespace1 These should be removed to properly check CRAN in SparkR See also ARROW-3204 was: Arrow optimization was added but Arrow is not available on CRAN. So, it had to add some hacks to avoid CRAN check in SparkR side. For example, see https://github.com/apache/spark/search?q=requireNamespace1_q=requireNamespace1 These should be removed to properly check CRAN in SparkR > Fix CRAN hack as soon as Arrow is available on CRAN > --- > > Key: SPARK-26921 > URL: https://issues.apache.org/jira/browse/SPARK-26921 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > Arrow optimization was added but Arrow is not available on CRAN. > So, it had to add some hacks to avoid CRAN check in SparkR side. For example, > see > https://github.com/apache/spark/search?q=requireNamespace1_q=requireNamespace1 > These should be removed to properly check CRAN in SparkR > See also ARROW-3204 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26923) Refactor ArrowRRunner and RRunner to share the same base
[ https://issues.apache.org/jira/browse/SPARK-26923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26923. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23977 [https://github.com/apache/spark/pull/23977] > Refactor ArrowRRunner and RRunner to share the same base > > > Key: SPARK-26923 > URL: https://issues.apache.org/jira/browse/SPARK-26923 > Project: Spark > Issue Type: Sub-task > Components: SparkR, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > > > ArrowRRunner and RRunner already have a lot of duplicated code. We should refactor and > deduplicate them. Also, ArrowRRunner happens to contain some rather hacky code > (see > https://github.com/apache/spark/pull/23787/files#diff-a0b6a11cc2e2299455c795fe3c96b823R61 > ). > We might even be able to deduplicate some code with the PythonRunners. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26923) Refactor ArrowRRunner and RRunner to share the same base
[ https://issues.apache.org/jira/browse/SPARK-26923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-26923: Assignee: Hyukjin Kwon > Refactor ArrowRRunner and RRunner to share the same base > > > Key: SPARK-26923 > URL: https://issues.apache.org/jira/browse/SPARK-26923 > Project: Spark > Issue Type: Sub-task > Components: SparkR, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > ArrowRRunner and RRunner has already duplicated codes. We should refactor and > deduplicate them. Also, ArrowRRunner happened to have a rather hacky code > (see > https://github.com/apache/spark/pull/23787/files#diff-a0b6a11cc2e2299455c795fe3c96b823R61 > ). > We might even be able to deduplicate some codes with PythonRunners. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26594) DataSourceOptions.asMap should return CaseInsensitiveMap
[ https://issues.apache.org/jira/browse/SPARK-26594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26594: Assignee: (was: Apache Spark) > DataSourceOptions.asMap should return CaseInsensitiveMap > > > Key: SPARK-26594 > URL: https://issues.apache.org/jira/browse/SPARK-26594 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Shixiong Zhu >Priority: Major > > I'm pretty surprised that the following codes will fail. > {code} > import scala.collection.JavaConverters._ > import org.apache.spark.sql.sources.v2.DataSourceOptions > val map = new DataSourceOptions(Map("fooBar" -> "x").asJava).asMap > assert(map.get("fooBar") == "x") > {code} > It's better to make DataSourceOptions.asMap return CaseInsensitiveMap. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26594) DataSourceOptions.asMap should return CaseInsensitiveMap
[ https://issues.apache.org/jira/browse/SPARK-26594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26594: Assignee: Apache Spark > DataSourceOptions.asMap should return CaseInsensitiveMap > > > Key: SPARK-26594 > URL: https://issues.apache.org/jira/browse/SPARK-26594 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark >Priority: Major > > I'm pretty surprised that the following codes will fail. > {code} > import scala.collection.JavaConverters._ > import org.apache.spark.sql.sources.v2.DataSourceOptions > val map = new DataSourceOptions(Map("fooBar" -> "x").asJava).asMap > assert(map.get("fooBar") == "x") > {code} > It's better to make DataSourceOptions.asMap return CaseInsensitiveMap. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
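The surprise in the ticket above is that `DataSourceOptions` stores option names case-insensitively, but the plain `java.util.Map` returned by `asMap` compares keys case-sensitively, so `map.get("fooBar")` misses. A minimal sketch of the case-insensitive lookup the ticket asks for (class and method names here are illustrative, not Spark's actual `CaseInsensitiveMap`):

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of case-insensitive option lookup, illustrating the
// behavior the ticket asks DataSourceOptions.asMap to preserve. Not Spark's
// real implementation.
public class CaseInsensitiveOptions {
    private final Map<String, String> normalized = new HashMap<>();

    public CaseInsensitiveOptions(Map<String, String> base) {
        // Normalize keys once at construction; lookups normalize the query key.
        for (Map.Entry<String, String> e : base.entrySet()) {
            normalized.put(e.getKey().toLowerCase(Locale.ROOT), e.getValue());
        }
    }

    public String get(String key) {
        return normalized.get(key.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        CaseInsensitiveOptions opts =
            new CaseInsensitiveOptions(Map.of("fooBar", "x"));
        // Both spellings hit the same entry, unlike a plain HashMap.
        System.out.println(opts.get("fooBar")); // x
        System.out.println(opts.get("foobar")); // x
    }
}
```

Returning a map with this lookup behavior would make the `assert` in the ticket's reproducer pass regardless of the caller's capitalization.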
[jira] [Assigned] (SPARK-27123) Improve CollapseProject to handle projects cross limit/repartition/sample
[ https://issues.apache.org/jira/browse/SPARK-27123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-27123: - Assignee: Dongjoon Hyun > Improve CollapseProject to handle projects cross limit/repartition/sample > - > > Key: SPARK-27123 > URL: https://issues.apache.org/jira/browse/SPARK-27123 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > > `CollapseProject` optimizer simplifies the plan by merging the adjacent > projects and performing alias substitution. > {code:java} > scala> sql("SELECT b c FROM (SELECT a b FROM t)").explain > == Physical Plan == > *(1) Project [a#5 AS c#1] > +- Scan hive default.t [a#5], HiveTableRelation `default`.`t`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#5] > {code} > We can do that more complex cases like the following. > *BEFORE* > {code:java} > scala> sql("SELECT b c FROM (SELECT /*+ REPARTITION(1) */ a b FROM > t)").explain > == Physical Plan == > *(2) Project [b#0 AS c#1] > +- Exchange RoundRobinPartitioning(1) >+- *(1) Project [a#5 AS b#0] > +- Scan hive default.t [a#5], HiveTableRelation `default`.`t`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#5] > {code} > *AFTER* > {code:java} > scala> sql("SELECT b c FROM (SELECT /*+ REPARTITION(1) */ a b FROM > t)").explain > == Physical Plan == > Exchange RoundRobinPartitioning(1) > +- *(1) Project [a#11 AS c#7] >+- Scan hive default.t [a#11], HiveTableRelation `default`.`t`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#11] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26958) Add NestedSchemaPruningBenchmark
[ https://issues.apache.org/jira/browse/SPARK-26958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-26958: - Assignee: Dongjoon Hyun > Add NestedSchemaPruningBenchmark > > > Key: SPARK-26958 > URL: https://issues.apache.org/jira/browse/SPARK-26958 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > This adds `NestedSchemaPruningBenchmark` to clearly verify the performance > benefits of the ongoing PRs and to prevent future performance regressions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27123) Improve CollapseProject to handle projects cross limit/repartition/sample
[ https://issues.apache.org/jira/browse/SPARK-27123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27123: -- Issue Type: Sub-task (was: Improvement) Parent: SPARK-25603 > Improve CollapseProject to handle projects cross limit/repartition/sample > - > > Key: SPARK-27123 > URL: https://issues.apache.org/jira/browse/SPARK-27123 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > `CollapseProject` optimizer simplifies the plan by merging the adjacent > projects and performing alias substitution. > {code:java} > scala> sql("SELECT b c FROM (SELECT a b FROM t)").explain > == Physical Plan == > *(1) Project [a#5 AS c#1] > +- Scan hive default.t [a#5], HiveTableRelation `default`.`t`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#5] > {code} > We can do that more complex cases like the following. > *BEFORE* > {code:java} > scala> sql("SELECT b c FROM (SELECT /*+ REPARTITION(1) */ a b FROM > t)").explain > == Physical Plan == > *(2) Project [b#0 AS c#1] > +- Exchange RoundRobinPartitioning(1) >+- *(1) Project [a#5 AS b#0] > +- Scan hive default.t [a#5], HiveTableRelation `default`.`t`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#5] > {code} > *AFTER* > {code:java} > scala> sql("SELECT b c FROM (SELECT /*+ REPARTITION(1) */ a b FROM > t)").explain > == Physical Plan == > Exchange RoundRobinPartitioning(1) > +- *(1) Project [a#11 AS c#7] >+- Scan hive default.t [a#11], HiveTableRelation `default`.`t`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#11] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26822) Upgrade the deprecated module 'optparse'
[ https://issues.apache.org/jira/browse/SPARK-26822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790018#comment-16790018 ] Thincrs commented on SPARK-26822: - A user of thincrs has selected this issue. Deadline: Mon, Mar 18, 2019 10:19 PM > Upgrade the deprecated module 'optparse' > > > Key: SPARK-26822 > URL: https://issues.apache.org/jira/browse/SPARK-26822 > Project: Spark > Issue Type: Task > Components: Tests >Affects Versions: 2.4.0 >Reporter: Neo Chien >Assignee: Neo Chien >Priority: Minor > Labels: pull-request-available, test > Fix For: 3.0.0 > > > Follow the [official > document|https://docs.python.org/2/library/argparse.html#upgrading-optparse-code] > to upgrade the deprecated module 'optparse' to 'argparse'. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27073) Fix a race condition when handling of IdleStateEvent
[ https://issues.apache.org/jira/browse/SPARK-27073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-27073. Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23989 [https://github.com/apache/spark/pull/23989] > Fix a race condition when handling of IdleStateEvent > > > Key: SPARK-27073 > URL: https://issues.apache.org/jira/browse/SPARK-27073 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0, 2.4.0 >Reporter: dzcxzl >Priority: Minor > Fix For: 3.0.0 > > > When TransportChannelHandler processes IdleStateEvent, it first calculates > whether the last request time has timed out. > At this time, TransportClient.sendRpc initiates a request. > TransportChannelHandler gets responseHandler.numOutstandingRequests() > 0, > causing the normal connection to be closed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27004) Code for https uri authentication in Spark Submit needs to be removed
[ https://issues.apache.org/jira/browse/SPARK-27004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-27004. Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24033 [https://github.com/apache/spark/pull/24033] > Code for https uri authentication in Spark Submit needs to be removed > - > > Key: SPARK-27004 > URL: https://issues.apache.org/jira/browse/SPARK-27004 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Stavros Kontopoulos >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 3.0.0 > > > The old code in Spark Submit used for uri verification according to the > comments > [here|https://github.com/apache/spark/pull/23546#issuecomment-463340476] and > [here|https://github.com/apache/spark/pull/23546#issuecomment-463366075] > needs to be removed or refactored otherwise it will cause failures with > secure http uris. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27004) Code for https uri authentication in Spark Submit needs to be removed
[ https://issues.apache.org/jira/browse/SPARK-27004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-27004: -- Assignee: Marcelo Vanzin > Code for https uri authentication in Spark Submit needs to be removed > - > > Key: SPARK-27004 > URL: https://issues.apache.org/jira/browse/SPARK-27004 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Stavros Kontopoulos >Assignee: Marcelo Vanzin >Priority: Minor > > The old code in Spark Submit used for uri verification according to the > comments > [here|https://github.com/apache/spark/pull/23546#issuecomment-463340476] and > [here|https://github.com/apache/spark/pull/23546#issuecomment-463366075] > needs to be removed or refactored otherwise it will cause failures with > secure http uris. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23160) Add more window sql tests
[ https://issues.apache.org/jira/browse/SPARK-23160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789877#comment-16789877 ] Dylan Guedes commented on SPARK-23160: -- Hi, I would like to work on this one, but to be fair I didn't get the meaning of "tests in other major databases". [~jiangxb1987] do you remember what scenarios you had in mind? > Add more window sql tests > - > > Key: SPARK-23160 > URL: https://issues.apache.org/jira/browse/SPARK-23160 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xingbo Jiang >Priority: Minor > > We should also cover the window SQL interface, for example in > `sql/core/src/test/resources/sql-tests/inputs/window.sql`; it would also be > interesting to see whether we can generate consistent results for window tests in > other major databases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27129) Add JSON event serialization methods in JSONUtils for blacklisted events
[ https://issues.apache.org/jira/browse/SPARK-27129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27129: Assignee: (was: Apache Spark) > Add JSON event serialization methods in JSONUtils for blacklisted events > > > Key: SPARK-27129 > URL: https://issues.apache.org/jira/browse/SPARK-27129 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Shahid K I >Priority: Minor > > Add JSON event serialization methods in JSONUtils for blacklisted events -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27129) Add JSON event serialization methods in JSONUtils for blacklisted events
[ https://issues.apache.org/jira/browse/SPARK-27129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-27129. Resolution: Invalid Unless you can show why this (or in this case the lack of this) is a problem, there is nothing to do here. > Add JSON event serialization methods in JSONUtils for blacklisted events > > > Key: SPARK-27129 > URL: https://issues.apache.org/jira/browse/SPARK-27129 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Shahid K I >Priority: Minor > > Add JSON event serialization methods in JSONUtils for blacklisted events -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27129) Add JSON event serialization methods in JSONUtils for blacklisted events
[ https://issues.apache.org/jira/browse/SPARK-27129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27129: Assignee: Apache Spark > Add JSON event serialization methods in JSONUtils for blacklisted events > > > Key: SPARK-27129 > URL: https://issues.apache.org/jira/browse/SPARK-27129 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Shahid K I >Assignee: Apache Spark >Priority: Minor > > Add JSON event serialization methods in JSONUtils for blacklisted events -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27129) Add JSON event serialization methods in JSONUtils for blacklisted events
Shahid K I created SPARK-27129: -- Summary: Add JSON event serialization methods in JSONUtils for blacklisted events Key: SPARK-27129 URL: https://issues.apache.org/jira/browse/SPARK-27129 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Shahid K I Add JSON event serialization methods in JSONUtils for blacklisted events -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26961) Found Java-level deadlock in Spark Driver
[ https://issues.apache.org/jira/browse/SPARK-26961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789829#comment-16789829 ] Mi Zi commented on SPARK-26961: --- I think the problem could be fixed from a more general perspective if there were a way to read a Configuration without updating its content. That would let the Configuration object be accessed without triggering the class loader. I'm familiar with neither Hadoop nor Spark, so I don't know whether that's a practical solution. I also don't know how far registerAsParallelCapable would affect the whole system: it causes one lock to be stored per loaded class, and I'm not sure how large that overhead would be. > Found Java-level deadlock in Spark Driver > - > > Key: SPARK-26961 > URL: https://issues.apache.org/jira/browse/SPARK-26961 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.3.0 >Reporter: Rong Jialei >Priority: Major > > Our Spark jobs usually finish in minutes; however, we recently found one that > took days to run, and we could only kill it when this happened. > An investigation showed that no worker container could connect to the driver > after starting, and the driver was hanging; using jstack, we found a Java-level deadlock. 
> > *Jstack output for deadlock part is showing below:* > > Found one Java-level deadlock: > = > "SparkUI-907": > waiting to lock monitor 0x7f387761b398 (object 0x0005c0c1e5e0, a > org.apache.hadoop.conf.Configuration), > which is held by "ForkJoinPool-1-worker-57" > "ForkJoinPool-1-worker-57": > waiting to lock monitor 0x7f3860574298 (object 0x0005b7991168, a > org.apache.spark.util.MutableURLClassLoader), > which is held by "ForkJoinPool-1-worker-7" > "ForkJoinPool-1-worker-7": > waiting to lock monitor 0x7f387761b398 (object 0x0005c0c1e5e0, a > org.apache.hadoop.conf.Configuration), > which is held by "ForkJoinPool-1-worker-57" > Java stack information for the threads listed above: > === > "SparkUI-907": > at org.apache.hadoop.conf.Configuration.getOverlay(Configuration.java:1328) > - waiting to lock <0x0005c0c1e5e0> (a > org.apache.hadoop.conf.Configuration) > at > org.apache.hadoop.conf.Configuration.handleDeprecation(Configuration.java:684) > at org.apache.hadoop.conf.Configuration.get(Configuration.java:1088) > at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1145) > at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2363) > at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2840) > at > org.apache.hadoop.fs.FsUrlStreamHandlerFactory.createURLStreamHandler(FsUrlStreamHandlerFactory.java:74) > at java.net.URL.getURLStreamHandler(URL.java:1142) > at java.net.URL.(URL.java:599) > at java.net.URL.(URL.java:490) > at java.net.URL.(URL.java:439) > at org.apache.spark.ui.JettyUtils$$anon$4.doRequest(JettyUtils.scala:176) > at org.apache.spark.ui.JettyUtils$$anon$4.doGet(JettyUtils.scala:161) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:687) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) > at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) > at > 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:171) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:534) > at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:320) > at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108) > at >
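The jstack output above is a classic lock-order inversion: one thread holds the `Configuration` monitor and wants the `MutableURLClassLoader` monitor, while another holds the class loader and wants the `Configuration`. Parallel-capable class loaders sidestep the loader-wide monitor by taking one lock object per class name, which is what the commenter's `registerAsParallelCapable` suggestion enables. A sketch of that per-name locking scheme (hypothetical class, modeled on `ClassLoader.getClassLoadingLock`):

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the per-name locking used by parallel-capable class
// loaders (cf. ClassLoader.registerAsParallelCapable and
// ClassLoader.getClassLoadingLock): one lock per class name instead of one
// monitor for the whole loader, so unrelated class loads never contend on
// (or deadlock over) a single loader-wide lock.
public class PerNameLocks {
    private final ConcurrentHashMap<String, Object> locks = new ConcurrentHashMap<>();

    // The same name always yields the same lock object; different names get
    // independent locks.
    public Object lockFor(String className) {
        return locks.computeIfAbsent(className, k -> new Object());
    }

    public static void main(String[] args) {
        PerNameLocks l = new PerNameLocks();
        System.out.println(l.lockFor("a.B") == l.lockFor("a.B")); // true
        System.out.println(l.lockFor("a.B") == l.lockFor("a.C")); // false
    }
}
```

The commenter's overhead concern maps directly to the `locks` map here: it retains one small object per distinct class name, which is modest next to the loaded classes themselves.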
[jira] [Commented] (SPARK-27093) Honor ParseMode in AvroFileFormat
[ https://issues.apache.org/jira/browse/SPARK-27093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789816#comment-16789816 ] Tim Cerexhe commented on SPARK-27093: - We have three distinct failure modes that we need to not throw: * corrupt avro file due to bad signal data, eg. file is truncated / throws {{EOFException}} before parsing complete. This can be reproduced by truncating a valid avro file * corrupt avro file due to invalid schema header, eg. corrupted over network. I believe this can be reproduced by opening in hexedit and writing zeros over some of the json header keys ;) * incompatible avro schema (eg. {{org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Avro to catalyst because schema at path rec.xxx is not compatible (avroType = \{"type":"array","items":"float"}, sqlType = StructType( ...}}) {{ignoreCorruptFiles}} successfully squashes the first error, but not the second or third schema-related errors. I'll see if I can get a release on our test files, otherwise I'll generate new ones for you. > Honor ParseMode in AvroFileFormat > - > > Key: SPARK-27093 > URL: https://issues.apache.org/jira/browse/SPARK-27093 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 2.4.0 >Reporter: Tim Cerexhe >Priority: Major > > The Avro reader is missing the ability to handle malformed or truncated files > like the JSON reader. Currently it throws exceptions when it encounters any > bad or truncated record in an Avro file, causing the entire Spark job to fail > from a single dodgy file. > Ideally the AvroFileFormat would accept a Permissive or DropMalformed > ParseMode like Spark's JSON format. This would enable the the Avro reader to > drop bad records and continue processing the good records rather than abort > the entire job. > Obviously the default could remain as FailFastMode, which is the current > effective behavior, so this wouldn’t break any existing users. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
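The three parse modes the reporter asks for would mirror the JSON reader's behavior: fail fast on the first malformed record, drop it silently, or keep a null placeholder row. A minimal sketch of the dispatch (hypothetical types; Spark's real mode handling lives elsewhere, and the "malformed record" here is just a stand-in for a corrupt Avro block):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of FAILFAST / DROPMALFORMED / PERMISSIVE handling,
// using integer parsing as a stand-in for decoding an Avro record.
public class ParseModes {
    enum ParseMode { FAIL_FAST, DROP_MALFORMED, PERMISSIVE }

    public static List<Integer> read(List<String> raw, ParseMode mode) {
        List<Integer> out = new ArrayList<>();
        for (String r : raw) {
            try {
                out.add(Integer.parseInt(r));
            } catch (NumberFormatException e) {
                switch (mode) {
                    case FAIL_FAST:
                        throw new IllegalStateException("malformed record: " + r, e);
                    case DROP_MALFORMED:
                        break;          // silently skip the bad record
                    case PERMISSIVE:
                        out.add(null);  // keep a null row in its place
                        break;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> raw = List.of("1", "oops", "3");
        System.out.println(read(raw, ParseMode.DROP_MALFORMED)); // [1, 3]
        System.out.println(read(raw, ParseMode.PERMISSIVE));     // [1, null, 3]
    }
}
```

As the ticket notes, FAIL_FAST as the default would preserve today's behavior, so existing jobs would be unaffected.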
[jira] [Assigned] (SPARK-27061) Expose Driver UI port on driver service to access UI using service
[ https://issues.apache.org/jira/browse/SPARK-27061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-27061: -- Assignee: Chandu Kavar > Expose Driver UI port on driver service to access UI using service > -- > > Key: SPARK-27061 > URL: https://issues.apache.org/jira/browse/SPARK-27061 > Project: Spark > Issue Type: Task > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Chandu Kavar >Assignee: Chandu Kavar >Priority: Minor > Labels: Kubernetes > Fix For: 3.0.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > Currently, we can access the driver UI using > {{kubectl port-forward 4040:4040}} > as mentioned in > [https://spark.apache.org/docs/latest/running-on-kubernetes.html#accessing-driver-ui] > We have users who submit Spark jobs to Kubernetes but don't have access > to the cluster, so they can't use the kubectl port-forward command. > If we expose port 4040 on the driver service, we can easily relay the UI > through the driver service and an Nginx reverse proxy. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27061) Expose Driver UI port on driver service to access UI using service
[ https://issues.apache.org/jira/browse/SPARK-27061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-27061. Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23990 [https://github.com/apache/spark/pull/23990] > Expose Driver UI port on driver service to access UI using service > -- > > Key: SPARK-27061 > URL: https://issues.apache.org/jira/browse/SPARK-27061 > Project: Spark > Issue Type: Task > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Chandu Kavar >Priority: Minor > Labels: Kubernetes > Fix For: 3.0.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > Currently, we can access the driver UI using > {{kubectl port-forward 4040:4040}} > as mentioned in > [https://spark.apache.org/docs/latest/running-on-kubernetes.html#accessing-driver-ui] > We have users who submit Spark jobs to Kubernetes but don't have access > to the cluster, so they can't use the kubectl port-forward command. > If we expose port 4040 on the driver service, we can easily relay the UI > through the driver service and an Nginx reverse proxy. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26998) spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor processes in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-26998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789786#comment-16789786 ] Gabor Somogyi commented on SPARK-26998: --- Same understanding, chosen the file approach. > spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor > processes in Standalone mode > --- > > Key: SPARK-26998 > URL: https://issues.apache.org/jira/browse/SPARK-26998 > Project: Spark > Issue Type: Bug > Components: Scheduler, Security, Spark Core >Affects Versions: 2.3.3, 2.4.0 >Reporter: t oo >Priority: Major > Labels: SECURITY, Security, secur, security, security-issue > > Run spark standalone mode, then start a spark-submit requiring at least 1 > executor. Do a 'ps -ef' on linux (ie putty terminal) and you will be able to > see spark.ssl.keyStorePassword value in plaintext! > > spark.ssl.keyStorePassword and spark.ssl.keyPassword don't need to be passed > to CoarseGrainedExecutorBackend. Only spark.ssl.trustStorePassword is used. > > Can be resolved if below PR is merged: > [[Github] Pull Request #21514 > (tooptoop4)|https://github.com/apache/spark/pull/21514] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26990) Difference in handling of mixed-case partition column names after SPARK-26188
[ https://issues.apache.org/jira/browse/SPARK-26990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins resolved SPARK-26990. --- Resolution: Fixed > Difference in handling of mixed-case partition column names after SPARK-26188 > - > > Key: SPARK-26990 > URL: https://issues.apache.org/jira/browse/SPARK-26990 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.4.1 > Reporter: Bruce Robbins > Priority: Major > Fix For: 2.4.1, 3.0.0 > > > I noticed that the [PR for > SPARK-26188|https://github.com/apache/spark/pull/23165] changed how > mixed-case partition columns are handled when the user provides a schema. > Say I have this file structure (note that each instance of `pS` is mixed > case): > {noformat} > bash-3.2$ find partitioned5 -type d > partitioned5 > partitioned5/pi=2 > partitioned5/pi=2/pS=foo > partitioned5/pi=2/pS=bar > partitioned5/pi=1 > partitioned5/pi=1/pS=foo > partitioned5/pi=1/pS=bar > bash-3.2$ > {noformat} > If I load the file with a user-provided schema in 2.4 (before the PR was > committed) or 2.3, I see: > {noformat} > scala> val df = spark.read.schema("intField int, pi int, ps > string").parquet("partitioned5") > df: org.apache.spark.sql.DataFrame = [intField: int, pi: int ... 1 more field] > scala> df.printSchema > root > |-- intField: integer (nullable = true) > |-- pi: integer (nullable = true) > |-- ps: string (nullable = true) > scala> > {noformat} > However, using 2.4 after the PR was committed, I see: > {noformat} > scala> val df = spark.read.schema("intField int, pi int, ps > string").parquet("partitioned5") > df: org.apache.spark.sql.DataFrame = [intField: int, pi: int ... 1 more field] > scala> df.printSchema > root > |-- intField: integer (nullable = true) > |-- pi: integer (nullable = true) > |-- pS: string (nullable = true) > scala> > {noformat} > Spark is picking up the mixed-case column name {{pS}} from the directory > name, not the lower-case {{ps}} from my specified schema. 
> In all tests, {{spark.sql.caseSensitive}} is set to the default (false). > Not sure if this is a bug, but it is a difference.
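The behavioral difference boils down to which casing wins when a partition column discovered from a directory name is matched, case-insensitively, against the user-provided schema. A small hypothetical model of that reconciliation (the helper name and flags are illustrative, not Spark's actual code):

```python
def resolve_partition_columns(user_schema, inferred,
                              case_sensitive=False, prefer_user_schema=True):
    """Reconcile inferred partition column names with a user schema.

    user_schema: column names the user supplied (e.g. from schema("...")).
    inferred: partition column names parsed from directory names.
    prefer_user_schema=True mimics the behavior described above for
    2.3/2.4-before-the-PR (the schema's casing wins); False mimics the
    behavior after the SPARK-26188 PR (the directory's casing wins).
    """
    if case_sensitive:
        return list(inferred)
    by_lower = {c.lower(): c for c in user_schema}
    resolved = []
    for col in inferred:
        user_name = by_lower.get(col.lower())
        if user_name is not None and prefer_user_schema:
            resolved.append(user_name)  # lower-case "ps" from the schema
        else:
            resolved.append(col)        # mixed-case "pS" from the directory
    return resolved

# Before the PR: schema casing wins.
assert resolve_partition_columns(["intField", "pi", "ps"],
                                 ["pi", "pS"]) == ["pi", "ps"]
# After the PR: directory casing wins.
assert resolve_partition_columns(["intField", "pi", "ps"], ["pi", "pS"],
                                 prefer_user_schema=False) == ["pi", "pS"]
```

Either answer is arguably defensible under case-insensitive resolution; the issue is that the resolved casing silently changed between releases.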
[jira] [Updated] (SPARK-26990) Difference in handling of mixed-case partition column names after SPARK-26188
[ https://issues.apache.org/jira/browse/SPARK-26990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-26990: -- Fix Version/s: 2.4.1
[jira] [Commented] (SPARK-26998) spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor processes in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-26998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789766#comment-16789766 ] Marcelo Vanzin commented on SPARK-26998: There are 3 ways to solve this: pipe, file, or env variable. Pick one.
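Of the three options, the file approach (which the thread settles on) can be sketched as follows. This is a hypothetical submission helper, not Spark code; it shows why the secret never appears in `ps -ef` output: the password lives in an owner-only properties file referenced via `--properties-file`, not in an inline `--conf` argument.

```python
import os
import stat
import tempfile

def build_submit_command(app_jar, secret_conf):
    """Write sensitive settings to an owner-only properties file and
    reference it with --properties-file instead of inline --conf pairs."""
    fd, path = tempfile.mkstemp(suffix=".conf", prefix="spark-secret-")
    with os.fdopen(fd, "w") as f:
        for key, value in secret_conf.items():
            f.write(f"{key} {value}\n")  # spark-defaults.conf syntax
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)  # 0600: only the owner reads it
    return ["spark-submit", "--properties-file", path, app_jar], path

cmd, conf_path = build_submit_command(
    "app.jar", {"spark.ssl.keyStorePassword": "s3cret"})
# The password is in the 0600 file, not on the visible command line:
assert "s3cret" not in " ".join(cmd)
os.remove(conf_path)
```

Note this protects the submitting command line only; as the follow-up comments discuss, the executor side (CoarseGrainedExecutorBackend) also needs to stop receiving these values as process arguments.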
[jira] [Assigned] (SPARK-27128) Migrate JSON to File Data Source V2
[ https://issues.apache.org/jira/browse/SPARK-27128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27128: Assignee: (was: Apache Spark) > Migrate JSON to File Data Source V2 > --- > > Key: SPARK-27128 > URL: https://issues.apache.org/jira/browse/SPARK-27128 > Project: Spark > Issue Type: Task > Components: SQL > Affects Versions: 3.0.0 > Reporter: Gengliang Wang > Priority: Major
[jira] [Resolved] (SPARK-27119) Do not infer schema when reading Hive serde table with native data source
[ https://issues.apache.org/jira/browse/SPARK-27119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-27119. - Resolution: Fixed Fix Version/s: 3.0.0 > Do not infer schema when reading Hive serde table with native data source > - > > Key: SPARK-27119 > URL: https://issues.apache.org/jira/browse/SPARK-27119 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.0.0 > Reporter: Wenchen Fan > Assignee: Wenchen Fan > Priority: Major > Fix For: 3.0.0
[jira] [Assigned] (SPARK-27128) Migrate JSON to File Data Source V2
[ https://issues.apache.org/jira/browse/SPARK-27128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27128: Assignee: Apache Spark > Migrate JSON to File Data Source V2 > --- > > Key: SPARK-27128 > URL: https://issues.apache.org/jira/browse/SPARK-27128 > Project: Spark > Issue Type: Task > Components: SQL > Affects Versions: 3.0.0 > Reporter: Gengliang Wang > Assignee: Apache Spark > Priority: Major
[jira] [Created] (SPARK-27128) Migrate JSON to File Data Source V2
Gengliang Wang created SPARK-27128: -- Summary: Migrate JSON to File Data Source V2 Key: SPARK-27128 URL: https://issues.apache.org/jira/browse/SPARK-27128 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.0.0 Reporter: Gengliang Wang
[jira] [Assigned] (SPARK-26839) on JDK11, IsolatedClientLoader must be able to load java.sql classes
[ https://issues.apache.org/jira/browse/SPARK-26839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26839: Assignee: Apache Spark > on JDK11, IsolatedClientLoader must be able to load java.sql classes > > > Key: SPARK-26839 > URL: https://issues.apache.org/jira/browse/SPARK-26839 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Assignee: Apache Spark >Priority: Major > > This might be very specific to my fork & a kind of weird system setup I'm > working on, I haven't completely confirmed yet, but I wanted to report it > anyway in case anybody else sees this. > When I try to do anything which touches the metastore on java11, I > immediately get errors from IsolatedClientLoader that it can't load anything > in java.sql. eg. > {noformat} > scala> spark.sql("show tables").show() > java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: > java/sql/SQLTransientException when creating Hive client using classpath: > file:/home/systest/jdk-11.0.2/, ... > ... > Caused by: java.lang.ClassNotFoundException: java.sql.SQLTransientException > at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471) > at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:588) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:230) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:219) > at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521) > {noformat} > After a bit of debugging, I also discovered that the {{rootClassLoader}} is > {{null}} in {{IsolatedClientLoader}}. I think this would work if either > {{rootClassLoader}} could load those classes, or if {{isShared()}} was > changed to allow any class starting with "java." (I'm not sure why it only > allows "java.lang" and "java.net" currently.) 
[jira] [Assigned] (SPARK-26839) on JDK11, IsolatedClientLoader must be able to load java.sql classes
[ https://issues.apache.org/jira/browse/SPARK-26839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26839: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-26860) Improve RangeBetween docs in Pyspark, SparkR
[ https://issues.apache.org/jira/browse/SPARK-26860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789703#comment-16789703 ] Jagadesh Kiran N commented on SPARK-26860: -- Thanks [~srowen] for committing this. > Improve RangeBetween docs in Pyspark, SparkR > > > Key: SPARK-26860 > URL: https://issues.apache.org/jira/browse/SPARK-26860 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark > Affects Versions: 2.4.0 > Reporter: Shelby Vanhooser > Assignee: Jagadesh Kiran N > Priority: Minor > Fix For: 3.0.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > The docs describing > [RangeBetween|http://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/sql/window.html#Window.rangeBetween] > for PySpark appear to be duplicates of > [RowsBetween|http://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/sql/window.html#Window.rowsBetween] > even though these are functionally different windows. rowsBetween > references preceding and succeeding rows, but rangeBetween is based on the > values in these rows.
[jira] [Commented] (SPARK-26839) on JDK11, IsolatedClientLoader must be able to load java.sql classes
[ https://issues.apache.org/jira/browse/SPARK-26839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789706#comment-16789706 ] Sean Owen commented on SPARK-26839: --- I'll open my WIP PR on this. I am not sure datanucleus is the issue after all. What is an issue for sure is IsolatedClientLoader. In "builtin" mode, it tries to get all the JARs from the current ClassLoader. These aren't available anymore in Java 9+. I tried just avoiding making a new URLClassLoader with these JARs in this case, but got some more CNFEs. I may need some extra eyes on this. PR coming.
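The widening of {{isShared()}} suggested in the issue description (delegate any "java." class to the shared loader, not just java.lang and java.net) is easy to model in isolation. This is a hypothetical Python sketch of the predicate only, not Spark's Scala implementation, which also shares scala.*, SLF4J, and Spark's own classes:

```python
# Prefixes the isolated Hive client loader would delegate to the shared
# (parent) loader instead of resolving itself. The proposal: all of
# java.*, rather than only "java.lang." and "java.net.".
SHARED_JDK_PREFIXES = ("java.",)

def is_shared(class_name: str) -> bool:
    """Return True if the class should come from the shared loader."""
    return class_name.startswith(SHARED_JDK_PREFIXES)

# java.sql classes would then resolve via the parent loader instead of
# failing with ClassNotFoundException as in the stack trace above.
assert is_shared("java.sql.SQLTransientException")
assert is_shared("java.lang.String")
assert not is_shared("org.apache.hadoop.hive.ql.metadata.Hive")
```

On JDK 11 the java.sql types live in the java.sql module rather than rt.jar, so an isolated URLClassLoader over a list of JARs cannot find them; delegating the whole java.* namespace sidesteps that.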
[jira] [Updated] (SPARK-27107) Spark SQL Job failing because of Kryo buffer overflow with ORC
[ https://issues.apache.org/jira/browse/SPARK-27107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dhruve Ashar updated SPARK-27107: - Description: The issue occurs while trying to read ORC data and setting the SearchArgument. {code:java} Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 9 Serialization trace: literalList (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl) leaves (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl) at com.esotericsoftware.kryo.io.Output.require(Output.java:163) at com.esotericsoftware.kryo.io.Output.writeVarLong(Output.java:614) at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:538) at com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:147) at com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:141) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534) at 
org.apache.orc.mapred.OrcInputFormat.setSearchArgument(OrcInputFormat.java:96) at org.apache.orc.mapreduce.OrcInputFormat.setSearchArgument(OrcInputFormat.java:57) at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:159) at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:156) at scala.Option.foreach(Option.scala:257) at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.buildReaderWithPartitionValues(OrcFileFormat.scala:156) at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:297) at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:295) at org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:315) at org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:121) at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41) at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at org.apache.spark.sql.execution.python.EvalPythonExec.doExecute(EvalPythonExec.scala:89) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:371) at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41) at
[jira] [Commented] (SPARK-27107) Spark SQL Job failing because of Kryo buffer overflow with ORC
[ https://issues.apache.org/jira/browse/SPARK-27107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789668#comment-16789668 ] Dhruve Ashar commented on SPARK-27107: -- {quote}Could you provide a reproducible test case here?{quote} This is something we can consistently reproduce every single time. I do not have an example with company details stripped off that I can share for this use case at this point. Is there anything specific that you are looking for? I can try to come up with a reproducible case, but it seems to be difficult to reproduce as this is dependent on the user query and the parameters that are being passed to filter the data. {quote}BTW, the Hive default is 4K instead of 4M, isn't it?{quote} The Hive default is 4K and not 4M (that was a typo). Thanks for correcting that. {quote}Technically, Hive implementation also fails when it exceeds the limitation because it's a non-configurable parameter issue.{quote} Yes. The Hive implementation should fail when it exceeds the 10M limit for a SArg, and the PR that I have against the ORC implementation tries to make this configurable so that Spark can control the buffer size if we hit a buffer overflow error. > Spark SQL Job failing because of Kryo buffer overflow with ORC > -- > > Key: SPARK-27107 > URL: https://issues.apache.org/jira/browse/SPARK-27107 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.3.2, 2.4.0 > Reporter: Dhruve Ashar > Priority: Major > > The issue occurs while trying to read ORC data and setting the SearchArgument. > {code:java} > Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. 
> Available: 0, required: 9 > Serialization trace: > literalList > (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl) > leaves (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl) > at com.esotericsoftware.kryo.io.Output.require(Output.java:163) > at com.esotericsoftware.kryo.io.Output.writeVarLong(Output.java:614) > at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:538) > at > com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:147) > at > com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:141) > at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534) > at > org.apache.orc.mapred.OrcInputFormat.setSearchArgument(OrcInputFormat.java:96) > at > org.apache.orc.mapreduce.OrcInputFormat.setSearchArgument(OrcInputFormat.java:57) > at > 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:159) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:156) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.buildReaderWithPartitionValues(OrcFileFormat.scala:156) > at > org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:297) > at > org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:295) > at > org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:315) > at > org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:121) > at > org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41) > at >
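The failure mode above is easy to model: Kryo's Output grows its buffer by doubling, but only up to a maximum, and throws once the serialized SearchArgument no longer fits. A toy model follows; the function name and sizes are illustrative, and the point is only that making the maximum configurable (as the ORC-side PR proposes) turns a hard failure into a tunable:

```python
def required_capacity(payload_len, initial=4 * 1024, max_size=None):
    """Return the buffer capacity Kryo-style doubling reaches for a payload,
    raising once growth hits max_size without fitting the payload.

    With max_size=None the buffer cannot grow past its initial size,
    mirroring a Kryo Output created with a single bufferSize argument.
    """
    limit = max_size if max_size is not None else initial
    capacity = initial
    while capacity < payload_len:
        if capacity >= limit:
            raise OverflowError(
                f"buffer overflow: required {payload_len}, max {limit}")
        capacity = min(capacity * 2, limit)
    return capacity

big_sarg = 6 * 1024 * 1024  # e.g. a SearchArgument with a huge IN literal list

# A hard 4 MB ceiling fails, as in the stack trace above.
try:
    required_capacity(big_sarg, max_size=4 * 1024 * 1024)
    overflowed = False
except OverflowError:
    overflowed = True
assert overflowed

# Raising the ceiling (the configurable-buffer proposal) lets it succeed.
assert required_capacity(big_sarg, max_size=16 * 1024 * 1024) >= big_sarg
```

This also explains why the problem is query-dependent: only filters that serialize to an unusually large SearchArgument (e.g. very long literal lists) push past the fixed buffer.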
[jira] [Comment Edited] (SPARK-26998) spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor processes in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-26998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789653#comment-16789653 ] Gabor Somogyi edited comment on SPARK-26998 at 3/11/19 2:47 PM: Since the first part of the PR is solved (http URLs in case of secure mode), I'm continuing with the second issue. In my view the problem can be mitigated by asking users to provide configuration parameters in a configuration file (several commercial products do this): * either spark-defaults.conf * or --properties-file That way the command line options will show either nothing (spark-defaults.conf is picked up by default) or something like "... --properties-file my-secret-spark-properties.conf ...". As a side note, this workaround is available at the moment, but I would like to warn users about such situations. The other approach I've considered (and abandoned) is to open a pipe and send the password through this channel, but since this approach does not really conform to Spark's configuration system, it would imply heavy changes and I don't see the return on investment. [~vanzin], what do you think, since you have quite a bit of experience with security? was (Author: gsomogyi): Since the first part of the PR solved (http URLs in case of secure mode) continuing with the second issue. In my view the problem can be mitigated to ask users to provide configuration parameters either in configuration file (several commercial products does this) * Either spark-defaults.conf * or --properties-file That way the command line options will show either nothing (spark-defaults.conf picked up by default) or something like "... --properties-file my-secret-spark-properties.conf ...". As a side note this workaround is available at the moment but I would like to warn users for such situations. 
The other approach what I've considered (and abandoned) is to open a pipe and send the password through this channel but since this approach is not really conform with Spark's configuration system it would imply heavy changes and don't see the return of investment. [~vanzin] what do you think since you have quite a bit experience with security?
[jira] [Commented] (SPARK-26998) spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor processes in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-26998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789653#comment-16789653 ] Gabor Somogyi commented on SPARK-26998: --- Since the first part of the PR solved (http URLs in case of secure mode) continuing with the second issue. In my view the problem can be mitigated to ask users to provide configuration parameters either in configuration file (several commercial products does this) * Either spark-defaults.conf * or --properties-file That way the command line options will show either nothing (spark-defaults.conf picked up by default) or something like "... --properties-file my-secret-spark-properties.conf ...". As a side note this workaround is available at the moment but I would like to warn users for such situations. The other approach what I've considered (and abandoned) is to open a pipe and send the password through this channel but since this approach is not really conform with Spark's configuration system it would imply heavy changes and don't see the return of investment. [~vanzin] what do you think since you have quite a bit experience with security?
[jira] [Updated] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source
[ https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Ricard updated SPARK-21067: --- Affects Version/s: 2.4.0 > Thrift Server - CTAS fail with Unable to move source > > > Key: SPARK-21067 > URL: https://issues.apache.org/jira/browse/SPARK-21067 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0, 2.4.0 > Environment: Yarn > Hive MetaStore > HDFS (HA) >Reporter: Dominic Ricard >Priority: Major > Attachments: SPARK-21067.patch > > > After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS > would fail, sometimes... > Most of the time, the CTAS would work only once, after starting the thrift > server. After that, dropping the table and re-issuing the same CTAS would > fail with the following message (sometimes it fails right away, sometimes it > works for a long period of time): > {noformat} > Error: org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0 > to destination > hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > (state=,code=0) > {noformat} > We have already found the following Jira > (https://issues.apache.org/jira/browse/SPARK-11021) which states that > {{hive.exec.stagingdir}} had to be added in order for Spark to be able to > handle CREATE TABLE properly as of 2.0. 
As you can see in the error, we have > ours set to "/tmp/hive-staging/\{user.name\}" > Same issue with INSERT statements: > {noformat} > CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE > dricard.test SELECT 1; > Error: org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0 > to destination > hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > (state=,code=0) > {noformat} > This worked fine in 1.6.2, which we currently run in our Production > Environment but since 2.0+, we haven't been able to CREATE TABLE consistently > on the cluster. > SQL to reproduce issue: > {noformat} > DROP SCHEMA IF EXISTS dricard CASCADE; > CREATE SCHEMA dricard; > CREATE TABLE dricard.test (col1 int); > INSERT INTO TABLE dricard.test SELECT 1; > SELECT * from dricard.test; > DROP TABLE dricard.test; > CREATE TABLE dricard.test AS select 1 as `col1`; > SELECT * from dricard.test > {noformat} > Thrift server usually fails at INSERT... > Tried the same procedure in a spark context using spark.sql() and didn't > encounter the same issue. 
> Full stack Trace: > {noformat} > 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error > executing query, currentState RUNNING, > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0 > to destination hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106) > at > org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92) > at org.apache.spark.sql.Dataset.<init>(Dataset.scala:185) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) > at
[jira] [Comment Edited] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source
[ https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789589#comment-16789589 ] Mori[A]rty edited comment on SPARK-21067 at 3/11/19 2:04 PM: - Spark 2.4.0 still has the same problem, and I want to elaborate a bit more on how this bug happens. To reproduce it, set hive.server2.enable.doAs to true in hive-site.xml. This causes SparkThriftServer to execute SparkExecuteStatementOperation with org.apache.hive.service.cli.session.HiveSessionImplwithUGI#sessionUgi. The first time a CTAS statement is executed, HiveClientImpl#state#hdfsEncryptionShim#hdfsAdmin#dfs is initialized using the ugi from HiveSessionImplwithUGI#sessionUgi, and it is closed when that session closes. Since HiveClientImpl is shared across SparkThriftServer sessions and HiveClientImpl#state#hdfsEncryptionShim won't be initialized again, subsequent usage of HiveClientImpl#state#hdfsEncryptionShim will throw "java.io.IOException: Filesystem closed". Here is a simpler way to reproduce: 1. Open Session1 with beeline and execute SQL: {code:java} CREATE TABLE tbl (i int); INSERT INTO tbl SELECT 1;{code} 2. Close Session1 3. Open Session2 with beeline and execute SQL: {code:java} INSERT INTO tbl SELECT 1; {code} 4. Get exception in Session2: {code:java} Error: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source hdfs://xxx/user/hive/warehouse/tbl/.hive-staging_hive_2019-03-11_20-22-53_627_4903214085803350096-2/-ext-1/part-0-b6277b02-c70f-4784-bc99-1201ac6e6364-c000 to destination hdfs://xxx/user/hive/warehouse/tbl/part-0-b6277b02-c70f-4784-bc99-1201ac6e6364-c000; (state=,code=0){code} By the way, [~xinxianyin]'s patch works for me and I think it is the simplest way to solve this issue. was (Author: moriarty279): Spark 2.4.0 still has the same problem. And I want to elaborate a bit more about how this bug happened. 
To reproduce this bug, we have to set hive.server2.enable.doAs to true in hive-site.xml. It causes SparkThriftServer to execute SparkExecuteStatementOperation with org.apache.hive.service.cli.session.HiveSessionImplwithUGI#sessionUgi. When the first time a CTAS statement is executed, HiveClientImpl#state#hdfsEncryptionShim#hdfsAdmin#dfs is initialized using ugi from HiveSessionImplwithUGI#sessionUgi. HiveClientImpl#state#hdfsEncryptionShim#hdfsAdmin#dfs will be closed when session closes. Since HiveClientImpl is shared among SparkThriftServer and HiveClientImpl#state#hdfsEncryptionShim won't be initialized again, subsequent usage of HiveClientImpl#state#hdfsEncryptionShim will throw "java.io.IOException: Filesystem closed". Here are the most simple steps to reproduce: 1. Open Session1 with beeline and execute SQL: {code:java} CREATE TABLE tbl (i int); INSERT INTO tbl SELECT 1;{code} 2. Close Session1 3. Open Session2 with beeline and execute SQL: {code:java} INSERT INTO tbl SELECT 1; {code} 4. Get exception in Session2: {code:java} Error: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source hdfs://xxx/user/hive/warehouse/tbl/.hive-staging_hive_2019-03-11_20-22-53_627_4903214085803350096-2/-ext-1/part-0-b6277b02-c70f-4784-bc99-1201ac6e6364-c000 to destination hdfs://xxx/user/hive/warehouse/tbl/part-0-b6277b02-c70f-4784-bc99-1201ac6e6364-c000; (state=,code=0) {code}
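The failure mode described above — a lazily initialized handle cached inside a shared client, closed by whichever session first owned it — is easy to model outside Spark. A Python sketch of the pattern; the class and method names are invented for illustration and only mirror the HiveClientImpl / hdfsEncryptionShim relationship, not its actual API:

```python
class FakeFileSystem:
    """Stands in for the DFS handle cached under hdfsEncryptionShim."""
    def __init__(self):
        self.closed = False
    def rename(self, src, dst):
        if self.closed:
            raise IOError("Filesystem closed")
        return True
    def close(self):
        self.closed = True

class SharedClient:
    """Stands in for the single HiveClientImpl shared by all sessions."""
    def __init__(self):
        self._fs = None              # lazily initialized, never re-initialized
    def move(self, src, dst):
        if self._fs is None:         # first caller's session wins
            self._fs = FakeFileSystem()
        return self._fs.rename(src, dst)

client = SharedClient()

# Session 1 triggers the lazy init, then closes "its" filesystem on logout.
client.move("/staging/part-0", "/warehouse/part-0")
client._fs.close()

# Session 2 reuses the shared client and hits the closed handle.
try:
    client.move("/staging/part-1", "/warehouse/part-1")
    outcome = "ok"
except IOError as e:
    outcome = str(e)
print(outcome)  # Filesystem closed
```

The fix directions discussed in the ticket amount to either not sharing the handle across sessions or re-initializing it after it has been closed.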
[jira] [Commented] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source
[ https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789589#comment-16789589 ] Mori[A]rty commented on SPARK-21067: Spark 2.4.0 still has the same problem, and I want to elaborate a bit more on how this bug happens. To reproduce it, set hive.server2.enable.doAs to true in hive-site.xml. This causes SparkThriftServer to execute SparkExecuteStatementOperation with org.apache.hive.service.cli.session.HiveSessionImplwithUGI#sessionUgi. The first time a CTAS statement is executed, HiveClientImpl#state#hdfsEncryptionShim#hdfsAdmin#dfs is initialized using the ugi from HiveSessionImplwithUGI#sessionUgi, and it is closed when that session closes. Since HiveClientImpl is shared across SparkThriftServer sessions and HiveClientImpl#state#hdfsEncryptionShim won't be initialized again, subsequent usage of HiveClientImpl#state#hdfsEncryptionShim will throw "java.io.IOException: Filesystem closed". Here are the simplest steps to reproduce: 1. Open Session1 with beeline and execute SQL: {code:java} CREATE TABLE tbl (i int); INSERT INTO tbl SELECT 1;{code} 2. Close Session1 3. Open Session2 with beeline and execute SQL: {code:java} INSERT INTO tbl SELECT 1; {code} 4. 
Get exception in Session2: {code:java} Error: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source hdfs://xxx/user/hive/warehouse/tbl/.hive-staging_hive_2019-03-11_20-22-53_627_4903214085803350096-2/-ext-1/part-0-b6277b02-c70f-4784-bc99-1201ac6e6364-c000 to destination hdfs://xxx/user/hive/warehouse/tbl/part-0-b6277b02-c70f-4784-bc99-1201ac6e6364-c000; (state=,code=0) {code}
[jira] [Comment Edited] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source
[ https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789589#comment-16789589 ] Mori[A]rty edited comment on SPARK-21067 at 3/11/19 1:57 PM: - Spark 2.4.0 still has the same problem, and I want to elaborate a bit more on how this bug happens. To reproduce it, set hive.server2.enable.doAs to true in hive-site.xml. This causes SparkThriftServer to execute SparkExecuteStatementOperation with org.apache.hive.service.cli.session.HiveSessionImplwithUGI#sessionUgi. The first time a CTAS statement is executed, HiveClientImpl#state#hdfsEncryptionShim#hdfsAdmin#dfs is initialized using the ugi from HiveSessionImplwithUGI#sessionUgi, and it is closed when that session closes. Since HiveClientImpl is shared across SparkThriftServer sessions and HiveClientImpl#state#hdfsEncryptionShim won't be initialized again, subsequent usage of HiveClientImpl#state#hdfsEncryptionShim will throw "java.io.IOException: Filesystem closed". Here are the simplest steps to reproduce: 1. Open Session1 with beeline and execute SQL: {code:java} CREATE TABLE tbl (i int); INSERT INTO tbl SELECT 1;{code} 2. Close Session1 3. Open Session2 with beeline and execute SQL: {code:java} INSERT INTO tbl SELECT 1; {code} 4. Get exception in Session2: {code:java} Error: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source hdfs://xxx/user/hive/warehouse/tbl/.hive-staging_hive_2019-03-11_20-22-53_627_4903214085803350096-2/-ext-1/part-0-b6277b02-c70f-4784-bc99-1201ac6e6364-c000 to destination hdfs://xxx/user/hive/warehouse/tbl/part-0-b6277b02-c70f-4784-bc99-1201ac6e6364-c000; (state=,code=0) {code} was (Author: moriarty279): Spark 2.4.0 still has the same problem. And I want to elaborate a bit more about how this bug happened. To reproduce this bug, we have to set hive.server2.enable.doAs to true in hive-site.xml. 
It causes SparkThriftServer to execute SparkExecuteStatementOperation with org.apache.hive.service.cli.session.HiveSessionImplwithUGI#sessionUgi. When the first time a CTAS statement is executed, HiveClientImpl#state#hdfsEncryptionShim#hdfsAdmin#dfs is initialized using ugi from HiveSessionImplwithUGI#sessionUgi. HiveClientImpl#state#hdfsEncryptionShim#hdfsAdmin#dfs will be closed when session closes. Since HiveClientImpl is shared among SparkThriftServer and HiveClientImpl#state#hdfsEncryptionShim won't be initialized again. Subsequent usage of HiveClientImpl#state#hdfsEncryptionShim will throw "java.io.IOException: Filesystem closed". Here are the most simple steps to reproduce: 1. Open Session1 with beeline and execute SQL: {code:java} CREATE TABLE tbl (i int); INSERT INTO tbl select 1;{code} 2. Close Session1 3. Open Session2 with beeline and execute SQL: {code:java} INSERT INTO tbl select 1; {code} 4. Get exception in Session2: {code:java} Error: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source hdfs://xxx/user/hive/warehouse/tbl/.hive-staging_hive_2019-03-11_20-22-53_627_4903214085803350096-2/-ext-1/part-0-b6277b02-c70f-4784-bc99-1201ac6e6364-c000 to destination hdfs://xxx/user/hive/warehouse/tbl/part-0-b6277b02-c70f-4784-bc99-1201ac6e6364-c000; (state=,code=0) {code}
[jira] [Updated] (SPARK-26860) Improve RangeBetween docs in Pyspark, SparkR
[ https://issues.apache.org/jira/browse/SPARK-26860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-26860: -- Labels: (was: docs easyfix python) Priority: Minor (was: Major) Component/s: Documentation Issue Type: Improvement (was: Bug) Summary: Improve RangeBetween docs in Pyspark, SparkR (was: RangeBetween docs appear to be wrong ) > Improve RangeBetween docs in Pyspark, SparkR > > > Key: SPARK-26860 > URL: https://issues.apache.org/jira/browse/SPARK-26860 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 2.4.0 >Reporter: Shelby Vanhooser >Priority: Minor > Original Estimate: 1h > Remaining Estimate: 1h > > The docs describing > [RangeBetween|http://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/sql/window.html#Window.rangeBetween] > for PySpark appear to be duplicates of > [RowsBetween|http://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/sql/window.html#Window.rowsBetween] > even though these are functionally different windows. rowsBetween > references preceding and succeeding rows, but rangeBetween is based on the > values in those rows.
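The semantic difference the ticket wants the docs to capture can be shown without Spark. A plain-Python sketch of a rowsBetween(-1, 0)-style frame versus a rangeBetween(-1, 0)-style frame over an ordered column; the two helper functions are illustrative stand-ins, not PySpark APIs:

```python
values = [1, 1, 2, 3, 5]  # the ordered column a window would be defined over

def rows_frame(vals, i, start, end):
    """rowsBetween-style frame: physical row OFFSETS around row i."""
    lo = max(0, i + start)
    hi = min(len(vals) - 1, i + end)
    return vals[lo:hi + 1]

def range_frame(vals, i, start, end):
    """rangeBetween-style frame: rows whose VALUE lies in [v+start, v+end]."""
    v = vals[i]
    return [x for x in vals if v + start <= x <= v + end]

# At the row holding the value 2 (index 2):
rows_2 = rows_frame(values, 2, -1, 0)    # previous row and current row
range_2 = range_frame(values, 2, -1, 0)  # every row with a value in [1, 2]
print(rows_2, range_2)
```

The frames differ precisely because `range_frame` pulls in both duplicate 1s (their values fall in the range), while `rows_frame` takes exactly one preceding row — which is the distinction the duplicated docs fail to make.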
[jira] [Commented] (SPARK-26961) Found Java-level deadlock in Spark Driver
[ https://issues.apache.org/jira/browse/SPARK-26961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789598#comment-16789598 ] Sean Owen commented on SPARK-26961: --- I think registerAsParallelCapable() sounds like it could resolve the actual issue in practice here. Let's do that, because these ClassLoader subclasses are called concurrently in some cases. We probably can't fix Hadoop here. Looks like there are 4-5 implementations in Spark; I think we can address all of them. It would have to go in a companion object to these classes, in Scala. I don't see a downside here other than that the locking is potentially more expensive as it's finer-grained? > Found Java-level deadlock in Spark Driver > - > > Key: SPARK-26961 > URL: https://issues.apache.org/jira/browse/SPARK-26961 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.3.0 >Reporter: Rong Jialei >Priority: Major > > Our spark job usually finishes in minutes; however, we recently found it taking days to run, and we can only kill it when this happens. > An investigation showed that worker containers could not connect to the driver after start, and the driver is hanging. Using jstack, we found a Java-level deadlock. 
> > *Jstack output for the deadlock part is shown below:* > > Found one Java-level deadlock: > = > "SparkUI-907": > waiting to lock monitor 0x7f387761b398 (object 0x0005c0c1e5e0, a > org.apache.hadoop.conf.Configuration), > which is held by "ForkJoinPool-1-worker-57" > "ForkJoinPool-1-worker-57": > waiting to lock monitor 0x7f3860574298 (object 0x0005b7991168, a > org.apache.spark.util.MutableURLClassLoader), > which is held by "ForkJoinPool-1-worker-7" > "ForkJoinPool-1-worker-7": > waiting to lock monitor 0x7f387761b398 (object 0x0005c0c1e5e0, a > org.apache.hadoop.conf.Configuration), > which is held by "ForkJoinPool-1-worker-57" > Java stack information for the threads listed above: > === > "SparkUI-907": > at org.apache.hadoop.conf.Configuration.getOverlay(Configuration.java:1328) > - waiting to lock <0x0005c0c1e5e0> (a > org.apache.hadoop.conf.Configuration) > at > org.apache.hadoop.conf.Configuration.handleDeprecation(Configuration.java:684) > at org.apache.hadoop.conf.Configuration.get(Configuration.java:1088) > at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1145) > at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2363) > at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2840) > at > org.apache.hadoop.fs.FsUrlStreamHandlerFactory.createURLStreamHandler(FsUrlStreamHandlerFactory.java:74) > at java.net.URL.getURLStreamHandler(URL.java:1142) > at java.net.URL.<init>(URL.java:599) > at java.net.URL.<init>(URL.java:490) > at java.net.URL.<init>(URL.java:439) > at org.apache.spark.ui.JettyUtils$$anon$4.doRequest(JettyUtils.scala:176) > at org.apache.spark.ui.JettyUtils$$anon$4.doGet(JettyUtils.scala:161) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:687) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) > at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) > at > 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:171) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:534) > at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:320) > at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108) > at >
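The jstack dump above is the classic inconsistent-lock-order deadlock: one thread holds the Configuration monitor and wants the ClassLoader's, while another holds the ClassLoader's and wants the Configuration's. A Python sketch of the general repair — imposing a single global acquisition order so every thread takes the locks in the same sequence. The names are invented, and this shows the generic technique rather than Spark's actual fix (which was registerAsParallelCapable(), i.e. replacing the coarse ClassLoader lock with finer-grained per-class locks):

```python
import threading

conf_lock = threading.Lock()    # stands in for the Configuration monitor
loader_lock = threading.Lock()  # stands in for the MutableURLClassLoader monitor

# Fix a global order: every thread acquires conf_lock before loader_lock,
# so the circular wait seen in the jstack dump cannot form.
LOCK_ORDER = [conf_lock, loader_lock]

def with_both_locks(work):
    for lock in LOCK_ORDER:
        lock.acquire()
    try:
        return work()
    finally:
        for lock in reversed(LOCK_ORDER):
            lock.release()

results = []
t1 = threading.Thread(target=lambda: results.append(with_both_locks(lambda: "ui-thread")))
t2 = threading.Thread(target=lambda: results.append(with_both_locks(lambda: "worker-thread")))
t1.start(); t2.start()
t1.join(); t2.join()
print(sorted(results))  # both threads complete; no circular wait is possible
```

Ordered acquisition works when you control every lock site; registerAsParallelCapable() is the right tool here precisely because the Hadoop side of the cycle cannot be changed from Spark.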
[jira] [Commented] (SPARK-27124) Expose org.apache.spark.sql.avro.SchemaConverters as developer API
[ https://issues.apache.org/jira/browse/SPARK-27124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789591#comment-16789591 ] Gabor Somogyi commented on SPARK-27124: --- I would mention that "SchemaConverters can be used for schema conversion" and that "in Python, Py4J can be used to reach it", or something like this. I agree that all the technical details should not be added, since they can change easily. > Expose org.apache.spark.sql.avro.SchemaConverters as developer API > -- > > Key: SPARK-27124 > URL: https://issues.apache.org/jira/browse/SPARK-27124 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Minor > > org.apache.spark.sql.avro.SchemaConverters provides extremely useful APIs to > convert schemas between Spark SQL and Avro. This is reachable from the Scala side > but not from PySpark. I suggest adding this as a developer API to ease > development for PySpark users.
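The Py4J route mentioned in the comment looks roughly like this today. This is a sketch, not a supported API: it assumes a live SparkSession named `spark` with the spark-avro package on the classpath, and `SchemaConverters.toSqlType` may change shape between Spark versions — which is exactly why the ticket asks for a proper developer API:

```python
# Reaching SchemaConverters from PySpark via the Py4J JVM gateway.
# Assumes: a running SparkSession `spark` and spark-avro on the classpath.
avro_schema_json = """
{"type": "record", "name": "User",
 "fields": [{"name": "name", "type": "string"},
            {"name": "age",  "type": "int"}]}
"""

# Parse the Avro schema on the JVM side, then convert it to a Catalyst type.
parser = spark._jvm.org.apache.avro.Schema.Parser()
avro_schema = parser.parse(avro_schema_json)

converters = spark._jvm.org.apache.spark.sql.avro.SchemaConverters
schema_type = converters.toSqlType(avro_schema)
print(schema_type.dataType())  # the StructType mirroring the Avro record
```

Going through `spark._jvm` couples user code to private internals, so a stable wrapper exposed as a developer API would be the cleaner long-term answer.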
[jira] [Resolved] (SPARK-26860) Improve RangeBetween docs in Pyspark, SparkR
[ https://issues.apache.org/jira/browse/SPARK-26860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26860. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23946 [https://github.com/apache/spark/pull/23946]
[jira] [Assigned] (SPARK-26860) Improve RangeBetween docs in Pyspark, SparkR
[ https://issues.apache.org/jira/browse/SPARK-26860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-26860: - Assignee: Jagadesh Kiran N
[jira] [Assigned] (SPARK-27116) Environment tab must sort Hadoop Configuration by default
[ https://issues.apache.org/jira/browse/SPARK-27116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-27116: - Assignee: Ajith S > Environment tab must sort Hadoop Configuration by default > - > > Key: SPARK-27116 > URL: https://issues.apache.org/jira/browse/SPARK-27116 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Ajith S >Assignee: Ajith S >Priority: Minor > > The Environment tab in the Spark UI does not have the Hadoop Configuration table sorted. All other > tables on the same page, like Spark Configurations, System Configuration, etc., > are sorted by key by default
[jira] [Reopened] (SPARK-26860) RangeBetween docs appear to be wrong
[ https://issues.apache.org/jira/browse/SPARK-26860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-26860: ---
[jira] [Resolved] (SPARK-27116) Environment tab must sort Hadoop Configuration by default
[ https://issues.apache.org/jira/browse/SPARK-27116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27116. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24038 [https://github.com/apache/spark/pull/24038]
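The change itself is small: render the Hadoop Configuration the way the other Environment-page tables already do, sorted by key. A plain-Python sketch of the idea (the sample keys are illustrative; the real change lives in Spark's Scala UI code):

```python
# Unsorted key/value pairs, as they might come back from a Hadoop Configuration.
hadoop_conf = [
    ("io.file.buffer.size", "65536"),
    ("dfs.replication", "3"),
    ("fs.defaultFS", "hdfs://nameservice1"),
]

# Sort by key before rendering, matching the other Environment-tab tables.
rows = sorted(hadoop_conf, key=lambda kv: kv[0])
for key, value in rows:
    print(f"{key}\t{value}")
```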
[jira] [Assigned] (SPARK-26152) Flaky test: BroadcastSuite
[ https://issues.apache.org/jira/browse/SPARK-26152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26152: Assignee: (was: Apache Spark) > Flaky test: BroadcastSuite > -- > > Key: SPARK-26152 > URL: https://issues.apache.org/jira/browse/SPARK-26152 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Critical > Attachments: Screenshot from 2019-03-11 17-03-40.png > > > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5627 > (2018-11-16) > {code} > BroadcastSuite: > - Using TorrentBroadcast locally > - Accessing TorrentBroadcast variables from multiple threads > - Accessing TorrentBroadcast variables in a local cluster (encryption = off) > java.util.concurrent.RejectedExecutionException: Task > scala.concurrent.impl.CallbackRunnable@59428a1 rejected from > java.util.concurrent.ThreadPoolExecutor@4096a677[Shutting down, pool size = > 1, active threads = 1, queued tasks = 0, completed tasks = 0] > at > java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047) > at > java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369) > at > java.util.concurrent.Executors$DelegatedExecutorService.execute(Executors.java:668) > at > scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:134) > at > scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68) > at > scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:284) > at > scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:284) > at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:284) > at scala.concurrent.Promise.complete(Promise.scala:49) > at 
scala.concurrent.Promise.complete$(Promise.scala:48) > at > scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:183) > at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:29) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60) > at > scala.concurrent.BatchingExecutor$Batch.processBatch$1(BatchingExecutor.scala:63) > at > scala.concurrent.BatchingExecutor$Batch.$anonfun$run$1(BatchingExecutor.scala:78) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12) > at > scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81) > at > scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:55) > at > scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:870) > at scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:106) > at > scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:103) > at > scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:868) > at > scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68) > at > scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:284) > at > scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:284) > at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:284) > at scala.concurrent.Promise.complete(Promise.scala:49) > at scala.concurrent.Promise.complete$(Promise.scala:48) > at > scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:183) > at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:29) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > java.util.concurrent.RejectedExecutionException: Task > 
scala.concurrent.impl.CallbackRunnable@40a5bf17 rejected from > java.util.concurrent.ThreadPoolExecutor@5a73967[Shutting down, pool size = 1, > active threads = 1, queued tasks = 0, completed tasks = 0] > at > java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047) > at > java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369) > at > java.util.concurrent.Executors$DelegatedExecutorService.execute(Executors.java:668) > at >
[jira] [Comment Edited] (SPARK-27124) Expose org.apache.spark.sql.avro.SchemaConverters as developer API
[ https://issues.apache.org/jira/browse/SPARK-27124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789562#comment-16789562 ] Hyukjin Kwon edited comment on SPARK-27124 at 3/11/19 1:32 PM: --- I think we can mention somewhere, in general terms, that the Py4J approach can be used to access Spark's internal code. To be honest, this approach requires understanding Py4J, so we shouldn't focus on how to use it here; technically, Py4J's own documentation should describe that. was (Author: hyukjin.kwon): I think we can mention that generally somewhere, say, Py4J approach can be used to access to internal codes in Spark. To be honest, this approach requires to understand Py4J. So, we shouldn't focus on how to use it tho. > Expose org.apache.spark.sql.avro.SchemaConverters as developer API > -- > > Key: SPARK-27124 > URL: https://issues.apache.org/jira/browse/SPARK-27124 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Minor > > org.apache.spark.sql.avro.SchemaConverters provides extremely useful APIs to > convert schemas between Spark SQL and Avro. This is reachable from the Scala side > but not from PySpark. I suggest adding this as a developer API to ease > development for PySpark users.
[jira] [Commented] (SPARK-27124) Expose org.apache.spark.sql.avro.SchemaConverters as developer API
[ https://issues.apache.org/jira/browse/SPARK-27124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789562#comment-16789562 ] Hyukjin Kwon commented on SPARK-27124: -- I think we can mention somewhere, in general terms, that the Py4J approach can be used to access Spark's internal code. To be honest, this approach requires understanding Py4J, so we shouldn't focus on how to use it here; technically, Py4J's own documentation should describe that.
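To make the ticket concrete: `SchemaConverters` maps Avro schemas to Spark SQL types on the Scala side. The sketch below is a hypothetical, simplified pure-Python illustration of that kind of mapping (flat records and nullable unions only); it is not the real converter and is not what the ticket would expose.

```python
# Simplified Avro-primitive → Spark-SQL-type mapping (illustrative only).
AVRO_TO_SQL = {"int": "integer", "long": "long", "string": "string",
               "boolean": "boolean", "float": "float", "double": "double"}

def avro_record_to_struct(avro_schema):
    """Map a flat Avro record schema (as a dict) to a StructType-like dict."""
    assert avro_schema["type"] == "record"
    fields = []
    for f in avro_schema["fields"]:
        t = f["type"]
        # In Avro, a union containing "null" marks a nullable field.
        nullable = isinstance(t, list) and "null" in t
        if nullable:
            t = next(x for x in t if x != "null")
        fields.append({"name": f["name"],
                       "type": AVRO_TO_SQL[t],
                       "nullable": nullable})
    return {"type": "struct", "fields": fields}

schema = {"type": "record", "name": "User",
          "fields": [{"name": "id", "type": "long"},
                     {"name": "email", "type": ["null", "string"]}]}
print(avro_record_to_struct(schema))
```

This is the kind of schema derivation that is currently awkward to do by hand from PySpark, which is why the ticket asks for the Scala class to be exposed.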
[jira] [Assigned] (SPARK-26152) Flaky test: BroadcastSuite
[ https://issues.apache.org/jira/browse/SPARK-26152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26152: Assignee: Apache Spark
[jira] [Assigned] (SPARK-26820) Issue Error/Warning when Hint is not applicable
[ https://issues.apache.org/jira/browse/SPARK-26820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26820: Assignee: (was: Apache Spark) > Issue Error/Warning when Hint is not applicable > --- > > Key: SPARK-26820 > URL: https://issues.apache.org/jira/browse/SPARK-26820 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > We should issue an error or a warning when the HINT is not applicable. This > should be configurable.
[jira] [Assigned] (SPARK-26820) Issue Error/Warning when Hint is not applicable
[ https://issues.apache.org/jira/browse/SPARK-26820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26820: Assignee: Apache Spark
[jira] [Commented] (SPARK-27124) Expose org.apache.spark.sql.avro.SchemaConverters as developer API
[ https://issues.apache.org/jira/browse/SPARK-27124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789494#comment-16789494 ] Gabor Somogyi commented on SPARK-27124: --- Yeah, the Python implementation would use the same approach you've shown; this would just be a helper to make life easier. At the moment SchemaConverters is not mentioned in the docs at all. Do you think it is worth adding documentation to explain this area a bit more? (I know it's not a public API, but since it's hard to come up with a proper schema, especially for beginners, people are somehow forced to use it.)
[jira] [Updated] (SPARK-26152) Flaky test: BroadcastSuite
[ https://issues.apache.org/jira/browse/SPARK-26152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ajith S updated SPARK-26152: Attachment: Screenshot from 2019-03-11 17-03-40.png
[jira] [Assigned] (SPARK-27127) Deduplicate codes reading from/writing to unsafe object
[ https://issues.apache.org/jira/browse/SPARK-27127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27127: Assignee: (was: Apache Spark) > Deduplicate codes reading from/writing to unsafe object > --- > > Key: SPARK-27127 > URL: https://issues.apache.org/jira/browse/SPARK-27127 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Minor > > IntelliJ found lots of code duplication among some Unsafe classes; the > duplication comes from reading from and writing to unsafe objects. This issue > tracks the effort to deduplicate the read/write code blocks shared among the > various classes. > This would not only reduce code, but also make the handling of > unsafe objects consistent.
[jira] [Assigned] (SPARK-27127) Deduplicate codes reading from/writing to unsafe object
[ https://issues.apache.org/jira/browse/SPARK-27127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27127: Assignee: Apache Spark
[jira] [Created] (SPARK-27127) Deduplicate codes reading from/writing to unsafe object
Jungtaek Lim created SPARK-27127: Summary: Deduplicate codes reading from/writing to unsafe object Key: SPARK-27127 URL: https://issues.apache.org/jira/browse/SPARK-27127 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Jungtaek Lim IntelliJ found lots of code duplication among some Unsafe classes; the duplication comes from reading from and writing to unsafe objects. This issue tracks the effort to deduplicate the read/write code blocks shared among the various classes. This would not only reduce code, but also make the handling of unsafe objects consistent.
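The deduplication idea can be sketched outside Spark. Below is an illustrative Python example (Spark's actual code is Java/Scala over raw memory): several classes that each hand-roll offset-based primitive reads and writes can instead share one accessor, so the access logic lives in a single place.

```python
import struct

class OffsetAccessor:
    """The one place that knows how to read/write primitives at raw offsets."""
    def __init__(self, buf: bytearray):
        self.buf = buf

    def put_int(self, offset, value):
        struct.pack_into("<i", self.buf, offset, value)

    def get_int(self, offset):
        return struct.unpack_from("<i", self.buf, offset)[0]

    def put_long(self, offset, value):
        struct.pack_into("<q", self.buf, offset, value)

    def get_long(self, offset):
        return struct.unpack_from("<q", self.buf, offset)[0]

# Any "row"-like class reuses the accessor instead of duplicating the logic.
buf = bytearray(16)
acc = OffsetAccessor(buf)
acc.put_int(0, 42)
acc.put_long(8, 1_000_000_000_000)
print(acc.get_int(0), acc.get_long(8))  # → 42 1000000000000
```

Centralizing the accessor is also what keeps byte order and offset conventions consistent across callers, which is the second benefit the ticket mentions.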
[jira] [Comment Edited] (SPARK-26152) Flaky test: BroadcastSuite
[ https://issues.apache.org/jira/browse/SPARK-26152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789462#comment-16789462 ] Ajith S edited comment on SPARK-26152 at 3/11/19 12:07 PM: --- I encountered this issue on the latest master branch and see that it's a race between the *org.apache.spark.deploy.DeployMessages.WorkDirCleanup* event and *org.apache.spark.deploy.worker.Worker#onStop*. It is possible that while the WorkDirCleanup event is being processed, *org.apache.spark.deploy.worker.Worker#cleanupThreadExecutor* has already been shut down; hence any submission to the ThreadPoolExecutor after that will result in a *java.util.concurrent.RejectedExecutionException*. Attaching a debug snapshot of the same. I would like to work on this. Please suggest. was (Author: ajithshetty): I encountered this issue and see that its the race between *org.apache.spark.deploy.DeployMessages.WorkDirCleanup* event and *org.apache.spark.deploy.worker.Worker#onStop*. Here its possible that while the WorkDirCleanup event is being processed, *org.apache.spark.deploy.worker.Worker#cleanupThreadExecutor* was shutdown. hence any submission after ThreadPoolExecutor will result in *java.util.concurrent.RejectedExecutionException* Attaching the debug snapshot of same. I would like to work on this.
Please suggest
[jira] [Comment Edited] (SPARK-26152) Flaky test: BroadcastSuite
[ https://issues.apache.org/jira/browse/SPARK-26152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789462#comment-16789462 ] Ajith S edited comment on SPARK-26152 at 3/11/19 12:07 PM: --- I encountered this issue and see that it's a race between the *org.apache.spark.deploy.DeployMessages.WorkDirCleanup* event and *org.apache.spark.deploy.worker.Worker#onStop*. It is possible that while the WorkDirCleanup event is being processed, *org.apache.spark.deploy.worker.Worker#cleanupThreadExecutor* has already been shut down; hence any submission to the ThreadPoolExecutor after that will result in a *java.util.concurrent.RejectedExecutionException*. Attaching a debug snapshot of the same. I would like to work on this. Please suggest. was (Author: ajithshetty): I encountered this issue and see that its the race between ``org.apache.spark.deploy.DeployMessages.WorkDirCleanup`` event and onStop call of org.apache.spark.deploy.worker.Worker#onStop. Here its possible that while the WorkDirCleanup event is being processed, org.apache.spark.deploy.worker.Worker#cleanupThreadExecutor was shutdown Attaching the debug snapshot of same. I would like to work on this.
Please suggest
[jira] [Commented] (SPARK-26152) Flaky test: BroadcastSuite
[ https://issues.apache.org/jira/browse/SPARK-26152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789462#comment-16789462 ] Ajith S commented on SPARK-26152: - I encountered this issue and see that it's a race between the {{org.apache.spark.deploy.DeployMessages.WorkDirCleanup}} event and the {{org.apache.spark.deploy.worker.Worker#onStop}} call. It's possible that while the WorkDirCleanup event is being processed, {{org.apache.spark.deploy.worker.Worker#cleanupThreadExecutor}} was already shut down. Attaching a debug snapshot of the same. I would like to work on this. Please suggest. > Flaky test: BroadcastSuite > -- > > Key: SPARK-26152 > URL: https://issues.apache.org/jira/browse/SPARK-26152 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Critical > > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5627 > (2018-11-16) > {code} > BroadcastSuite: > - Using TorrentBroadcast locally > - Accessing TorrentBroadcast variables from multiple threads > - Accessing TorrentBroadcast variables in a local cluster (encryption = off) > java.util.concurrent.RejectedExecutionException: Task > scala.concurrent.impl.CallbackRunnable@59428a1 rejected from > java.util.concurrent.ThreadPoolExecutor@4096a677[Shutting down, pool size = > 1, active threads = 1, queued tasks = 0, completed tasks = 0] > at > java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047) > at > java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369) > at > java.util.concurrent.Executors$DelegatedExecutorService.execute(Executors.java:668) > at > scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:134) > at > scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68)
> at > scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:284) > at > scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:284) > at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:284) > at scala.concurrent.Promise.complete(Promise.scala:49) > at scala.concurrent.Promise.complete$(Promise.scala:48) > at > scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:183) > at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:29) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60) > at > scala.concurrent.BatchingExecutor$Batch.processBatch$1(BatchingExecutor.scala:63) > at > scala.concurrent.BatchingExecutor$Batch.$anonfun$run$1(BatchingExecutor.scala:78) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12) > at > scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81) > at > scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:55) > at > scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:870) > at scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:106) > at > scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:103) > at > scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:868) > at > scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68) > at > scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:284) > at > scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:284) > at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:284) > at scala.concurrent.Promise.complete(Promise.scala:49) > at scala.concurrent.Promise.complete$(Promise.scala:48) > at > scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:183) > at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:29) > at 
scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > java.util.concurrent.RejectedExecutionException: Task > scala.concurrent.impl.CallbackRunnable@40a5bf17 rejected from > java.util.concurrent.ThreadPoolExecutor@5a73967[Shutting down, pool size = 1, > active threads = 1, queued tasks = 0, completed tasks = 0] > at >
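The {{RejectedExecutionException}} in the trace above is the generic symptom of handing work to a thread pool that has already started shutting down, which is exactly the Worker/WorkDirCleanup race described in the comment. A minimal illustration of that failure mode, sketched in Python rather than on the JVM (this is not Spark code):

```python
from concurrent.futures import ThreadPoolExecutor

# A task submitted while the pool is running is accepted normally.
pool = ThreadPoolExecutor(max_workers=1)
pool.submit(lambda: None).result()

# Once shutdown has begun, further submissions are rejected -- the
# Python analogue of Java's RejectedExecutionException seen above.
pool.shutdown(wait=True)
try:
    pool.submit(lambda: None)
    rejected = False
except RuntimeError:  # "cannot schedule new futures after shutdown"
    rejected = True

print(rejected)  # True
```

The snippet only demonstrates why a late submission fails; a fix in the Worker would need to order shutdown so that no cleanup callback can reach the executor after it stops accepting tasks.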
[jira] [Assigned] (SPARK-27126) Consolidate Scala and Java type deserializerFor
[ https://issues.apache.org/jira/browse/SPARK-27126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27126: Assignee: (was: Apache Spark) > Consolidate Scala and Java type deserializerFor > --- > > Key: SPARK-27126 > URL: https://issues.apache.org/jira/browse/SPARK-27126 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > There are some duplicates between {{ScalaReflection}} and > {{JavaTypeInference}}. Ideally we should consolidate them. > This proposes to consolidate {{deserializerFor}} between {{ScalaReflection}} > and {{JavaTypeInference}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27126) Consolidate Scala and Java type deserializerFor
Liang-Chi Hsieh created SPARK-27126: --- Summary: Consolidate Scala and Java type deserializerFor Key: SPARK-27126 URL: https://issues.apache.org/jira/browse/SPARK-27126 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Liang-Chi Hsieh There are some duplicates between {{ScalaReflection}} and {{JavaTypeInference}}. Ideally we should consolidate them. This proposes to consolidate {{deserializerFor}} between {{ScalaReflection}} and {{JavaTypeInference}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27126) Consolidate Scala and Java type deserializerFor
[ https://issues.apache.org/jira/browse/SPARK-27126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27126: Assignee: Apache Spark > Consolidate Scala and Java type deserializerFor > --- > > Key: SPARK-27126 > URL: https://issues.apache.org/jira/browse/SPARK-27126 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark >Priority: Major > > There are some duplicates between {{ScalaReflection}} and > {{JavaTypeInference}}. Ideally we should consolidate them. > This proposes to consolidate {{deserializerFor}} between {{ScalaReflection}} > and {{JavaTypeInference}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27124) Expose org.apache.spark.sql.avro.SchemaConverters as developer API
[ https://issues.apache.org/jira/browse/SPARK-27124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789388#comment-16789388 ] Hyukjin Kwon commented on SPARK-27124: -- The way of reaching it will be the same as the Python implementation. Py4J allows full JVM access. Of course, it's hacky - I wasn't trying to say this is an official way of using it. {code} >>> spark._jvm.org.apache.spark.sql.avro.SchemaConverters.toSqlType(spark._jvm.org.apache.avro.Schema.Parser().parse("""{"type": >>> "int", "name": "fieldA"}""")).toString() u'SchemaType(IntegerType,false)' {code} Usually the signatures are matched between the Scala and Python sides. I suspect that you'd open a function that takes a JSON-formatted Avro schema on the PySpark side, right? > Expose org.apache.spark.sql.avro.SchemaConverters as developer API > -- > > Key: SPARK-27124 > URL: https://issues.apache.org/jira/browse/SPARK-27124 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Minor > > org.apache.spark.sql.avro.SchemaConverters provides extremely useful APIs to > convert schema between Spark SQL and avro. This is reachable from scala side > but not from pyspark. I suggest to add this as a developer API to ease > development for pyspark users. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-26985) Test "access only some column of the all of columns " fails on big endian
[ https://issues.apache.org/jira/browse/SPARK-26985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anuja Jakhade reopened SPARK-26985: --- I did try with OpenJDK. However, the same behavior is observed. The test fails with the same error. > Test "access only some column of the all of columns " fails on big endian > - > > Key: SPARK-26985 > URL: https://issues.apache.org/jira/browse/SPARK-26985 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 > Environment: Linux Ubuntu 16.04 > openjdk version "1.8.0_202" > OpenJDK Runtime Environment (build 1.8.0_202-b08) > Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 64-Bit Compressed > References 20190205_218 (JIT enabled, AOT enabled) > OpenJ9 - 90dd8cb40 > OMR - d2f4534b > JCL - d002501a90 based on jdk8u202-b08) > >Reporter: Anuja Jakhade >Priority: Major > Labels: BigEndian > Attachments: DataFrameTungstenSuite.txt, > InMemoryColumnarQuerySuite.txt, access only some column of the all of > columns.txt > > > While running tests on Apache Spark v2.3.2 with AdoptJDK on big endian, I am > observing test failures for 2 Suites of Project SQL. > 1. InMemoryColumnarQuerySuite > 2. DataFrameTungstenSuite > In both the cases test "access only some column of the all of columns" fails > due to mismatch in the final assert. > Observed that the data obtained after df.cache() is causing the error. Please > find attached the log with the details. > cache() works perfectly fine if double and float values are not in picture. > Inside test !!- access only some column of the all of columns *** FAILED > *** -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26940) Observed greater deviation on big endian platform for SingletonReplSuite test case
[ https://issues.apache.org/jira/browse/SPARK-26940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789393#comment-16789393 ] Anuja Jakhade commented on SPARK-26940: --- I did try with OpenJDK. However, the same behavior is observed. The test fails with the same error. > Observed greater deviation on big endian platform for SingletonReplSuite test > case > -- > > Key: SPARK-26940 > URL: https://issues.apache.org/jira/browse/SPARK-26940 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.3.2 > Environment: Ubuntu 16.04 LTS > openjdk version "1.8.0_202" > OpenJDK Runtime Environment (build 1.8.0_202-b08) > Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 Linux (JIT enabled, AOT > enabled) > OpenJ9 - 90dd8cb40 > OMR - d2f4534b > JCL - d002501a90 based on jdk8u202-b08) >Reporter: Anuja Jakhade >Priority: Minor > Labels: BigEndian > Attachments: failure_log.txt > > > I have built Apache Spark v2.3.2 on Big Endian platform with AdoptJDK OpenJ9 > 1.8.0_202. > My build is successful. However while running the scala tests of "*Spark > Project REPL*" module, I am facing failures at SingletonReplSuite with error > log as attached. > The deviation observed on big endian is greater than the acceptable deviation > 0.2. > How efficient is it to increase the deviation defined in > SingletonReplSuite.scala > Can this be fixed? > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-26940) Observed greater deviation on big endian platform for SingletonReplSuite test case
[ https://issues.apache.org/jira/browse/SPARK-26940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anuja Jakhade reopened SPARK-26940: --- I did try with OpenJDK. However, the same behavior is observed. The test fails with the same error. > Observed greater deviation on big endian platform for SingletonReplSuite test > case > -- > > Key: SPARK-26940 > URL: https://issues.apache.org/jira/browse/SPARK-26940 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.3.2 > Environment: Ubuntu 16.04 LTS > openjdk version "1.8.0_202" > OpenJDK Runtime Environment (build 1.8.0_202-b08) > Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 Linux (JIT enabled, AOT > enabled) > OpenJ9 - 90dd8cb40 > OMR - d2f4534b > JCL - d002501a90 based on jdk8u202-b08) >Reporter: Anuja Jakhade >Priority: Minor > Labels: BigEndian > Attachments: failure_log.txt > > > I have built Apache Spark v2.3.2 on Big Endian platform with AdoptJDK OpenJ9 > 1.8.0_202. > My build is successful. However while running the scala tests of "*Spark > Project REPL*" module, I am facing failures at SingletonReplSuite with error > log as attached. > The deviation observed on big endian is greater than the acceptable deviation > 0.2. > How efficient is it to increase the deviation defined in > SingletonReplSuite.scala > Can this be fixed? > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26985) Test "access only some column of the all of columns " fails on big endian
[ https://issues.apache.org/jira/browse/SPARK-26985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789392#comment-16789392 ] Anuja Jakhade commented on SPARK-26985: --- I did try with OpenJDK. However, the same behavior is observed. The test fails with the same error. > Test "access only some column of the all of columns " fails on big endian > - > > Key: SPARK-26985 > URL: https://issues.apache.org/jira/browse/SPARK-26985 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 > Environment: Linux Ubuntu 16.04 > openjdk version "1.8.0_202" > OpenJDK Runtime Environment (build 1.8.0_202-b08) > Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 64-Bit Compressed > References 20190205_218 (JIT enabled, AOT enabled) > OpenJ9 - 90dd8cb40 > OMR - d2f4534b > JCL - d002501a90 based on jdk8u202-b08) > >Reporter: Anuja Jakhade >Priority: Major > Labels: BigEndian > Attachments: DataFrameTungstenSuite.txt, > InMemoryColumnarQuerySuite.txt, access only some column of the all of > columns.txt > > > While running tests on Apache Spark v2.3.2 with AdoptJDK on big endian, I am > observing test failures for 2 Suites of Project SQL. > 1. InMemoryColumnarQuerySuite > 2. DataFrameTungstenSuite > In both the cases test "access only some column of the all of columns" fails > due to mismatch in the final assert. > Observed that the data obtained after df.cache() is causing the error. Please > find attached the log with the details. > cache() works perfectly fine if double and float values are not in picture. > Inside test !!- access only some column of the all of columns *** FAILED > *** -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27124) Expose org.apache.spark.sql.avro.SchemaConverters as developer API
[ https://issues.apache.org/jira/browse/SPARK-27124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789391#comment-16789391 ] Hyukjin Kwon commented on SPARK-27124: -- I think we have workarounds, and it looks difficult to match the signatures. Usually the PySpark, R and Scala sides have matched APIs. I am not sure this is important enough to get over all those concerns. > Expose org.apache.spark.sql.avro.SchemaConverters as developer API > -- > > Key: SPARK-27124 > URL: https://issues.apache.org/jira/browse/SPARK-27124 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Minor > > org.apache.spark.sql.avro.SchemaConverters provides extremely useful APIs to > convert schema between Spark SQL and avro. This is reachable from scala side > but not from pyspark. I suggest to add this as a developer API to ease > development for pyspark users. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27124) Expose org.apache.spark.sql.avro.SchemaConverters as developer API
[ https://issues.apache.org/jira/browse/SPARK-27124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789388#comment-16789388 ] Hyukjin Kwon edited comment on SPARK-27124 at 3/11/19 10:39 AM: The way of reaching it will be same as the Python implementation (the POC you worked on). Py4J allows JVM access fully. Of course, it's hacky - I wasn't trying to say this is an official way of using it. {code} >>> spark._jvm.org.apache.spark.sql.avro.SchemaConverters.toSqlType(spark._jvm.org.apache.avro.Schema.Parser().parse("""{"type": >>> "int", "name": "fieldA"}""")).toString() u'SchemaType(IntegerType,false)' {code} Usually the signatures are matched between Scala and Python sides. I suspect that you'd open a function that takes JSON-formatted schema in Avro in PySpark side, right? was (Author: hyukjin.kwon): The way of reaching it will be same as the Python implementation. Py4J allows JVM access fully. Of course, it's hacky - I wasn't trying to say this is an official way of using it. {code} >>> spark._jvm.org.apache.spark.sql.avro.SchemaConverters.toSqlType(spark._jvm.org.apache.avro.Schema.Parser().parse("""{"type": >>> "int", "name": "fieldA"}""")).toString() u'SchemaType(IntegerType,false)' {code} Usually the signatures are matched between Scala and Python sides. I suspect that you'd open a function that takes JSON-formatted schema in Avro in PySpark side, right? > Expose org.apache.spark.sql.avro.SchemaConverters as developer API > -- > > Key: SPARK-27124 > URL: https://issues.apache.org/jira/browse/SPARK-27124 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Minor > > org.apache.spark.sql.avro.SchemaConverters provides extremely useful APIs to > convert schema between Spark SQL and avro. This is reachable from scala side > but not from pyspark. I suggest to add this as a developer API to ease > development for pyspark users. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27124) Expose org.apache.spark.sql.avro.SchemaConverters as developer API
[ https://issues.apache.org/jira/browse/SPARK-27124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789366#comment-16789366 ] Hyukjin Kwon commented on SPARK-27124: -- To me, I am not sure. How would you map the Avro schema in PySpark? This is reachable via Py4J, FWIW. Also, I'm personally skeptical about exposing those as APIs in general if there aren't strong use cases. > Expose org.apache.spark.sql.avro.SchemaConverters as developer API > -- > > Key: SPARK-27124 > URL: https://issues.apache.org/jira/browse/SPARK-27124 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Minor > > org.apache.spark.sql.avro.SchemaConverters provides extremely useful APIs to > convert schema between Spark SQL and avro. This is reachable from scala side > but not from pyspark. I suggest to add this as a developer API to ease > development for pyspark users. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27125) Add test suite for sql execution page
[ https://issues.apache.org/jira/browse/SPARK-27125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27125: Assignee: Apache Spark > Add test suite for sql execution page > - > > Key: SPARK-27125 > URL: https://issues.apache.org/jira/browse/SPARK-27125 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.0.0 >Reporter: Shahid K I >Assignee: Apache Spark >Priority: Minor > > Add test suite for sql execution page -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27124) Expose org.apache.spark.sql.avro.SchemaConverters as developer API
[ https://issues.apache.org/jira/browse/SPARK-27124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789376#comment-16789376 ] Gabor Somogyi commented on SPARK-27124: --- {quote}This is reachable via Py4J FWIW{quote} How exactly? Maybe the doc just has to be extended to highlight this functionality. From a use-case perspective, I've seen many users struggling with avro and reporting problems, but most of the time it was caused by a wrong schema. I would like to mitigate this somehow. > Expose org.apache.spark.sql.avro.SchemaConverters as developer API > -- > > Key: SPARK-27124 > URL: https://issues.apache.org/jira/browse/SPARK-27124 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Minor > > org.apache.spark.sql.avro.SchemaConverters provides extremely useful APIs to > convert schema between Spark SQL and avro. This is reachable from scala side > but not from pyspark. I suggest to add this as a developer API to ease > development for pyspark users. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27125) Add test suite for sql execution page
[ https://issues.apache.org/jira/browse/SPARK-27125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27125: Assignee: (was: Apache Spark) > Add test suite for sql execution page > - > > Key: SPARK-27125 > URL: https://issues.apache.org/jira/browse/SPARK-27125 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.0.0 >Reporter: Shahid K I >Priority: Minor > > Add test suite for sql execution page -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26879) Inconsistency in default column names for functions like inline and stack
[ https://issues.apache.org/jira/browse/SPARK-26879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26879: Assignee: (was: Apache Spark) > Inconsistency in default column names for functions like inline and stack > - > > Key: SPARK-26879 > URL: https://issues.apache.org/jira/browse/SPARK-26879 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Jash Gala >Priority: Minor > > In the Spark SQL functions definitions, `inline` uses col1, col2, etc. (i.e. > 1-indexed columns), while `stack` uses col0, col1, col2, etc. (i.e. 0-indexed > columns). > {code:title=spark-shell|borderStyle=solid} > scala> spark.sql("SELECT stack(2, 1, 2, 3)").show > +++ > |col0|col1| > +++ > | 1| 2| > | 3|null| > +++ > scala> spark.sql("SELECT inline_outer(array(struct(1, 'a'), struct(2, > 'b')))").show > +++ > |col1|col2| > +++ > | 1| a| > | 2| b| > +++ > {code} > This feels like an issue with consistency. As discussed on [PR > #23748|https://github.com/apache/spark/pull/23748], it might be a good idea > to standardize this to something specific (like zero-based indexing) for > these and other similar functions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26879) Inconsistency in default column names for functions like inline and stack
[ https://issues.apache.org/jira/browse/SPARK-26879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26879: Assignee: Apache Spark > Inconsistency in default column names for functions like inline and stack > - > > Key: SPARK-26879 > URL: https://issues.apache.org/jira/browse/SPARK-26879 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Jash Gala >Assignee: Apache Spark >Priority: Minor > > In the Spark SQL functions definitions, `inline` uses col1, col2, etc. (i.e. > 1-indexed columns), while `stack` uses col0, col1, col2, etc. (i.e. 0-indexed > columns). > {code:title=spark-shell|borderStyle=solid} > scala> spark.sql("SELECT stack(2, 1, 2, 3)").show > +++ > |col0|col1| > +++ > | 1| 2| > | 3|null| > +++ > scala> spark.sql("SELECT inline_outer(array(struct(1, 'a'), struct(2, > 'b')))").show > +++ > |col1|col2| > +++ > | 1| a| > | 2| b| > +++ > {code} > This feels like an issue with consistency. As discussed on [PR > #23748|https://github.com/apache/spark/pull/23748], it might be a good idea > to standardize this to something specific (like zero-based indexing) for > these and other similar functions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
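The inconsistency reported in SPARK-26879 comes down to the starting index used when default column names are generated. A tiny sketch of the two conventions (the helper {{default_col_names}} is illustrative only, not Spark's actual code):

```python
def default_col_names(n, start):
    """Generate n default column names beginning at the given index."""
    return ["col%d" % i for i in range(start, start + n)]

# stack() currently numbers from 0, inline() from 1 -- the mismatch
# discussed in the issue.
print(default_col_names(2, 0))  # ['col0', 'col1'] -- stack-style, zero-based
print(default_col_names(2, 1))  # ['col1', 'col2'] -- inline-style, one-based
```

Standardizing both functions on a single starting index (the PR discussion leans toward zero-based) would remove the mismatch.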
[jira] [Created] (SPARK-27125) Add test suite for sql execution page
Shahid K I created SPARK-27125: -- Summary: Add test suite for sql execution page Key: SPARK-27125 URL: https://issues.apache.org/jira/browse/SPARK-27125 Project: Spark Issue Type: Improvement Components: SQL, Web UI Affects Versions: 3.0.0 Reporter: Shahid K I Add test suite for sql execution page -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27125) Add test suite for sql execution page
[ https://issues.apache.org/jira/browse/SPARK-27125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789367#comment-16789367 ] Shahid K I commented on SPARK-27125: I will raise a PR > Add test suite for sql execution page > - > > Key: SPARK-27125 > URL: https://issues.apache.org/jira/browse/SPARK-27125 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.0.0 >Reporter: Shahid K I >Priority: Minor > > Add test suite for sql execution page -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27124) Expose org.apache.spark.sql.avro.SchemaConverters as developer API
[ https://issues.apache.org/jira/browse/SPARK-27124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789356#comment-16789356 ] Gabor Somogyi commented on SPARK-27124: --- I've already put together a POC, but what do you think [~hyukjin.kwon] [~Gengliang.Wang] [~cloud_fan]? > Expose org.apache.spark.sql.avro.SchemaConverters as developer API > -- > > Key: SPARK-27124 > URL: https://issues.apache.org/jira/browse/SPARK-27124 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Minor > > org.apache.spark.sql.avro.SchemaConverters provides extremely useful APIs to > convert schema between Spark SQL and avro. This is reachable from scala side > but not from pyspark. I suggest to add this as a developer API to ease > development for pyspark users. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27124) Expose org.apache.spark.sql.avro.SchemaConverters as developer API
Gabor Somogyi created SPARK-27124: - Summary: Expose org.apache.spark.sql.avro.SchemaConverters as developer API Key: SPARK-27124 URL: https://issues.apache.org/jira/browse/SPARK-27124 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 3.0.0 Reporter: Gabor Somogyi org.apache.spark.sql.avro.SchemaConverters provides extremely useful APIs to convert schema between Spark SQL and avro. This is reachable from scala side but not from pyspark. I suggest to add this as a developer API to ease development for pyspark users. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20075) Support classifier, packaging in Maven coordinates
[ https://issues.apache.org/jira/browse/SPARK-20075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789341#comment-16789341 ] Gabor Somogyi commented on SPARK-20075: --- [~kabhwan] Cool. I think we can both give it a try and share the experience, and maybe together we can sort it out. We'll see... > Support classifier, packaging in Maven coordinates > -- > > Key: SPARK-20075 > URL: https://issues.apache.org/jira/browse/SPARK-20075 > Project: Spark > Issue Type: Improvement > Components: Spark Shell, Spark Submit >Affects Versions: 2.1.0 >Reporter: Sean Owen >Priority: Minor > > Currently, it's possible to add dependencies to an app using its Maven > coordinates on the command line: {{group:artifact:version}}. However, really > Maven coordinates are 5-dimensional: > {{group:artifact:packaging:classifier:version}}. In some rare but real cases > it's important to be able to specify the classifier. And while we're at it > why not try to support packaging? > I have a WIP PR that I'll post soon. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20075) Support classifier, packaging in Maven coordinates
[ https://issues.apache.org/jira/browse/SPARK-20075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789338#comment-16789338 ] Jungtaek Lim commented on SPARK-20075: -- [~gsomogyi] Yeah, I have experience with Aether and it worked like a charm, but according to the long discussion in Sean's PR, it looks like it's not easy with Ivy. I'll also give it a try, but it may not be easy for me either, since some semantics between Maven and Ivy are different. > Support classifier, packaging in Maven coordinates > -- > > Key: SPARK-20075 > URL: https://issues.apache.org/jira/browse/SPARK-20075 > Project: Spark > Issue Type: Improvement > Components: Spark Shell, Spark Submit >Affects Versions: 2.1.0 >Reporter: Sean Owen >Priority: Minor > > Currently, it's possible to add dependencies to an app using its Maven > coordinates on the command line: {{group:artifact:version}}. However, really > Maven coordinates are 5-dimensional: > {{group:artifact:packaging:classifier:version}}. In some rare but real cases > it's important to be able to specify the classifier. And while we're at it > why not try to support packaging? > I have a WIP PR that I'll post soon. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20075) Support classifier, packaging in Maven coordinates
[ https://issues.apache.org/jira/browse/SPARK-20075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789329#comment-16789329 ]

Gabor Somogyi commented on SPARK-20075:
---------------------------------------

[~kabhwan] To continue the discussion started on SPARK-27044: as I proceeded with the implementation, I ran into similar issues to the ones Sean faced. I'm not saying I'm giving up; I just want to highlight that it's not a trivial issue, so feel free to create a PR if you've solved it. Whoever can go through and remove all the obstacles well deserves a commit. In the meantime, I'm trying out several things...
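The 5-dimensional coordinate format discussed in this issue can be sketched as a small parser. This is a purely illustrative sketch, not Spark's or Ivy's actual resolver code; the `MavenCoordinate` name and its shape are assumptions:

```scala
// Hypothetical sketch: parsing a Maven coordinate that may carry optional
// packaging and classifier parts. The accepted forms are
// group:artifact[:packaging[:classifier]]:version.
case class MavenCoordinate(
    group: String,
    artifact: String,
    packaging: Option[String],
    classifier: Option[String],
    version: String)

object MavenCoordinate {
  def parse(coord: String): MavenCoordinate = coord.split(":") match {
    // 3 parts: the classic group:artifact:version form
    case Array(g, a, v)       => MavenCoordinate(g, a, None, None, v)
    // 4 parts: packaging present, no classifier
    case Array(g, a, p, v)    => MavenCoordinate(g, a, Some(p), None, v)
    // 5 parts: full group:artifact:packaging:classifier:version
    case Array(g, a, p, c, v) => MavenCoordinate(g, a, Some(p), Some(c), v)
    case _ => throw new IllegalArgumentException(
      s"Expected group:artifact[:packaging[:classifier]]:version, got: $coord")
  }
}
```

For example, `MavenCoordinate.parse("org.example:lib:jar:tests:1.0")` would yield a classifier of `tests`, the case the issue says cannot currently be expressed on the command line.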
[jira] [Resolved] (SPARK-27096) Reconcile the join type support between data frame and sql interface
[ https://issues.apache.org/jira/browse/SPARK-27096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-27096.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 3.0.0

Issue resolved by pull request 23982
[https://github.com/apache/spark/pull/23982]

> Reconcile the join type support between data frame and sql interface
> --------------------------------------------------------------------
>
>                 Key: SPARK-27096
>                 URL: https://issues.apache.org/jira/browse/SPARK-27096
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Dilip Biswal
>            Assignee: Dilip Biswal
>            Priority: Minor
>             Fix For: 3.0.0
>
> Currently, in the grammar file, the joinType rule is defined as follows:
> {code:java}
> joinType
>     : INNER?
>     ...
>     | LEFT SEMI
>     | LEFT? ANTI
>     ;
> {code}
> The keyword LEFT is optional for ANTI join even though it's not optional for
> SEMI join. When using the DataFrame interface, the join type "anti" is not
> allowed; the allowed types for anti joins are "left_anti" or "leftanti". We
> should also make LEFT optional for SEMI join and allow the "semi" and "anti"
> join types from the DataFrame interface.
[jira] [Assigned] (SPARK-27096) Reconcile the join type support between data frame and sql interface
[ https://issues.apache.org/jira/browse/SPARK-27096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-27096:
-----------------------------------
    Assignee: Dilip Biswal
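The inconsistency this issue describes, where "left_anti" works but "anti" does not, can be illustrated with a simplified stand-alone sketch of join-type string resolution. This is an assumption-laden simplification, not Spark's actual `JoinType` code:

```scala
// Simplified sketch (not Spark's real implementation): the DataFrame API
// normalizes the join-type string by lowercasing and dropping underscores,
// then matches it against an accepted set. Before the fix described in this
// issue, "anti" and "semi" were absent from that set, so only the
// "left_anti"/"leftanti" spellings worked.
def resolveJoinType(typ: String): String =
  typ.toLowerCase.replace("_", "") match {
    case "inner"             => "Inner"
    case "leftsemi" | "semi" => "LeftSemi" // "semi" allowed after the fix
    case "leftanti" | "anti" => "LeftAnti" // "anti" allowed after the fix
    case other => throw new IllegalArgumentException(
      s"Unsupported join type: $other")
  }
```

With this normalization, `resolveJoinType("left_anti")`, `resolveJoinType("leftanti")`, and `resolveJoinType("anti")` all map to the same logical anti join, mirroring what the issue asks the DataFrame interface to accept.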
[jira] [Commented] (SPARK-25299) Use remote storage for persisting shuffle data
[ https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789189#comment-16789189 ]

Hu Ziqian commented on SPARK-25299:
-----------------------------------

Hi [~yifeih], your Google doc posted on 25/Feb/19 mainly talks about the new shuffle API, and the milestone is about implementing the existing shuffle with the new API. Has any further decision been made about which architecture will be used in the new shuffle service? I found there are 5 options in the [architecture discussion document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40]. Have we already chosen one of them as the candidate? Thank you.

> Use remote storage for persisting shuffle data
> ----------------------------------------------
>
>                 Key: SPARK-25299
>                 URL: https://issues.apache.org/jira/browse/SPARK-25299
>             Project: Spark
>          Issue Type: New Feature
>          Components: Shuffle
>    Affects Versions: 2.4.0
>            Reporter: Matt Cheah
>            Priority: Major
>
> In Spark, the shuffle primitive requires Spark executors to persist data to
> the local disk of the worker nodes. If executors crash, the external shuffle
> service can continue to serve the shuffle data that was written beyond the
> lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the
> external shuffle service is deployed on every worker node. The shuffle
> service shares local disk with the executors that run on its node.
>
> There are some shortcomings with the way shuffle is fundamentally implemented
> right now. Particularly:
> * If any external shuffle service process or node becomes unavailable, all
> applications that had an executor that ran on that node must recompute the
> shuffle blocks that were lost.
> * Similarly to the above, the external shuffle service must be kept running
> at all times, which may waste resources when no applications are using that
> shuffle service node.
> * Mounting local storage can prevent users from taking advantage of
> desirable isolation benefits from using containerized environments, like
> Kubernetes. We had an external shuffle service implementation in an early
> prototype of the Kubernetes backend, but it was rejected due to its strict
> requirement to be able to mount hostPath volumes or other persistent volume
> setups.
>
> In the following [architecture discussion
> document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40]
> (note: _not_ an SPIP), we brainstorm various high-level architectures for
> improving the external shuffle service in a way that addresses the above
> problems. The purpose of this umbrella JIRA is to promote additional
> discussion on how we can approach these problems, both at the architecture
> level and the implementation level. We anticipate filing sub-issues that
> break down the tasks that must be completed to achieve this goal.
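The remote-storage direction brainstormed above could be sketched as a pluggable block-store abstraction. This is purely illustrative: the `ShuffleBlockStore` trait and its method names are assumptions, not the API discussed in the design document:

```scala
import scala.collection.mutable

// Illustrative sketch only (names are assumptions): writing shuffle blocks
// through a storage abstraction decouples their lifetime from the executor
// and its local disk, which is the core idea behind remote shuffle storage.
trait ShuffleBlockStore {
  def writeBlock(shuffleId: Int, mapId: Int, reduceId: Int, data: Array[Byte]): Unit
  def readBlock(shuffleId: Int, mapId: Int, reduceId: Int): Array[Byte]
}

// In-memory stand-in for a remote backing store (e.g. HDFS or an object
// store would play this role in a real deployment).
class InMemoryBlockStore extends ShuffleBlockStore {
  private val blocks = mutable.Map.empty[(Int, Int, Int), Array[Byte]]

  def writeBlock(shuffleId: Int, mapId: Int, reduceId: Int, data: Array[Byte]): Unit =
    blocks((shuffleId, mapId, reduceId)) = data

  def readBlock(shuffleId: Int, mapId: Int, reduceId: Int): Array[Byte] =
    blocks((shuffleId, mapId, reduceId))
}
```

Under such an abstraction, a reducer fetches blocks from the store rather than from a shuffle service co-located with the (possibly dead) executor, which addresses the recomputation and always-on-service shortcomings listed in the description.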