[jira] [Commented] (SPARK-26103) OutOfMemory error with large query plans
[ https://issues.apache.org/jira/browse/SPARK-26103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16690815#comment-16690815 ]

Apache Spark commented on SPARK-26103:
--------------------------------------

User 'DaveDeCaprio' has created a pull request for this issue:
https://github.com/apache/spark/pull/23076

> OutOfMemory error with large query plans
> ----------------------------------------
>
>                 Key: SPARK-26103
>                 URL: https://issues.apache.org/jira/browse/SPARK-26103
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0, 2.3.1, 2.3.2
>         Environment: Amazon EMR 5.19
> 1 c5.4xlarge master instance
> 1 c5.4xlarge core instance
> 2 c5.4xlarge task instances
>            Reporter: Dave DeCaprio
>            Priority: Major
>
> Large query plans can cause OutOfMemory errors in the Spark driver.
> We are creating data frames that are not extremely large but contain lots of nested joins. These plans execute efficiently because of caching and partitioning, but the text versions of the generated query plans can be hundreds of megabytes. Running many of these in parallel causes our driver process to fail.
>
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>     at java.util.Arrays.copyOfRange(Arrays.java:2694)
>     at java.lang.String.<init>(String.java:203)
>     at java.lang.StringBuilder.toString(StringBuilder.java:405)
>     at scala.StringContext.standardInterpolator(StringContext.scala:125)
>     at scala.StringContext.s(StringContext.scala:90)
>     at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:70)
>     at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:52)
>     at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
>
> A similar error is reported in https://stackoverflow.com/questions/38307258/out-of-memory-error-when-writing-out-spark-dataframes-to-parquet-format
>
> Code exists to truncate the string if the number of output columns is larger than 25, but not if the rest of the query plan is huge.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
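The crux of the report is in its last paragraph: Spark caps the rendered column list at 25 columns but still materializes the rest of the plan text unbounded. One way to bound it is an appender that stops accumulating past a configurable limit. The sketch below is hypothetical (the class and constant names are not Spark's API), in the spirit of what the linked PR proposes:

```scala
// Hypothetical sketch: a size-bounded string builder for rendering plans.
// Once the cap is hit, further appends are counted but not stored, so a
// huge query plan cannot exhaust driver heap just to produce a string.
final class BoundedStringBuilder(maxLength: Int) {
  private val sb = new StringBuilder
  private var omitted = 0L // characters dropped after the cap was reached

  def append(part: String): Unit = {
    val room = maxLength - sb.length
    if (room >= part.length) sb.append(part)
    else {
      if (room > 0) sb.append(part.substring(0, room))
      omitted += math.max(0, part.length - math.max(room, 0))
    }
  }

  override def toString: String =
    if (omitted == 0) sb.toString
    else sb.toString + s"... [$omitted characters omitted]"
}
```

A plan renderer built on such an appender degrades to a truncated-but-useful string instead of an `OutOfMemoryError`, at the cost of losing the tail of very large plans.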
[jira] [Assigned] (SPARK-26103) OutOfMemory error with large query plans
[ https://issues.apache.org/jira/browse/SPARK-26103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26103:

    Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-26103) OutOfMemory error with large query plans
[ https://issues.apache.org/jira/browse/SPARK-26103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26103:

    Assignee: Apache Spark
[jira] [Commented] (SPARK-26084) AggregateExpression.references fails on unresolved expression trees
[ https://issues.apache.org/jira/browse/SPARK-26084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16690753#comment-16690753 ]

Apache Spark commented on SPARK-26084:
--------------------------------------

User 'ssimeonov' has created a pull request for this issue:
https://github.com/apache/spark/pull/23075

> AggregateExpression.references fails on unresolved expression trees
> -------------------------------------------------------------------
>
>                 Key: SPARK-26084
>                 URL: https://issues.apache.org/jira/browse/SPARK-26084
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Simeon Simeonov
>            Priority: Major
>              Labels: aggregate, regression, sql
>
> [SPARK-18394|https://issues.apache.org/jira/browse/SPARK-18394] introduced a stable ordering in {{AttributeSet.toSeq}} using expression IDs ([PR-18959|https://github.com/apache/spark/pull/18959/files#diff-75576f0ec7f9d8b5032000245217d233R128]) without noticing that {{AggregateExpression.references}} used {{AttributeSet.toSeq}} as a shortcut ([link|https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala#L132]). The net result is that {{AggregateExpression.references}} fails for unresolved aggregate functions.
>
> {code:scala}
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression(
>   org.apache.spark.sql.catalyst.expressions.aggregate.Sum(('x + 'y).expr),
>   mode = org.apache.spark.sql.catalyst.expressions.aggregate.Complete,
>   isDistinct = false
> ).references
> {code}
> fails with
> {code:scala}
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to exprId on unresolved object, tree: 'y
>   at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.exprId(unresolved.scala:104)
>   at org.apache.spark.sql.catalyst.expressions.AttributeSet$$anonfun$toSeq$2.apply(AttributeSet.scala:128)
>   at org.apache.spark.sql.catalyst.expressions.AttributeSet$$anonfun$toSeq$2.apply(AttributeSet.scala:128)
>   at scala.math.Ordering$$anon$5.compare(Ordering.scala:122)
>   at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
>   at java.util.TimSort.sort(TimSort.java:220)
>   at java.util.Arrays.sort(Arrays.java:1438)
>   at scala.collection.SeqLike$class.sorted(SeqLike.scala:648)
>   at scala.collection.AbstractSeq.sorted(Seq.scala:41)
>   at scala.collection.SeqLike$class.sortBy(SeqLike.scala:623)
>   at scala.collection.AbstractSeq.sortBy(Seq.scala:41)
>   at org.apache.spark.sql.catalyst.expressions.AttributeSet.toSeq(AttributeSet.scala:128)
>   at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.references(interfaces.scala:201)
> {code}
> The solution is to avoid calling {{toSeq}}, since ordering is not important in {{references}}, and to simplify (and speed up) the implementation to something like
> {code:scala}
> mode match {
>   case Partial | Complete => aggregateFunction.references
>   case PartialMerge | Final => AttributeSet(aggregateFunction.aggBufferAttributes)
> }
> {code}
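To see why the stable sort regressed `references`, here is a toy model (deliberately not Spark's classes; all names are illustrative): sorting by an id forces that id to be evaluated during comparisons, which unresolved nodes cannot do, while a plain set union never touches it.

```scala
// Toy model of the regression: Unresolved nodes throw on exprId, exactly
// like Spark's UnresolvedAttribute. Sorting by exprId evaluates it during
// element comparisons; returning the set unsorted never does.
sealed trait Attr { def name: String; def exprId: Long }

case class Resolved(name: String, exprId: Long) extends Attr

case class Unresolved(name: String) extends Attr {
  def exprId: Long = throw new UnsupportedOperationException(
    s"Invalid call to exprId on unresolved object, tree: '$name")
}

// Mirrors the buggy shortcut: references built via a sorted toSeq.
def referencesSorted(attrs: Set[Attr]): Seq[Attr] = attrs.toSeq.sortBy(_.exprId)

// Mirrors the proposed fix: no ordering needed, so no exprId evaluation.
def referencesUnsorted(attrs: Set[Attr]): Set[Attr] = attrs
```

With at least two attributes, the sort performs a comparison and blows up on the unresolved one; the unsorted version is safe for any mix of resolved and unresolved attributes.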
[jira] [Commented] (SPARK-26084) AggregateExpression.references fails on unresolved expression trees
[ https://issues.apache.org/jira/browse/SPARK-26084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16690752#comment-16690752 ]

Apache Spark commented on SPARK-26084:
--------------------------------------

User 'ssimeonov' has created a pull request for this issue:
https://github.com/apache/spark/pull/23075
[jira] [Assigned] (SPARK-26084) AggregateExpression.references fails on unresolved expression trees
[ https://issues.apache.org/jira/browse/SPARK-26084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26084:

    Assignee: Apache Spark
[jira] [Assigned] (SPARK-26084) AggregateExpression.references fails on unresolved expression trees
[ https://issues.apache.org/jira/browse/SPARK-26084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26084:

    Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-26084) AggregateExpression.references fails on unresolved expression trees
[ https://issues.apache.org/jira/browse/SPARK-26084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16690751#comment-16690751 ]

Simeon Simeonov commented on SPARK-26084:
-----------------------------------------

[~hvanhovell] done: https://github.com/apache/spark/pull/23075
[jira] [Assigned] (SPARK-19798) Query returns stale results when tables are modified on other sessions
[ https://issues.apache.org/jira/browse/SPARK-19798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19798:

    Assignee: (was: Apache Spark)

> Query returns stale results when tables are modified on other sessions
> ----------------------------------------------------------------------
>
>                 Key: SPARK-19798
>                 URL: https://issues.apache.org/jira/browse/SPARK-19798
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Giambattista
>            Priority: Major
>
> I observed the problem on the master branch with the Thrift server in multisession mode (the default), but I was able to replicate it with spark-shell as well (see the sequence below).
> I observed cases where changes made in a session (table inserts, table renames) are not visible to other derived sessions (created with session.newSession).
> The problem seems to be that each session has its own tableRelationCache, and it does not get refreshed.
> IMO tableRelationCache should be shared in sharedState, maybe in the cacheManager, so that cache refreshes for data that is not session-specific (such as temporary tables) are centralized.
>
> --- Spark shell script
> val spark2 = spark.newSession
> spark.sql("CREATE TABLE test (a int) using parquet")
> spark2.sql("select * from test").show // OK, returns empty
> spark.sql("select * from test").show // OK, returns empty
> spark.sql("insert into TABLE test values 1,2,3")
> spark2.sql("select * from test").show // ERROR, returns empty
> spark.sql("select * from test").show // OK, returns 3,2,1
> spark.sql("create table test2 (a int) using parquet")
> spark.sql("insert into TABLE test2 values 4,5,6")
> spark2.sql("select * from test2").show // OK, returns 6,4,5
> spark.sql("select * from test2").show // OK, returns 6,4,5
> spark.sql("alter table test rename to test3")
> spark.sql("alter table test2 rename to test")
> spark.sql("alter table test3 rename to test2")
> spark2.sql("select * from test").show // ERROR, returns empty
> spark.sql("select * from test").show // OK, returns 6,4,5
> spark2.sql("select * from test2").show // ERROR, throws java.io.FileNotFoundException
> spark.sql("select * from test2").show // OK, returns 3,1,2
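The failure mode reduces to a per-session cache that only the writing session invalidates. The toy model below (plain Scala, not Spark's classes) reproduces that shape: two sessions over one shared catalog, each with its own relation cache, plus a manual `refreshTable` escape hatch analogous to Spark's `spark.catalog.refreshTable`.

```scala
import scala.collection.mutable

// Shared metastore: the source of truth both sessions write through.
final class Catalog { val tables = mutable.Map.empty[String, Seq[Int]] }

// Each session keeps a private relation cache over the shared catalog,
// mirroring the per-session tableRelationCache described in the issue.
final class Session(catalog: Catalog) {
  private val relationCache = mutable.Map.empty[String, Seq[Int]]

  def insert(table: String, rows: Seq[Int]): Unit = {
    catalog.tables(table) = catalog.tables.getOrElse(table, Seq.empty) ++ rows
    relationCache.remove(table) // the writer invalidates only ITS OWN cache
  }

  def select(table: String): Seq[Int] =
    relationCache.getOrElseUpdate(table, catalog.tables.getOrElse(table, Seq.empty))

  // Manual workaround, analogous to spark.catalog.refreshTable(name).
  def refreshTable(table: String): Unit = relationCache.remove(table)
}
```

Sharing one cache in `sharedState`, as the reporter suggests, removes the stale window entirely; the per-session design forces every reader to know when to refresh.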
[jira] [Assigned] (SPARK-19798) Query returns stale results when tables are modified on other sessions
[ https://issues.apache.org/jira/browse/SPARK-19798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19798:

    Assignee: Apache Spark
[jira] [Commented] (SPARK-19798) Query returns stale results when tables are modified on other sessions
[ https://issues.apache.org/jira/browse/SPARK-19798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16690731#comment-16690731 ]

Apache Spark commented on SPARK-19798:
--------------------------------------

User 'gbloisi' has created a pull request for this issue:
https://github.com/apache/spark/pull/23074
[jira] [Assigned] (SPARK-26104) make pci devices visible to task scheduler
[ https://issues.apache.org/jira/browse/SPARK-26104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26104:

    Assignee: Apache Spark

> make pci devices visible to task scheduler
> ------------------------------------------
>
>                 Key: SPARK-26104
>                 URL: https://issues.apache.org/jira/browse/SPARK-26104
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Chen Qin
>            Assignee: Apache Spark
>            Priority: Major
>              Labels: Hydrogen
>
> Spark task scheduling has long considered CPU only: depending on how many vcores each executor has at a given moment, tasks are scheduled as long as enough vcores become available.
> Moving to deep learning use cases, the fundamental computation and processing unit switches from CPU to GPU/FPGA + CPU, which moves data in and out of GPU memory.
> Deep learning frameworks built on top of GPU fleets require pinning a task to a fixed number of GPUs, which Spark doesn't support yet. E.g. a Horovod task requires 2 GPUs running uninterrupted until it finishes, regardless of CPU availability in the executor. In Uber's Peloton executor scheduler, the number of cores available could be more than what the user asked for, because the executor might get over-provisioned.
> Without definitive occupancy of PCI devices (/gpu1, /gpu2), such workloads may run into unexpected states.
>
> Related JIRAs on allocating executor containers with GPU resources serve as bootstrap-phase usage:
> SPARK-19320 (Mesos), SPARK-24491 (K8s), SPARK-20327 (YARN)
> The existing SPIP, Accelerator Aware Task Scheduling For Spark (SPARK-24615), is compatible with this design; the approach differs a bit in that it tracks utilization of PCI devices, where a customized task scheduler could either fall back to a "best to have" approach or implement the "must have" approach stated above.
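The "must have" semantics the proposal describes can be sketched as atomic device-slot accounting: a task launches only if its full GPU requirement can be reserved at once, and otherwise waits rather than running degraded. The sketch below uses hypothetical names and is not Spark's scheduler API.

```scala
// Hypothetical sketch of "must have" device accounting: either the whole
// GPU requirement is reserved atomically, or nothing is, so a task never
// starts with fewer devices than it pinned.
final class DeviceSlots(devices: Seq[String]) {
  private var free = devices.toSet

  // Reserve exactly n devices, or None if not enough are free right now.
  def tryAcquire(n: Int): Option[Set[String]] = synchronized {
    if (free.size < n) None
    else {
      val picked = free.take(n)
      free --= picked
      Some(picked)
    }
  }

  // Return devices to the pool when the task finishes.
  def release(reserved: Set[String]): Unit = synchronized { free ++= reserved }
}
```

A "best to have" fallback would instead accept a partial reservation; the all-or-nothing `tryAcquire` is what prevents the "unexpected states" the issue warns about.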
[jira] [Commented] (SPARK-26104) make pci devices visible to task scheduler
[ https://issues.apache.org/jira/browse/SPARK-26104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690726#comment-16690726 ] Apache Spark commented on SPARK-26104: -- User 'chenqin' has created a pull request for this issue: https://github.com/apache/spark/pull/23073 > make pci devices visible to task scheduler > -- > > Key: SPARK-26104 > URL: https://issues.apache.org/jira/browse/SPARK-26104 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Chen Qin >Priority: Major > Labels: Hydrogen > > Spark task scheduling has long considered CPUs only: depending on how many > vcores each executor has at a given moment, tasks are scheduled as soon as > enough vcores become available. > Moving to deep learning use cases, the fundamental computation and processing > unit has shifted from CPU to GPU/FPGA + CPU, which moves data in and out of > GPU memory. > Deep learning frameworks built on top of GPU fleets require pinning a task to > a fixed number of GPUs, which Spark does not support yet. E.g. a Horovod task > requires 2 GPUs running uninterrupted until it finishes, regardless of CPU > availability on the executor. In Uber's Peloton executor scheduler, the number > of available cores can be more than what the user asked for, because it might > get over-provisioned. > Without exclusive occupancy of PCI devices (/gpu1, /gpu2), such workloads may > run into unexpected states. > > Related JIRAs on allocating executor containers with GPU resources, which > serve as the bootstrap phase: > SPARK-19320 (Mesos), SPARK-24491 (K8s), SPARK-20327 (YARN) > The existing SPIP, Accelerator Aware Task Scheduling For Spark (SPARK-24615), > is compatible with this design; the approach here is a bit different, as it > tracks utilization of PCI devices, so a customized task scheduler could either > fall back to a "best to have" approach or implement the "must have" approach > stated above. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26104) make pci devices visible to task scheduler
[ https://issues.apache.org/jira/browse/SPARK-26104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26104: Assignee: (was: Apache Spark) > make pci devices visible to task scheduler > -- > > Key: SPARK-26104 > URL: https://issues.apache.org/jira/browse/SPARK-26104 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Chen Qin >Priority: Major > Labels: Hydrogen > > Spark task scheduling has long considered CPUs only: depending on how many > vcores each executor has at a given moment, tasks are scheduled as soon as > enough vcores become available. > Moving to deep learning use cases, the fundamental computation and processing > unit has shifted from CPU to GPU/FPGA + CPU, which moves data in and out of > GPU memory. > Deep learning frameworks built on top of GPU fleets require pinning a task to > a fixed number of GPUs, which Spark does not support yet. E.g. a Horovod task > requires 2 GPUs running uninterrupted until it finishes, regardless of CPU > availability on the executor. In Uber's Peloton executor scheduler, the number > of available cores can be more than what the user asked for, because it might > get over-provisioned. > Without exclusive occupancy of PCI devices (/gpu1, /gpu2), such workloads may > run into unexpected states. > > Related JIRAs on allocating executor containers with GPU resources, which > serve as the bootstrap phase: > SPARK-19320 (Mesos), SPARK-24491 (K8s), SPARK-20327 (YARN) > The existing SPIP, Accelerator Aware Task Scheduling For Spark (SPARK-24615), > is compatible with this design; the approach here is a bit different, as it > tracks utilization of PCI devices, so a customized task scheduler could either > fall back to a "best to have" approach or implement the "must have" approach > stated above. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26104) make pci devices visible to task scheduler
[ https://issues.apache.org/jira/browse/SPARK-26104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690725#comment-16690725 ] Apache Spark commented on SPARK-26104: -- User 'chenqin' has created a pull request for this issue: https://github.com/apache/spark/pull/23073 > make pci devices visible to task scheduler > -- > > Key: SPARK-26104 > URL: https://issues.apache.org/jira/browse/SPARK-26104 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Chen Qin >Priority: Major > Labels: Hydrogen > > Spark task scheduling has long considered CPUs only: depending on how many > vcores each executor has at a given moment, tasks are scheduled as soon as > enough vcores become available. > Moving to deep learning use cases, the fundamental computation and processing > unit has shifted from CPU to GPU/FPGA + CPU, which moves data in and out of > GPU memory. > Deep learning frameworks built on top of GPU fleets require pinning a task to > a fixed number of GPUs, which Spark does not support yet. E.g. a Horovod task > requires 2 GPUs running uninterrupted until it finishes, regardless of CPU > availability on the executor. In Uber's Peloton executor scheduler, the number > of available cores can be more than what the user asked for, because it might > get over-provisioned. > Without exclusive occupancy of PCI devices (/gpu1, /gpu2), such workloads may > run into unexpected states. > > Related JIRAs on allocating executor containers with GPU resources, which > serve as the bootstrap phase: > SPARK-19320 (Mesos), SPARK-24491 (K8s), SPARK-20327 (YARN) > The existing SPIP, Accelerator Aware Task Scheduling For Spark (SPARK-24615), > is compatible with this design; the approach here is a bit different, as it > tracks utilization of PCI devices, so a customized task scheduler could either > fall back to a "best to have" approach or implement the "must have" approach > stated above. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26104) make pci devices visible to task scheduler
[ https://issues.apache.org/jira/browse/SPARK-26104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Qin updated SPARK-26104: - Labels: Hydrogen (was: ) > make pci devices visible to task scheduler > -- > > Key: SPARK-26104 > URL: https://issues.apache.org/jira/browse/SPARK-26104 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Chen Qin >Priority: Major > Labels: Hydrogen > > Spark task scheduling has long considered CPUs only: depending on how many > vcores each executor has at a given moment, tasks are scheduled as soon as > enough vcores become available. > Moving to deep learning use cases, the fundamental computation and processing > unit has shifted from CPU to GPU/FPGA + CPU, which moves data in and out of > GPU memory. > Deep learning frameworks built on top of GPU fleets require pinning a task to > a fixed number of GPUs, which Spark does not support yet. E.g. a Horovod task > requires 2 GPUs running uninterrupted until it finishes, regardless of CPU > availability on the executor. In Uber's Peloton executor scheduler, the number > of available cores can be more than what the user asked for, because it might > get over-provisioned. > Without exclusive occupancy of PCI devices (/gpu1, /gpu2), such workloads may > run into unexpected states. > > Related JIRAs on allocating executor containers with GPU resources, which > serve as the bootstrap phase: > SPARK-19320 (Mesos), SPARK-24491 (K8s), SPARK-20327 (YARN) > The existing SPIP, Accelerator Aware Task Scheduling For Spark (SPARK-24615), > is compatible with this design; the approach here is a bit different, as it > tracks utilization of PCI devices, so a customized task scheduler could either > fall back to a "best to have" approach or implement the "must have" approach > stated above. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26104) make pci devices visible to task scheduler
Chen Qin created SPARK-26104: Summary: make pci devices visible to task scheduler Key: SPARK-26104 URL: https://issues.apache.org/jira/browse/SPARK-26104 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.0 Reporter: Chen Qin Spark task scheduling has long considered CPUs only: depending on how many vcores each executor has at a given moment, tasks are scheduled as soon as enough vcores become available. Moving to deep learning use cases, the fundamental computation and processing unit has shifted from CPU to GPU/FPGA + CPU, which moves data in and out of GPU memory. Deep learning frameworks built on top of GPU fleets require pinning a task to a fixed number of GPUs, which Spark does not support yet. E.g. a Horovod task requires 2 GPUs running uninterrupted until it finishes, regardless of CPU availability on the executor. In Uber's Peloton executor scheduler, the number of available cores can be more than what the user asked for, because it might get over-provisioned. Without exclusive occupancy of PCI devices (/gpu1, /gpu2), such workloads may run into unexpected states. Related JIRAs on allocating executor containers with GPU resources, which serve as the bootstrap phase: SPARK-19320 (Mesos), SPARK-24491 (K8s), SPARK-20327 (YARN) The existing SPIP, Accelerator Aware Task Scheduling For Spark (SPARK-24615), is compatible with this design; the approach here is a bit different, as it tracks utilization of PCI devices, so a customized task scheduler could either fall back to a "best to have" approach or implement the "must have" approach stated above. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
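The "must have" scheduling mode proposed above can be sketched as a simple offer check: a task is placed only if the executor can hand it the exact number of free PCI devices it requires. The following is a hypothetical illustration in plain Java; `offer`, its parameters, and the null-on-refusal convention are invented for this sketch and are not Spark's API.

```java
// Hypothetical "must have" device check: the task gets its devices
// or nothing at all. A "best to have" scheduler would instead accept
// a partial assignment. All names here are illustrative.
import java.util.ArrayList;
import java.util.List;

class DeviceScheduler {
    // Returns the devices assigned to the task, or null when the
    // "must have" requirement cannot be met on this executor right now.
    static List<String> offer(int freeCpus, List<String> freeGpus,
                              int wantCpus, int wantGpus) {
        if (freeCpus < wantCpus || freeGpus.size() < wantGpus) {
            return null; // refuse the offer; the task waits for a better one
        }
        return new ArrayList<>(freeGpus.subList(0, wantGpus));
    }
}
```

A task asking for 2 GPUs on an executor with only one free device is refused outright rather than started with partial resources, which is the failure mode the issue describes.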
[jira] [Commented] (SPARK-19827) spark.ml R API for PIC
[ https://issues.apache.org/jira/browse/SPARK-19827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690709#comment-16690709 ] Apache Spark commented on SPARK-19827: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/23072 > spark.ml R API for PIC > -- > > Key: SPARK-19827 > URL: https://issues.apache.org/jira/browse/SPARK-19827 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19827) spark.ml R API for PIC
[ https://issues.apache.org/jira/browse/SPARK-19827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690708#comment-16690708 ] Apache Spark commented on SPARK-19827: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/23072 > spark.ml R API for PIC > -- > > Key: SPARK-19827 > URL: https://issues.apache.org/jira/browse/SPARK-19827 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19827) spark.ml R API for PIC
[ https://issues.apache.org/jira/browse/SPARK-19827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19827: Assignee: (was: Apache Spark) > spark.ml R API for PIC > -- > > Key: SPARK-19827 > URL: https://issues.apache.org/jira/browse/SPARK-19827 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19827) spark.ml R API for PIC
[ https://issues.apache.org/jira/browse/SPARK-19827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19827: Assignee: Apache Spark > spark.ml R API for PIC > -- > > Key: SPARK-19827 > URL: https://issues.apache.org/jira/browse/SPARK-19827 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26103) OutOfMemory error with large query plans
Dave DeCaprio created SPARK-26103: - Summary: OutOfMemory error with large query plans Key: SPARK-26103 URL: https://issues.apache.org/jira/browse/SPARK-26103 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.2, 2.3.1, 2.3.0 Environment: Amazon EMR 5.19 1 c5.4xlarge master instance 1 c5.4xlarge core instance 2 c5.4xlarge task instances Reporter: Dave DeCaprio Large query plans can cause OutOfMemory errors in the Spark driver. We are creating data frames that are not extremely large but contain lots of nested joins. These plans execute efficiently because of caching and partitioning, but the text version of the query plans generated can be hundreds of megabytes. Running many of these in parallel causes our driver process to fail. Exception in thread "main" java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOfRange(Arrays.java:2694) at java.lang.String.<init>(String.java:203) at java.lang.StringBuilder.toString(StringBuilder.java:405) at scala.StringContext.standardInterpolator(StringContext.scala:125) at scala.StringContext.s(StringContext.scala:90) at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:70) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:52) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108) A similar error is reported in [https://stackoverflow.com/questions/38307258/out-of-memory-error-when-writing-out-spark-dataframes-to-parquet-format] Code exists to truncate the string if the number of output columns is larger than 25, but not if the rest of the query plan is huge. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
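The failure mode above is a plan-text string growing without bound inside a StringBuilder. A natural mitigation is to cap the length while building the text, as in this self-contained sketch; the helper name `truncatedPlanString` and the 8 MB cap are illustrative choices for this sketch, not the code in the linked pull request.

```java
// Sketch of a length-capped builder for query-plan text: stop
// appending before the builder can exhaust the heap. The cap and
// names are illustrative, not Spark's actual implementation.
import java.util.Iterator;

class PlanStrings {
    static final int MAX_PLAN_LENGTH = 8 * 1024 * 1024; // cap plan text at 8 MB

    static String truncatedPlanString(Iterator<String> lines) {
        StringBuilder sb = new StringBuilder();
        while (lines.hasNext()) {
            String line = lines.next();
            if (sb.length() + line.length() + 1 > MAX_PLAN_LENGTH) {
                sb.append("... [plan truncated]\n"); // give up early instead of OOM-ing
                break;
            }
            sb.append(line).append('\n');
        }
        return sb.toString();
    }
}
```

The point of building from an iterator of lines is that the full plan text never has to exist at once: a hundreds-of-megabytes plan costs at most the cap.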
[jira] [Assigned] (SPARK-26102) Common CSV/JSON functions tests
[ https://issues.apache.org/jira/browse/SPARK-26102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26102: Assignee: Apache Spark > Common CSV/JSON functions tests > --- > > Key: SPARK-26102 > URL: https://issues.apache.org/jira/browse/SPARK-26102 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor > > *CsvFunctionsSuite* and *JsonFunctionsSuite* have similar tests. These > common tests should be extracted to a shared place. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26102) Common CSV/JSON functions tests
[ https://issues.apache.org/jira/browse/SPARK-26102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26102: Assignee: (was: Apache Spark) > Common CSV/JSON functions tests > --- > > Key: SPARK-26102 > URL: https://issues.apache.org/jira/browse/SPARK-26102 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Minor > > *CsvFunctionsSuite* and *JsonFunctionsSuite* have similar tests. These > common tests should be extracted to a shared place. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26102) Common CSV/JSON functions tests
[ https://issues.apache.org/jira/browse/SPARK-26102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690653#comment-16690653 ] Apache Spark commented on SPARK-26102: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/23071 > Common CSV/JSON functions tests > --- > > Key: SPARK-26102 > URL: https://issues.apache.org/jira/browse/SPARK-26102 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Minor > > *CsvFunctionsSuite* and *JsonFunctionsSuite* have similar tests. These > common tests should be extracted to a shared place. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26102) Common CSV/JSON functions tests
Maxim Gekk created SPARK-26102: -- Summary: Common CSV/JSON functions tests Key: SPARK-26102 URL: https://issues.apache.org/jira/browse/SPARK-26102 Project: Spark Issue Type: Test Components: SQL Affects Versions: 2.4.0 Reporter: Maxim Gekk *CsvFunctionsSuite* and *JsonFunctionsSuite* have similar tests. These common tests should be extracted to a shared place. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
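A common shape for this kind of refactoring is a shared base that each format-specific suite plugs into. The following toy sketch shows the pattern in plain Java; all names are hypothetical, and the real suites are ScalaTest suites operating on Spark DataFrames, not string parsers.

```java
// Toy sketch of extracting shared test logic that both a CSV-flavored
// and a JSON-flavored suite reuse: the shared expectation lives in one
// place, each suite supplies only its format-specific pieces.
import java.util.Arrays;
import java.util.List;

abstract class CommonFunctionsTests {
    abstract String sampleInput();               // format-specific sample row
    abstract List<String> parseRow(String line); // format-specific parser

    // Shared check: identical expectation for every format.
    final boolean commonTestsPass() {
        return parseRow(sampleInput()).equals(Arrays.asList("a", "b"));
    }
}

class CsvLikeSuite extends CommonFunctionsTests {
    String sampleInput() { return "a,b"; }
    List<String> parseRow(String line) { return Arrays.asList(line.split(",")); }
}

class JsonLikeSuite extends CommonFunctionsTests {
    String sampleInput() { return "[\"a\",\"b\"]"; }
    List<String> parseRow(String line) {
        String body = line.substring(1, line.length() - 1); // strip [ ]
        String[] parts = body.split(",");
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].replace("\"", ""); // strip quotes
        }
        return Arrays.asList(parts);
    }
}
```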
[jira] [Updated] (SPARK-26101) Spark Pipe() executes the external app by yarn username not the current username
[ https://issues.apache.org/jira/browse/SPARK-26101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maziyar PANAHI updated SPARK-26101: --- Summary: Spark Pipe() executes the external app by yarn username not the current username (was: Spark Pipe() executes the external app by yarn user not the real user) > Spark Pipe() executes the external app by yarn username not the current > username > > > Key: SPARK-26101 > URL: https://issues.apache.org/jira/browse/SPARK-26101 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.3.0 >Reporter: Maziyar PANAHI >Priority: Major > > Hello, > I am using *Spark 2.3.0.cloudera3* on a Cloudera cluster. When I start my Spark > session (Zeppelin, Shell, or spark-submit), my real username is being > impersonated successfully. That allows YARN to use the right queue based on > the username, and HDFS to apply the right permissions. (All of this works > perfectly, meaning the cluster has been set up and configured for > user impersonation.) > Example (running Spark as user panahi with YARN as master): > {code:java} > > 18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls to: panahi > 18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls to: panahi > 18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls groups to: > 18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls groups to: > 18/11/17 13:55:47 INFO spark.SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(mpanahi); groups > with view permissions: Set(); > users with modify permissions: Set(panahi); groups with modify permissions: > Set() > ... 
> 18/11/17 13:55:52 INFO yarn.Client: > client token: N/A > diagnostics: N/A > ApplicationMaster host: N/A > ApplicationMaster RPC port: -1 > queue: root.multivac > start time: 1542459353040 > final status: UNDEFINED > tracking URL: > http://hadoop-master-1:8088/proxy/application_1542456252041_0006/ > user: panahi > {code} > > However, when I use *Spark RDD Pipe()*, it is executed as the `*yarn*` user. > This makes it impossible to use an external app, such as a C/C++ application, > that needs read/write access to HDFS, because the `*yarn*` user does not have > permissions on the user's directory. (Executing all external apps as the yarn > username also raises other security and resource-management issues.) > *How to produce this issue:* > {code:java} > val test = sc.parallelize(Seq("test user")).repartition(1) > val piped = test.pipe(Seq("whoami")) > val c = piped.collect() > result: > test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at repartition > at <console>:37 piped: org.apache.spark.rdd.RDD[String] = PipedRDD[27] at > pipe at <console>:37 c: Array[String] = Array(yarn) > {code} > > I believe that since Spark is the actor that invokes this execution inside the > YARN cluster, Spark needs to respect the actual/current username. Or maybe > there is another config for impersonation between Spark and YARN in this > situation, but I haven't found one. > > Many thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
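For context on why the piped process reports `yarn`: with YARN's DefaultContainerExecutor, containers run as the NodeManager's own user, so anything an executor forks (including Pipe() subprocesses) inherits that user; running containers as the submitting user requires the LinuxContainerExecutor. A hedged sketch of the relevant yarn-site.xml fragment, assuming a stock Hadoop setup (exact settings vary by distribution, and this is a cluster-side configuration, not a Spark fix):

```xml
<!-- Illustrative yarn-site.xml fragment: run containers as the
     submitting user instead of the yarn user. Settings vary by
     distribution; secure clusters also require Kerberos. -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users</name>
  <value>false</value>
</property>
```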
[jira] [Commented] (SPARK-23410) Unable to read jsons in charset different from UTF-8
[ https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690616#comment-16690616 ] Maxim Gekk commented on SPARK-23410: [~x1q1j1] Encodings different from UTF-8 (except UTF-16 and UTF-32 with BOMs) are already supported. > Unable to read jsons in charset different from UTF-8 > > > Key: SPARK-23410 > URL: https://issues.apache.org/jira/browse/SPARK-23410 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Priority: Major > Attachments: utf16WithBOM.json > > > Currently the JSON parser is forced to read JSON files in UTF-8. Such > behavior breaks backward compatibility with Spark 2.2.1 and previous versions, > which could read JSON files in UTF-16, UTF-32 and other encodings thanks to > the auto-detection mechanism of the Jackson library. Users need to get back > the possibility to read JSON files in a specified charset and/or to detect > the charset automatically, as before. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
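The BOM caveat in the comment above can be illustrated without Spark at all: Java's "UTF-16" charset consumes a byte-order mark and picks the right byte order, while decoding the same bytes as UTF-8 produces garbage. A self-contained sketch using only `java.nio` (this is not Spark's JSON reader):

```java
// Why a UTF-16 file with a BOM cannot simply be decoded as UTF-8.
import java.nio.charset.StandardCharsets;

class BomDemo {
    // "{}" encoded as UTF-16 with a big-endian BOM (0xFE 0xFF).
    static final byte[] UTF16_BYTES = {
        (byte) 0xFE, (byte) 0xFF, 0x00, '{', 0x00, '}'
    };

    static String decodedAsUtf16() {
        // The UTF-16 decoder consumes the BOM and picks big-endian order.
        return new String(UTF16_BYTES, StandardCharsets.UTF_16);
    }

    static String decodedAsUtf8() {
        // The same bytes read as UTF-8: NULs plus replacement characters.
        return new String(UTF16_BYTES, StandardCharsets.UTF_8);
    }
}
```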
[jira] [Assigned] (SPARK-26006) mllib Prefixspan
[ https://issues.apache.org/jira/browse/SPARK-26006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-26006: - Assignee: shahid > mllib Prefixspan > > > Key: SPARK-26006 > URL: https://issues.apache.org/jira/browse/SPARK-26006 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.3.0 > Environment: Unit test running on windows >Reporter: idan Levi >Assignee: shahid >Priority: Major > Fix For: 3.0.0 > > > MLlib's PrefixSpan run method: the cached RDD stays in cache. > val dataInternalRepr = toDatabaseInternalRepr(data, itemToInt) > .persist(StorageLevel.MEMORY_AND_DISK) > After run is completed, the RDD remains in cache. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25959) Difference in featureImportances results on computed vs saved models
[ https://issues.apache.org/jira/browse/SPARK-25959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-25959. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 22986 [https://github.com/apache/spark/pull/22986] > Difference in featureImportances results on computed vs saved models > > > Key: SPARK-25959 > URL: https://issues.apache.org/jira/browse/SPARK-25959 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.2.0 >Reporter: Suraj Nayak >Assignee: Marco Gaido >Priority: Major > Fix For: 3.0.0 > > > I tried to implement GBT and found that the feature importances computed while > the model was fit differ from those of the same model after it was saved to > storage and loaded back. > > I also found that once the persisted model is loaded, saved again, and > reloaded, the feature importances remain the same. > > Not sure if it's a bug in storing and reading the model the first time, or if > I am missing some parameter that needs to be set before saving the model (thus > the model picks up some defaults, causing the feature importances to change) > > *Below is the test code:* > val testDF = Seq( > (1, 3, 2, 1, 1), > (3, 2, 1, 2, 0), > (2, 2, 1, 1, 0), > (3, 4, 2, 2, 0), > (2, 2, 1, 3, 1) > ).toDF("a", "b", "c", "d", "e") > val featureColumns = testDF.columns.filter(_ != "e") > // Assemble the features into a vector > val assembler = new > VectorAssembler().setInputCols(featureColumns).setOutputCol("features") > // Transform the data to get the feature data set > val featureDF = assembler.transform(testDF) > // Train a GBT model. 
> val gbt = new GBTClassifier() > .setLabelCol("e") > .setFeaturesCol("features") > .setMaxDepth(2) > .setMaxBins(5) > .setMaxIter(10) > .setSeed(10) > .fit(featureDF) > gbt.transform(featureDF).show(false) > // Write out the model > featureColumns.zip(gbt.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println) > /* Prints > (d,0.5931875075767403) > (a,0.3747184548362353) > (b,0.03209403758702444) > (c,0.0) > */ > gbt.write.overwrite().save("file:///tmp/test123") > println("Reading model again") > val gbtload = GBTClassificationModel.load("file:///tmp/test123") > featureColumns.zip(gbtload.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println) > /* > Prints > (d,0.6455841215290767) > (a,0.3316126797964181) > (b,0.022803198674505094) > (c,0.0) > */ > gbtload.write.overwrite().save("file:///tmp/test123_rewrite") > val gbtload2 = GBTClassificationModel.load("file:///tmp/test123_rewrite") > featureColumns.zip(gbtload2.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println) > /* prints > (d,0.6455841215290767) > (a,0.3316126797964181) > (b,0.022803198674505094) > (c,0.0) > */ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25959) Difference in featureImportances results on computed vs saved models
[ https://issues.apache.org/jira/browse/SPARK-25959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-25959: - Assignee: Marco Gaido > Difference in featureImportances results on computed vs saved models > > > Key: SPARK-25959 > URL: https://issues.apache.org/jira/browse/SPARK-25959 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.2.0 >Reporter: Suraj Nayak >Assignee: Marco Gaido >Priority: Major > Fix For: 3.0.0 > > > I tried to implement GBT and found that the feature importances computed while > the model was fit differ from those of the same model after it was saved to > storage and loaded back. > > I also found that once the persisted model is loaded, saved again, and > reloaded, the feature importances remain the same. > > Not sure if it's a bug in storing and reading the model the first time, or if > I am missing some parameter that needs to be set before saving the model (thus > the model picks up some defaults, causing the feature importances to change) > > *Below is the test code:* > val testDF = Seq( > (1, 3, 2, 1, 1), > (3, 2, 1, 2, 0), > (2, 2, 1, 1, 0), > (3, 4, 2, 2, 0), > (2, 2, 1, 3, 1) > ).toDF("a", "b", "c", "d", "e") > val featureColumns = testDF.columns.filter(_ != "e") > // Assemble the features into a vector > val assembler = new > VectorAssembler().setInputCols(featureColumns).setOutputCol("features") > // Transform the data to get the feature data set > val featureDF = assembler.transform(testDF) > // Train a GBT model. 
> val gbt = new GBTClassifier() > .setLabelCol("e") > .setFeaturesCol("features") > .setMaxDepth(2) > .setMaxBins(5) > .setMaxIter(10) > .setSeed(10) > .fit(featureDF) > gbt.transform(featureDF).show(false) > // Write out the model > featureColumns.zip(gbt.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println) > /* Prints > (d,0.5931875075767403) > (a,0.3747184548362353) > (b,0.03209403758702444) > (c,0.0) > */ > gbt.write.overwrite().save("file:///tmp/test123") > println("Reading model again") > val gbtload = GBTClassificationModel.load("file:///tmp/test123") > featureColumns.zip(gbtload.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println) > /* > Prints > (d,0.6455841215290767) > (a,0.3316126797964181) > (b,0.022803198674505094) > (c,0.0) > */ > gbtload.write.overwrite().save("file:///tmp/test123_rewrite") > val gbtload2 = GBTClassificationModel.load("file:///tmp/test123_rewrite") > featureColumns.zip(gbtload2.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println) > /* prints > (d,0.6455841215290767) > (a,0.3316126797964181) > (b,0.022803198674505094) > (c,0.0) > */ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26006) mllib Prefixspan
[ https://issues.apache.org/jira/browse/SPARK-26006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-26006: -- Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) > mllib Prefixspan > > > Key: SPARK-26006 > URL: https://issues.apache.org/jira/browse/SPARK-26006 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.3.0 > Environment: Unit test running on windows >Reporter: idan Levi >Assignee: shahid >Priority: Minor > Fix For: 3.0.0 > > > MLlib's PrefixSpan run method: the cached RDD stays in cache. > val dataInternalRepr = toDatabaseInternalRepr(data, itemToInt) > .persist(StorageLevel.MEMORY_AND_DISK) > After run is completed, the RDD remains in cache. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26006) mllib Prefixspan
[ https://issues.apache.org/jira/browse/SPARK-26006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26006. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23016 [https://github.com/apache/spark/pull/23016] > mllib Prefixspan > > > Key: SPARK-26006 > URL: https://issues.apache.org/jira/browse/SPARK-26006 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.3.0 > Environment: Unit test running on windows >Reporter: idan Levi >Assignee: shahid >Priority: Major > Fix For: 3.0.0 > > > MLlib's PrefixSpan run method: the cached RDD stays in cache. > val dataInternalRepr = toDatabaseInternalRepr(data, itemToInt) > .persist(StorageLevel.MEMORY_AND_DISK) > After run is completed, the RDD remains in cache. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
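The leak reported here can be avoided by releasing the cached intermediate RDD when `run` finishes. The shape of that pattern, sketched with a `HashSet` standing in for Spark's persist/unpersist (names are illustrative, not PrefixSpan's actual code):

```java
// try/finally guarantees the cache entry is released when run()
// completes, even if the mining work throws. A HashSet stands in
// for Spark's RDD persistence; all names here are illustrative.
import java.util.HashSet;
import java.util.Set;

class UnpersistDemo {
    static final Set<String> CACHED = new HashSet<>();

    static int run(int[] data) {
        CACHED.add("dataInternalRepr"); // stand-in for persist(StorageLevel.MEMORY_AND_DISK)
        try {
            int sum = 0;                // stand-in for the actual mining work
            for (int x : data) sum += x;
            return sum;
        } finally {
            CACHED.remove("dataInternalRepr"); // unpersist once run() completes
        }
    }
}
```

Without the `finally`, the cache entry would survive `run`, which is exactly the "RDD remains in cache" behavior the issue describes.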
[jira] [Commented] (SPARK-23410) Unable to read jsons in charset different from UTF-8
[ https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690601#comment-16690601 ] xuqianjin commented on SPARK-23410: --- I want to ask whether this bug is still being fixed; I would like to try to fix it. > Unable to read jsons in charset different from UTF-8 > > > Key: SPARK-23410 > URL: https://issues.apache.org/jira/browse/SPARK-23410 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Priority: Major > Attachments: utf16WithBOM.json > > > Currently the JSON parser is forced to read JSON files in UTF-8. Such > behavior breaks backward compatibility with Spark 2.2.1 and previous versions, > which could read JSON files in UTF-16, UTF-32 and other encodings thanks to > the auto-detection mechanism of the Jackson library. Users need to get back > the possibility to read JSON files in a specified charset and/or to detect > the charset automatically, as before. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14492) Spark SQL 1.6.0 does not work with external Hive metastore version lower than 1.2.0; its not backwards compatible with earlier version
[ https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690595#comment-16690595 ] Sunil Rangwani commented on SPARK-14492: [~srowen] This is exactly not that: it is not about building with varying versions of Hive. Please refer to the discussions above. The {{java.lang.NoSuchFieldError}} is a runtime error. Either the documentation should be updated to state that the minimum supported version of Hive is 1.2.x, or this bug should be fixed to support different versions of Hive as the documentation states. > Spark SQL 1.6.0 does not work with external Hive metastore version lower than > 1.2.0; its not backwards compatible with earlier version > -- > > Key: SPARK-14492 > URL: https://issues.apache.org/jira/browse/SPARK-14492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Sunil Rangwani >Priority: Critical > > Spark SQL when configured with a Hive version lower than 1.2.0 throws a > java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME > because this field was introduced in Hive 1.2.0, so it is not possible to use > a Hive metastore version lower than 1.2.0 with Spark. 
The details of the Hive > changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 > {code:java} > Exception in thread "main" java.lang.NoSuchFieldError: > METASTORE_CLIENT_SOCKET_LIFETIME > at > org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250) > at > org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237) > at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441) > at > org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272) > at > org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.sql.SQLContext.(SQLContext.scala:271) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) > at > 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
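The documented mechanism for talking to an older metastore is to point Spark at the metastore client version actually deployed instead of the built-in one; a sketch with placeholder values (the version and jar paths below are illustrative, not taken from this issue):

{code:java}
// spark.sql.hive.metastore.version must match the external metastore;
// spark.sql.hive.metastore.jars points at client jars of that version.
spark-submit \
  --conf spark.sql.hive.metastore.version=1.1.0 \
  --conf spark.sql.hive.metastore.jars=/opt/hive-1.1.0/lib/*:/opt/hadoop/lib/* \
  ...
{code}

Whether these settings avoid the NoSuchFieldError on Spark 1.6.0 specifically is exactly what this issue disputes, so treat this as a sketch of the intended mechanism rather than a confirmed fix.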
[jira] [Comment Edited] (SPARK-14492) Spark SQL 1.6.0 does not work with external Hive metastore version lower than 1.2.0; its not backwards compatible with earlier version
[ https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690595#comment-16690595 ] Sunil Rangwani edited comment on SPARK-14492 at 11/17/18 3:27 PM: -- [~srowen] This is exactly not that: it is not about building with varying versions of Hive. Please refer to the discussions above. The {{java.lang.NoSuchFieldError}} is a runtime error. The version used at runtime does not have this field! Either the documentation should be updated to state that the minimum supported version of Hive is 1.2.x, or this bug should be fixed to support different versions of Hive as the documentation states. was (Author: sunil.rangwani): [~srowen] This is exactly not that. It is not about building with varying versions of Hive. Please refer to the discussions above. The {{java.lang.NoSuchFieldError}} is a runtime error. Really the documentation should be updated to say the minimum supported version of Hive is 1.2.x or this bug should be fixed to support different versions of Hive as the documentation states. > Spark SQL 1.6.0 does not work with external Hive metastore version lower than > 1.2.0; its not backwards compatible with earlier version > -- > > Key: SPARK-14492 > URL: https://issues.apache.org/jira/browse/SPARK-14492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Sunil Rangwani >Priority: Critical > > Spark SQL when configured with a Hive version lower than 1.2.0 throws a > java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME > because this field was introduced in Hive 1.2.0, so it is not possible to use > a Hive metastore version lower than 1.2.0 with Spark. 
The details of the Hive > changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 > {code:java} > Exception in thread "main" java.lang.NoSuchFieldError: > METASTORE_CLIENT_SOCKET_LIFETIME > at > org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250) > at > org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237) > at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441) > at > org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272) > at > org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.sql.SQLContext.(SQLContext.scala:271) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) > at > 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25067) Active tasks does not match the total cores of an executor in WebUI
[ https://issues.apache.org/jira/browse/SPARK-25067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690560#comment-16690560 ] shahid commented on SPARK-25067: Hi [~stanzhai], could you please provide a reproducible test case? > Active tasks does not match the total cores of an executor in WebUI > --- > > Key: SPARK-25067 > URL: https://issues.apache.org/jira/browse/SPARK-25067 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.2.2, 2.3.0, 2.3.1 >Reporter: StanZhai >Priority: Major > Attachments: WX20180810-144212.png, WechatIMG1.jpeg > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26100) [History server ]Jobs table and Aggregate metrics table are showing lesser number of tasks
[ https://issues.apache.org/jira/browse/SPARK-26100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26100: Assignee: (was: Apache Spark) > [History server ]Jobs table and Aggregate metrics table are showing lesser > number of tasks > --- > > Key: SPARK-26100 > URL: https://issues.apache.org/jira/browse/SPARK-26100 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.3.2 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > Attachments: Screenshot from 2018-11-17 16-54-42.png, Screenshot from > 2018-11-17 16-55-09.png > > > Test steps to reproduce: > 1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}} > 2) {{sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad executor")}.collect()}} > > 3) Open the application from the history server UI. > The Jobs table and Aggregated metrics table show a smaller number of tasks than were run. > !Screenshot from 2018-11-17 16-55-09.png! > > !Screenshot from 2018-11-17 16-54-42.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26100) [History server ]Jobs table and Aggregate metrics table are showing lesser number of tasks
[ https://issues.apache.org/jira/browse/SPARK-26100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690553#comment-16690553 ] Apache Spark commented on SPARK-26100: -- User 'shahidki31' has created a pull request for this issue: https://github.com/apache/spark/pull/23038 > [History server ]Jobs table and Aggregate metrics table are showing lesser > number of tasks > --- > > Key: SPARK-26100 > URL: https://issues.apache.org/jira/browse/SPARK-26100 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.3.2 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > Attachments: Screenshot from 2018-11-17 16-54-42.png, Screenshot from > 2018-11-17 16-55-09.png > > > Test steps to reproduce: > 1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}} > 2) {{sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad executor")}.collect()}} > > 3) Open the application from the history server UI. > The Jobs table and Aggregated metrics table show a smaller number of tasks than were run. > !Screenshot from 2018-11-17 16-55-09.png! > > !Screenshot from 2018-11-17 16-54-42.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26100) [History server ]Jobs table and Aggregate metrics table are showing lesser number of tasks
[ https://issues.apache.org/jira/browse/SPARK-26100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26100: Assignee: Apache Spark > [History server ]Jobs table and Aggregate metrics table are showing lesser > number of tasks > --- > > Key: SPARK-26100 > URL: https://issues.apache.org/jira/browse/SPARK-26100 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.3.2 >Reporter: ABHISHEK KUMAR GUPTA >Assignee: Apache Spark >Priority: Major > Attachments: Screenshot from 2018-11-17 16-54-42.png, Screenshot from > 2018-11-17 16-55-09.png > > > Test steps to reproduce: > 1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}} > 2) {{sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad executor")}.collect()}} > > 3) Open the application from the history server UI. > The Jobs table and Aggregated metrics table show a smaller number of tasks than were run. > !Screenshot from 2018-11-17 16-55-09.png! > > !Screenshot from 2018-11-17 16-54-42.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26101) Spark Pipe() executes the external app by yarn user not the real user
[ https://issues.apache.org/jira/browse/SPARK-26101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maziyar PANAHI updated SPARK-26101: --- Description: Hello, I am using *Spark 2.3.0.cloudera3* on Cloudera cluster. When I start my Spark session (Zeppelin, Shell, or spark-submit) my real username is being impersonated successfully. That allows YARN to use the right queue based on the username, also HDFS knows the permissions. (These all work perfectly without any problem. Meaning the cluster has been set up and configured for user impersonation) Example (running Spark by user `panahi` with YARN as a master): {code:java} 18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls to: panahi 18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls to: panahi 18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls groups to: 18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls groups to: 18/11/17 13:55:47 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mpanahi); groups with view permissions: Set(); users with modify permissions: Set(panahi); groups with modify permissions: Set() ... 18/11/17 13:55:52 INFO yarn.Client: client token: N/A diagnostics: N/A ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: root.multivac start time: 1542459353040 final status: UNDEFINED tracking URL: http://hadoop-master-1:8088/proxy/application_1542456252041_0006/ user: panahi {code} However, when I use *Spark RDD Pipe()* it is being executed as `*yarn*` user. This makes it impossible to use an external app such as `c/c++` application that needs read/write access to HDFS because the user `*yarn*` does not have permissions on the user's directory. 
(Executing all external apps as the yarn user also raises other security and resource-management issues.) *How to produce this issue:* {code:java} val test = sc.parallelize(Seq("test user")).repartition(1) val piped = test.pipe(Seq("whoami")) val c = piped.collect() result: test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at repartition at :37 piped: org.apache.spark.rdd.RDD[String] = PipedRDD[27] at pipe at :37 c: Array[String] = Array(yarn) {code} Since Spark is the actor that invokes this execution inside the YARN cluster, I believe Spark needs to respect the actual/current username. Or maybe there is another config for impersonation between Spark and YARN in this situation, but I haven't found any. Many thanks. was: Hello, I am using *Spark 2.3.0.cloudera3* on Cloudera cluster. When I start my Spark session (Zeppelin, Shell, or spark-submit) my real username is being impersonated successfully. That allows YARN to use the right queue based on the username, also HDFS knows the permissions. Example (running Spark by user `panahi`): {code:java} 18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls to: panahi 18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls to: panahi 18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls groups to: 18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls groups to: 18/11/17 13:55:47 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mpanahi); groups with view permissions: Set(); users with modify permissions: Set(panahi); groups with modify permissions: Set() ... 
18/11/17 13:55:52 INFO yarn.Client: client token: N/A diagnostics: N/A ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: root.multivac start time: 1542459353040 final status: UNDEFINED tracking URL: http://hadoop-master-1:8088/proxy/application_1542456252041_0006/ user: panahi {code} However, when I use Spark RDD Pipe() it is being executed as `yarn` user. This makes it impossible to use a `c/c++` application that needs read/write access to HDFS because the user `yarn` does not have permissions on the user's directory. How to produce this issue: {code:java} val test = sc.parallelize(Seq("test user")).repartition(1) val piped = test.pipe(Seq("whoami")) val c = piped.collect() result: test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at repartition at :37 piped: org.apache.spark.rdd.RDD[String] = PipedRDD[27] at pipe at :37 c: Array[String] = Array(yarn) {code} I believe since Spark is the key actor to invoke this execution inside YARN cluster, Spark needs to respect the actual/current username. Or maybe there is another config for impersonation between Spark and YARN in this situation, but I haven't found any. Many thanks. > Spark Pipe() executes the external app by yarn user not the real user > - > > Key: SPARK-26101 > URL: https://issues.apache.org/jira/browse/SPARK-26101 >
[jira] [Updated] (SPARK-26101) Spark Pipe() executes the external app by yarn user not the real user
[ https://issues.apache.org/jira/browse/SPARK-26101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maziyar PANAHI updated SPARK-26101: --- Description: Hello, I am using *Spark 2.3.0.cloudera3* on Cloudera cluster. When I start my Spark session (Zeppelin, Shell, or spark-submit) my real username is being impersonated successfully. That allows YARN to use the right queue based on the username, also HDFS knows the permissions. Example (running Spark by user `panahi`): {code:java} 18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls to: panahi 18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls to: panahi 18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls groups to: 18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls groups to: 18/11/17 13:55:47 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mpanahi); groups with view permissions: Set(); users with modify permissions: Set(panahi); groups with modify permissions: Set() ... 18/11/17 13:55:52 INFO yarn.Client: client token: N/A diagnostics: N/A ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: root.multivac start time: 1542459353040 final status: UNDEFINED tracking URL: http://hadoop-master-1:8088/proxy/application_1542456252041_0006/ user: panahi {code} However, when I use Spark RDD Pipe() it is being executed as `yarn` user. This makes it impossible to use a `c/c++` application that needs read/write access to HDFS because the user `yarn` does not have permissions on the user's directory. 
How to produce this issue: {code:java} val test = sc.parallelize(Seq("test user")).repartition(1) val piped = test.pipe(Seq("whoami")) val c = piped.collect() result: test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at repartition at :37 piped: org.apache.spark.rdd.RDD[String] = PipedRDD[27] at pipe at :37 c: Array[String] = Array(yarn) {code} I believe since Spark is the key actor to invoke this execution inside YARN cluster, Spark needs to respect the actual/current username. Or maybe there is another config for impersonation between Spark and YARN in this situation, but I haven't found any. Many thanks. was: Hello, I am using *Spark 2.3.0.cloudera3* on Cloudera cluster. When I start my Spark session (Zeppelin, Shell, or spark-submit) my real username is being impersonated successfully. That allows YARN to use the right queue based on the username, also HDFS knows the permissions. Example (running Spark by user `panahi`): ``` 18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls to: *panahi* 18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls to: *panahi* 18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls groups to: 18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls groups to: 18/11/17 13:55:47 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mpanahi); groups with view permissions: Set(); users with modify permissions: Set(*panahi*); groups with modify permissions: Set() ... 18/11/17 13:55:52 INFO yarn.Client: client token: N/A diagnostics: N/A ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: root.multivac start time: 1542459353040 final status: UNDEFINED tracking URL: [http://hadoop-master-1:8088/proxy/application_1542456252041_0006/] user: *panahi* ``` However, when I use Spark RDD Pipe() it is being executed as `yarn` user. 
This makes it impossible to use a `c/c++` application that needs read/write access to HDFS because the user `yarn` does not have permissions on the user's directory. How to produce this issue: ``` val test = sc.parallelize(Seq("test user")).repartition(1) val piped = test.pipe(Seq("whoami")) val c = piped.collect() *result:* test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at repartition at :37 piped: org.apache.spark.rdd.RDD[String] = PipedRDD[27] at pipe at :37 c: Array[String] = Array(*yarn*) ``` I believe since Spark is the key actor to invoke this execution inside YARN cluster, Spark needs to respect the actual/current username. Or maybe there is another config for impersonation between Spark and YARN in this situation, but I haven't found any. Many thanks. > Spark Pipe() executes the external app by yarn user not the real user > - > > Key: SPARK-26101 > URL: https://issues.apache.org/jira/browse/SPARK-26101 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.3.0 >Reporter: Maziyar PANAHI >Priority: Major > > Hello, > > I am using *Spark 2.3.0.cloudera3* on Cloudera cluster. When I start my Spark > session
[jira] [Updated] (SPARK-26101) Spark Pipe() executes the external app by yarn user not the real user
[ https://issues.apache.org/jira/browse/SPARK-26101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maziyar PANAHI updated SPARK-26101: --- Description: Hello, I am using *Spark 2.3.0.cloudera3* on Cloudera cluster. When I start my Spark session (Zeppelin, Shell, or spark-submit) my real username is being impersonated successfully. That allows YARN to use the right queue based on the username, also HDFS knows the permissions. (These all work perfectly without any problem. Meaning the cluster has been set up and configured for user impersonation) Example (running Spark by user panahi with YARN as a master): {code:java} 18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls to: panahi 18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls to: panahi 18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls groups to: 18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls groups to: 18/11/17 13:55:47 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mpanahi); groups with view permissions: Set(); users with modify permissions: Set(panahi); groups with modify permissions: Set() ... 18/11/17 13:55:52 INFO yarn.Client: client token: N/A diagnostics: N/A ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: root.multivac start time: 1542459353040 final status: UNDEFINED tracking URL: http://hadoop-master-1:8088/proxy/application_1542456252041_0006/ user: panahi {code} However, when I use *Spark RDD Pipe()* it is being executed as `*yarn*` user. This makes it impossible to use an external app such as `c/c++` application that needs read/write access to HDFS because the user `*yarn*` does not have permissions on the user's directory. 
(Executing all external apps as the yarn user also raises other security and resource-management issues.) *How to produce this issue:* {code:java} val test = sc.parallelize(Seq("test user")).repartition(1) val piped = test.pipe(Seq("whoami")) val c = piped.collect() result: test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at repartition at :37 piped: org.apache.spark.rdd.RDD[String] = PipedRDD[27] at pipe at :37 c: Array[String] = Array(yarn) {code} Since Spark is the actor that invokes this execution inside the YARN cluster, I believe Spark needs to respect the actual/current username. Or maybe there is another config for impersonation between Spark and YARN in this situation, but I haven't found any. Many thanks. was: Hello, I am using *Spark 2.3.0.cloudera3* on Cloudera cluster. When I start my Spark session (Zeppelin, Shell, or spark-submit) my real username is being impersonated successfully. That allows YARN to use the right queue based on the username, also HDFS knows the permissions. (These all work perfectly without any problem. Meaning the cluster has been set up and configured for user impersonation) Example (running Spark by user `panahi` with YARN as a master): {code:java} 18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls to: panahi 18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls to: panahi 18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls groups to: 18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls groups to: 18/11/17 13:55:47 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mpanahi); groups with view permissions: Set(); users with modify permissions: Set(panahi); groups with modify permissions: Set() ... 
18/11/17 13:55:52 INFO yarn.Client: client token: N/A diagnostics: N/A ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: root.multivac start time: 1542459353040 final status: UNDEFINED tracking URL: http://hadoop-master-1:8088/proxy/application_1542456252041_0006/ user: panahi {code} However, when I use *Spark RDD Pipe()* it is being executed as `*yarn*` user. This makes it impossible to use an external app such as `c/c++` application that needs read/write access to HDFS because the user `*yarn*` does not have permissions on the user's directory. (also other security and resource management issues by executing all the external apps as yarn username) *How to produce this issue:* {code:java} val test = sc.parallelize(Seq("test user")).repartition(1) val piped = test.pipe(Seq("whoami")) val c = piped.collect() result: test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at repartition at :37 piped: org.apache.spark.rdd.RDD[String] = PipedRDD[27] at pipe at :37 c: Array[String] = Array(yarn) {code} I believe since Spark is the key actor to invoke this execution inside YARN cluster, Spark needs to respect the actual/current username. Or maybe there is another config for impersonation between Spark and YARN in this situation, but I haven't found any.
[jira] [Updated] (SPARK-26100) [History server ]Jobs table and Aggregate metrics table are showing lesser number of tasks
[ https://issues.apache.org/jira/browse/SPARK-26100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shahid updated SPARK-26100: --- Summary: [History server ]Jobs table and Aggregate metrics table are showing lesser number of tasks (was: Jobs table and Aggregate metrics table are showing lesser number of tasks ) > [History server ]Jobs table and Aggregate metrics table are showing lesser > number of tasks > --- > > Key: SPARK-26100 > URL: https://issues.apache.org/jira/browse/SPARK-26100 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.3.2 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > Attachments: Screenshot from 2018-11-17 16-54-42.png, Screenshot from > 2018-11-17 16-55-09.png > > > Test steps to reproduce: > 1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}} > 2) {{sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad executor")}.collect()}} > > 3) Open the application from the history server UI. > The Jobs table and Aggregated metrics table show a smaller number of tasks than were run. > !Screenshot from 2018-11-17 16-55-09.png! > > !Screenshot from 2018-11-17 16-54-42.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26101) Spark Pipe() executes the external app by yarn user not the real user
[ https://issues.apache.org/jira/browse/SPARK-26101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maziyar PANAHI updated SPARK-26101: --- Description: Hello, I am using *Spark 2.3.0.cloudera3* on Cloudera cluster. When I start my Spark session (Zeppelin, Shell, or spark-submit) my real username is being impersonated successfully. That allows YARN to use the right queue based on the username, also HDFS knows the permissions. Example (running Spark by user `panahi`): ``` 18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls to: *panahi* 18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls to: *panahi* 18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls groups to: 18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls groups to: 18/11/17 13:55:47 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mpanahi); groups with view permissions: Set(); users with modify permissions: Set(*panahi*); groups with modify permissions: Set() ... 18/11/17 13:55:52 INFO yarn.Client: client token: N/A diagnostics: N/A ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: root.multivac start time: 1542459353040 final status: UNDEFINED tracking URL: [http://hadoop-master-1:8088/proxy/application_1542456252041_0006/] user: *panahi* ``` However, when I use Spark RDD Pipe() it is being executed as `yarn` user. This makes it impossible to use a `c/c++` application that needs read/write access to HDFS because the user `yarn` does not have permissions on the user's directory. 
How to reproduce this issue: ```scala val test = sc.parallelize(Seq("test user")).repartition(1) val piped = test.pipe(Seq("whoami")) val c = piped.collect() ``` *Result:* test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at repartition at :37 piped: org.apache.spark.rdd.RDD[String] = PipedRDD[27] at pipe at :37 c: Array[String] = Array(*yarn*) I believe that since Spark is the key actor invoking this execution inside the YARN cluster, Spark needs to respect the actual/current username. Or maybe there is another configuration for impersonation between Spark and YARN in this situation, but I haven't found one. Many thanks. 
> Spark Pipe() executes the external app by yarn user not the real user > - > > Key: SPARK-26101 > URL: https://issues.apache.org/jira/browse/SPARK-26101 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.3.0 >Reporter: Maziyar PANAHI >Priority: Major > > Hello, > > I am using *Spark 2.3.0.cloudera3* on Cloudera cluster. When I start my Spark > session
[jira] [Created] (SPARK-26101) Spark Pipe() executes the external app by yarn user not the real user
Maziyar PANAHI created SPARK-26101: -- Summary: Spark Pipe() executes the external app by yarn user not the real user Key: SPARK-26101 URL: https://issues.apache.org/jira/browse/SPARK-26101 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 2.3.0 Reporter: Maziyar PANAHI Hello, I am using `Spark 2.3.0.cloudera3` on a Cloudera cluster. When I start my Spark session (Zeppelin, shell, or spark-submit), my real username is impersonated successfully. That allows YARN to use the right queue based on the username, and HDFS to apply the right permissions. Example (running Spark as user `panahi`): ``` 18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls to: *panahi* 18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls to: *panahi* 18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls groups to: 18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls groups to: 18/11/17 13:55:47 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mpanahi); groups with view permissions: Set(); users with modify permissions: Set(*panahi*); groups with modify permissions: Set() ... 18/11/17 13:55:52 INFO yarn.Client: client token: N/A diagnostics: N/A ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: root.multivac start time: 1542459353040 final status: UNDEFINED tracking URL: http://hadoop-master-1:8088/proxy/application_1542456252041_0006/ user: *panahi* ``` However, when I use Spark RDD Pipe(), the external process is executed as the `yarn` user. This makes it impossible to use a C/C++ application that needs read/write access to HDFS, because the user `yarn` does not have permissions on the user's directory. 
How to reproduce this issue: ```scala val test = sc.parallelize(Seq("test user")).repartition(1) val piped = test.pipe(Seq("whoami")) val c = piped.collect() ``` *Result:* test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at repartition at :37 piped: org.apache.spark.rdd.RDD[String] = PipedRDD[27] at pipe at :37 c: Array[String] = Array(*yarn*) I believe that since Spark is the key actor invoking this execution inside the YARN cluster, Spark needs to respect the actual/current username. Or maybe there is another configuration for impersonation between Spark and YARN in this situation, but I haven't found one. Many thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
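The behavior described above can be reproduced outside Spark: a piped child process simply runs as the OS user of the process that forks it, and on YARN that parent is the executor JVM inside the container, so the child sees the container's OS user (often `yarn`) rather than the Spark session user. A minimal Python sketch of what `RDD.pipe()` does at the OS level; this is illustrative only and not Spark's implementation:

```python
import getpass
import subprocess

def pipe_partition(cmd, lines):
    """Rough stand-in for RDD.pipe(): feed the partition's elements to an
    external command's stdin and return its stdout lines. The child process
    simply inherits the OS identity of whoever forks it; no
    re-impersonation happens at this level."""
    proc = subprocess.run(cmd, input="\n".join(lines),
                          capture_output=True, text=True, check=True)
    return proc.stdout.splitlines()

# Run locally, the piped command reports the same user as the parent
# process; on a YARN container it would report the container's OS user.
print(pipe_partition(["whoami"], ["test user"]))
print(getpass.getuser())
```

Whether the container user matches the submitting user depends on the cluster setup (for example, a secure cluster launching containers as the submitter), which is why the same code behaves differently on different deployments.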
[jira] [Updated] (SPARK-26100) Jobs table and Aggregate metrics table are showing lesser number of tasks
[ https://issues.apache.org/jira/browse/SPARK-26100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shahid updated SPARK-26100: --- Description: Test step to reproduce: 1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}} 2)\{{sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad executor")}.collect() }} 3) Open Application from the history server UI Jobs table and Aggregated metrics are showing lesser number of tasks. !Screenshot from 2018-11-17 16-55-09.png! !Screenshot from 2018-11-17 16-54-42.png! was: Test step to reproduce: 1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}} 2)\{{sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad executor")}.collect() }} 3) Open Application from the history server UI Jobs table and Aggregated metrics are showing lesser number of tasks. !Screenshot from 2018-11-17 16-54-42.png! > Jobs table and Aggregate metrics table are showing lesser number of tasks > -- > > Key: SPARK-26100 > URL: https://issues.apache.org/jira/browse/SPARK-26100 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.3.2 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > Attachments: Screenshot from 2018-11-17 16-54-42.png, Screenshot from > 2018-11-17 16-54-42.png, Screenshot from 2018-11-17 16-55-09.png > > > Test step to reproduce: > 1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}} > 2)\{{sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad > executor")}.collect() }} > > 3) Open Application from the history server UI > Jobs table and Aggregated metrics are showing lesser number of tasks. > !Screenshot from 2018-11-17 16-55-09.png! > > > > > !Screenshot from 2018-11-17 16-54-42.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26099) Verification of the corrupt column in from_csv/from_json
[ https://issues.apache.org/jira/browse/SPARK-26099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26099: Assignee: (was: Apache Spark) > Verification of the corrupt column in from_csv/from_json > > > Key: SPARK-26099 > URL: https://issues.apache.org/jira/browse/SPARK-26099 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Minor > > The corrupt column specified via JSON/CSV option *columnNameOfCorruptRecord* > must be of string type and not nullable. The checking does exist in > DataFrameReader and JSON/CSVFileFormat, and the same should be added to > CsvToStructs and to JsonToStructs -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26099) Verification of the corrupt column in from_csv/from_json
[ https://issues.apache.org/jira/browse/SPARK-26099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26099: Assignee: Apache Spark > Verification of the corrupt column in from_csv/from_json > > > Key: SPARK-26099 > URL: https://issues.apache.org/jira/browse/SPARK-26099 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor > > The corrupt column specified via JSON/CSV option *columnNameOfCorruptRecord* > must be of string type and not nullable. The checking does exist in > DataFrameReader and JSON/CSVFileFormat, and the same should be added to > CsvToStructs and to JsonToStructs -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26091) Upgrade to 2.3.4 for Hive Metastore Client 2.3
[ https://issues.apache.org/jira/browse/SPARK-26091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-26091. --- Resolution: Fixed Assignee: Dongjoon Hyun Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/23059 > Upgrade to 2.3.4 for Hive Metastore Client 2.3 > -- > > Key: SPARK-26091 > URL: https://issues.apache.org/jira/browse/SPARK-26091 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26100) Jobs table and Aggregate metrics table are showing lesser number of tasks
[ https://issues.apache.org/jira/browse/SPARK-26100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shahid updated SPARK-26100: --- Attachment: (was: Screenshot from 2018-11-17 16-54-42.png) > Jobs table and Aggregate metrics table are showing lesser number of tasks > -- > > Key: SPARK-26100 > URL: https://issues.apache.org/jira/browse/SPARK-26100 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.3.2 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > Attachments: Screenshot from 2018-11-17 16-54-42.png, Screenshot from > 2018-11-17 16-55-09.png > > > Test step to reproduce: > 1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}} > 2)\{{sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad > executor")}.collect() }} > > 3) Open Application from the history server UI > Jobs table and Aggregated metrics are showing lesser number of tasks. > !Screenshot from 2018-11-17 16-55-09.png! > > !Screenshot from 2018-11-17 16-54-42.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26099) Verification of the corrupt column in from_csv/from_json
[ https://issues.apache.org/jira/browse/SPARK-26099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690498#comment-16690498 ] Apache Spark commented on SPARK-26099: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/23070 > Verification of the corrupt column in from_csv/from_json > > > Key: SPARK-26099 > URL: https://issues.apache.org/jira/browse/SPARK-26099 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Minor > > The corrupt column specified via JSON/CSV option *columnNameOfCorruptRecord* > must be of string type and not nullable. The checking does exist in > DataFrameReader and JSON/CSVFileFormat, and the same should be added to > CsvToStructs and to JsonToStructs -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26026) Published Scaladoc jars missing from Maven Central
[ https://issues.apache.org/jira/browse/SPARK-26026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690496#comment-16690496 ] Apache Spark commented on SPARK-26026: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/23069 > Published Scaladoc jars missing from Maven Central > -- > > Key: SPARK-26026 > URL: https://issues.apache.org/jira/browse/SPARK-26026 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Long Cao >Priority: Minor > > For 2.3.x and beyond, it appears that published *-javadoc.jars are missing. > For concrete examples: > * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/] > * > [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.1/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/] > * > [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.2/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/] > * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.4.0/] > * > [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.12/2.4.0/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/] > After some searching, I'm venturing a guess that [this > commit|https://github.com/apache/spark/commit/12ab7f7e89ec9e102859ab3b710815d3058a2e8d#diff-600376dffeb79835ede4a0b285078036L2033] > removed packaging Scaladoc with the rest of the distribution. > I don't think it's a huge problem since the versioned Scaladocs are hosted on > apache.org, but I use an external documentation/search tool > ([Dash|https://kapeli.com/dash]) that operates by looking up published > javadoc jars and it'd be nice to have these available. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
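The links in the report follow Maven's standard repository layout, under which the `-javadoc.jar` classifier artifact sits next to the main jar at a predictable path. A small illustrative helper (not part of any Spark tooling) that enumerates the URLs where the missing Scaladoc jars would be expected:

```python
def javadoc_jar_url(group_id, artifact_id, version,
                    repo="https://repo1.maven.org/maven2"):
    """Build the URL where Maven's standard repository layout places the
    `-javadoc.jar` classifier artifact for a given coordinate."""
    group_path = group_id.replace(".", "/")
    return (f"{repo}/{group_path}/{artifact_id}/{version}/"
            f"{artifact_id}-{version}-javadoc.jar")

# URLs that should exist if the Scaladoc jars had been published:
for v in ["2.3.0", "2.3.1", "2.3.2", "2.4.0"]:
    print(javadoc_jar_url("org.apache.spark", "spark-core_2.11", v))
```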
[jira] [Updated] (SPARK-26100) Jobs table and Aggregate metrics table are showing lesser number of tasks
[ https://issues.apache.org/jira/browse/SPARK-26100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shahid updated SPARK-26100: --- Attachment: (was: Screenshot from 2018-11-17 16-55-09.png) > Jobs table and Aggregate metrics table are showing lesser number of tasks > -- > > Key: SPARK-26100 > URL: https://issues.apache.org/jira/browse/SPARK-26100 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.3.2 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > Attachments: Screenshot from 2018-11-17 16-54-42.png, Screenshot from > 2018-11-17 16-54-42.png, Screenshot from 2018-11-17 16-55-09.png > > > Test step to reproduce: > 1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}} > 2)\{{sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad > executor")}.collect() }} > > 3) Open Application from the history server UI > Jobs table and Aggregated metrics are showing lesser number of tasks. > !Screenshot from 2018-11-17 16-55-09.png! > > !Screenshot from 2018-11-17 16-54-42.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26026) Published Scaladoc jars missing from Maven Central
[ https://issues.apache.org/jira/browse/SPARK-26026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690495#comment-16690495 ] Apache Spark commented on SPARK-26026: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/23069 > Published Scaladoc jars missing from Maven Central > -- > > Key: SPARK-26026 > URL: https://issues.apache.org/jira/browse/SPARK-26026 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Long Cao >Priority: Minor > > For 2.3.x and beyond, it appears that published *-javadoc.jars are missing. > For concrete examples: > * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/] > * > [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.1/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/] > * > [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.2/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/] > * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.4.0/] > * > [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.12/2.4.0/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/] > After some searching, I'm venturing a guess that [this > commit|https://github.com/apache/spark/commit/12ab7f7e89ec9e102859ab3b710815d3058a2e8d#diff-600376dffeb79835ede4a0b285078036L2033] > removed packaging Scaladoc with the rest of the distribution. > I don't think it's a huge problem since the versioned Scaladocs are hosted on > apache.org, but I use an external documentation/search tool > ([Dash|https://kapeli.com/dash]) that operates by looking up published > javadoc jars and it'd be nice to have these available. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26100) Jobs table and Aggregate metrics table are showing lesser number of tasks
[ https://issues.apache.org/jira/browse/SPARK-26100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shahid updated SPARK-26100: --- Attachment: Screenshot from 2018-11-17 16-55-09.png > Jobs table and Aggregate metrics table are showing lesser number of tasks > -- > > Key: SPARK-26100 > URL: https://issues.apache.org/jira/browse/SPARK-26100 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.3.2 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > Attachments: Screenshot from 2018-11-17 16-54-42.png, Screenshot from > 2018-11-17 16-54-42.png, Screenshot from 2018-11-17 16-55-09.png > > > Test step to reproduce: > 1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}} > 2)\{{sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad > executor")}.collect() }} > > 3) Open Application from the history server UI > Jobs table and Aggregated metrics are showing lesser number of tasks. > > > > > !Screenshot from 2018-11-17 16-54-42.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26026) Published Scaladoc jars missing from Maven Central
[ https://issues.apache.org/jira/browse/SPARK-26026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26026: Assignee: (was: Apache Spark) > Published Scaladoc jars missing from Maven Central > -- > > Key: SPARK-26026 > URL: https://issues.apache.org/jira/browse/SPARK-26026 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Long Cao >Priority: Minor > > For 2.3.x and beyond, it appears that published *-javadoc.jars are missing. > For concrete examples: > * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/] > * > [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.1/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/] > * > [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.2/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/] > * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.4.0/] > * > [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.12/2.4.0/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/] > After some searching, I'm venturing a guess that [this > commit|https://github.com/apache/spark/commit/12ab7f7e89ec9e102859ab3b710815d3058a2e8d#diff-600376dffeb79835ede4a0b285078036L2033] > removed packaging Scaladoc with the rest of the distribution. > I don't think it's a huge problem since the versioned Scaladocs are hosted on > apache.org, but I use an external documentation/search tool > ([Dash|https://kapeli.com/dash]) that operates by looking up published > javadoc jars and it'd be nice to have these available. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26100) Jobs table and Aggregate metrics table are showing lesser number of tasks
[ https://issues.apache.org/jira/browse/SPARK-26100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shahid updated SPARK-26100: --- Description: Test step to reproduce: 1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}} 2)\{{sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad executor")}.collect() }} 3) Open Application from the history server UI Jobs table and Aggregated metrics are showing lesser number of tasks. !Screenshot from 2018-11-17 16-55-09.png! !Screenshot from 2018-11-17 16-54-42.png! was: Test step to reproduce: 1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}} 2)\{{sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad executor")}.collect() }} 3) Open Application from the history server UI Jobs table and Aggregated metrics are showing lesser number of tasks. !Screenshot from 2018-11-17 16-55-09.png! !Screenshot from 2018-11-17 16-54-42.png! > Jobs table and Aggregate metrics table are showing lesser number of tasks > -- > > Key: SPARK-26100 > URL: https://issues.apache.org/jira/browse/SPARK-26100 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.3.2 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > Attachments: Screenshot from 2018-11-17 16-54-42.png, Screenshot from > 2018-11-17 16-54-42.png, Screenshot from 2018-11-17 16-55-09.png > > > Test step to reproduce: > 1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}} > 2)\{{sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad > executor")}.collect() }} > > 3) Open Application from the history server UI > Jobs table and Aggregated metrics are showing lesser number of tasks. > !Screenshot from 2018-11-17 16-55-09.png! > > !Screenshot from 2018-11-17 16-54-42.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26026) Published Scaladoc jars missing from Maven Central
[ https://issues.apache.org/jira/browse/SPARK-26026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26026: Assignee: Apache Spark > Published Scaladoc jars missing from Maven Central > -- > > Key: SPARK-26026 > URL: https://issues.apache.org/jira/browse/SPARK-26026 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Long Cao >Assignee: Apache Spark >Priority: Minor > > For 2.3.x and beyond, it appears that published *-javadoc.jars are missing. > For concrete examples: > * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/] > * > [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.1/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/] > * > [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.2/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/] > * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.4.0/] > * > [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.12/2.4.0/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/] > After some searching, I'm venturing a guess that [this > commit|https://github.com/apache/spark/commit/12ab7f7e89ec9e102859ab3b710815d3058a2e8d#diff-600376dffeb79835ede4a0b285078036L2033] > removed packaging Scaladoc with the rest of the distribution. > I don't think it's a huge problem since the versioned Scaladocs are hosted on > apache.org, but I use an external documentation/search tool > ([Dash|https://kapeli.com/dash]) that operates by looking up published > javadoc jars and it'd be nice to have these available. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26100) Jobs table and Aggregate metrics table are showing lesser number of tasks
[ https://issues.apache.org/jira/browse/SPARK-26100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690494#comment-16690494 ] shahid commented on SPARK-26100: Thanks. I am working on it > Jobs table and Aggregate metrics table are showing lesser number of tasks > -- > > Key: SPARK-26100 > URL: https://issues.apache.org/jira/browse/SPARK-26100 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.3.2 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > Attachments: Screenshot from 2018-11-17 16-54-42.png, Screenshot from > 2018-11-17 16-54-42.png, Screenshot from 2018-11-17 16-55-09.png > > > Test step to reproduce: > 1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}} > 2){{sc.parallelize(1 to 1, 10).map\{ x => throw new RuntimeException("Bad > executor")}.collect() }} > > 3) Open Application from the history server UI > Jobs table and Aggregated metrics are showing lesser number of tasks. > > !image-2018-11-17-16-55-37-226.png! > > !image-2018-11-17-16-55-58-934.png! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26100) Jobs table and Aggregate metrics table are showing lesser number of tasks
[ https://issues.apache.org/jira/browse/SPARK-26100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shahid updated SPARK-26100: --- Attachment: Screenshot from 2018-11-17 16-54-42.png > Jobs table and Aggregate metrics table are showing lesser number of tasks > -- > > Key: SPARK-26100 > URL: https://issues.apache.org/jira/browse/SPARK-26100 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.3.2 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > Attachments: Screenshot from 2018-11-17 16-54-42.png, Screenshot from > 2018-11-17 16-54-42.png, Screenshot from 2018-11-17 16-55-09.png > > > Test step to reproduce: > 1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}} > 2){{sc.parallelize(1 to 1, 10).map\{ x => throw new RuntimeException("Bad > executor")}.collect() }} > > 3) Open Application from the history server UI > Jobs table and Aggregated metrics are showing lesser number of tasks. > > !image-2018-11-17-16-55-37-226.png! > > !image-2018-11-17-16-55-58-934.png! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26100) Jobs table and Aggregate metrics table are showing lesser number of tasks
[ https://issues.apache.org/jira/browse/SPARK-26100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shahid updated SPARK-26100: --- Description: Test step to reproduce: 1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}} 2)\{{sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad executor")}.collect() }} 3) Open Application from the history server UI Jobs table and Aggregated metrics are showing lesser number of tasks. !Screenshot from 2018-11-17 16-54-42.png! was: Test step to reproduce: 1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}} 2){{sc.parallelize(1 to 1, 10).map\{ x => throw new RuntimeException("Bad executor")}.collect() }} 3) Open Application from the history server UI Jobs table and Aggregated metrics are showing lesser number of tasks. !image-2018-11-17-16-55-37-226.png! !image-2018-11-17-16-55-58-934.png! > Jobs table and Aggregate metrics table are showing lesser number of tasks > -- > > Key: SPARK-26100 > URL: https://issues.apache.org/jira/browse/SPARK-26100 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.3.2 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > Attachments: Screenshot from 2018-11-17 16-54-42.png, Screenshot from > 2018-11-17 16-54-42.png, Screenshot from 2018-11-17 16-55-09.png > > > Test step to reproduce: > 1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}} > 2)\{{sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad > executor")}.collect() }} > > 3) Open Application from the history server UI > Jobs table and Aggregated metrics are showing lesser number of tasks. > > > > > !Screenshot from 2018-11-17 16-54-42.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26100) Jobs table and Aggregate metrics table are showing lesser number of tasks
[ https://issues.apache.org/jira/browse/SPARK-26100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shahid updated SPARK-26100: --- Attachment: Screenshot from 2018-11-17 16-55-09.png > Jobs table and Aggregate metrics table are showing lesser number of tasks > -- > > Key: SPARK-26100 > URL: https://issues.apache.org/jira/browse/SPARK-26100 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.3.2 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > Attachments: Screenshot from 2018-11-17 16-54-42.png, Screenshot from > 2018-11-17 16-54-42.png, Screenshot from 2018-11-17 16-55-09.png > > > Test step to reproduce: > 1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}} > 2){{sc.parallelize(1 to 1, 10).map\{ x => throw new RuntimeException("Bad > executor")}.collect() }} > > 3) Open Application from the history server UI > Jobs table and Aggregated metrics are showing lesser number of tasks. > > !image-2018-11-17-16-55-37-226.png! > > !image-2018-11-17-16-55-58-934.png! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26100) Jobs table and Aggregate metrics table are showing lesser number of tasks
[ https://issues.apache.org/jira/browse/SPARK-26100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shahid updated SPARK-26100: --- Attachment: Screenshot from 2018-11-17 16-54-42.png > Jobs table and Aggregate metrics table are showing lesser number of tasks > -- > > Key: SPARK-26100 > URL: https://issues.apache.org/jira/browse/SPARK-26100 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.3.2 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > Attachments: Screenshot from 2018-11-17 16-54-42.png, Screenshot from > 2018-11-17 16-54-42.png, Screenshot from 2018-11-17 16-55-09.png > > > Test step to reproduce: > 1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}} > 2){{sc.parallelize(1 to 1, 10).map\{ x => throw new RuntimeException("Bad > executor")}.collect() }} > > 3) Open Application from the history server UI > Jobs table and Aggregated metrics are showing lesser number of tasks. > > !image-2018-11-17-16-55-37-226.png! > > !image-2018-11-17-16-55-58-934.png! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26100) Jobs table and Aggregate metrics table are showing lesser number of tasks
ABHISHEK KUMAR GUPTA created SPARK-26100: Summary: Jobs table and Aggregate metrics table are showing lesser number of tasks Key: SPARK-26100 URL: https://issues.apache.org/jira/browse/SPARK-26100 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 2.3.2 Reporter: ABHISHEK KUMAR GUPTA Test steps to reproduce: 1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}} 2) {{sc.parallelize(1 to 1, 10).map { x => throw new RuntimeException("Bad executor") }.collect()}} 3) Open the application from the history server UI. The Jobs table and Aggregated metrics table show fewer tasks than expected. !image-2018-11-17-16-55-37-226.png! !image-2018-11-17-16-55-58-934.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
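For the repro above, a back-of-the-envelope count of the task rows the UI ought to display can make the symptom concrete. This is an illustrative model, not Spark code; it assumes every task of the 10-partition stage fails and is retried up to the default `spark.task.maxFailures` of 4 attempts:

```python
def expected_task_rows(num_partitions, max_attempts_per_task):
    """Every launched task attempt should show up as a row in the stage's
    task tables, so for a stage whose tasks all keep failing, the expected
    count is partitions x attempts, not just the partition count."""
    return num_partitions * max_attempts_per_task

# 10 partitions, each task retried up to 4 times before the job aborts:
print(expected_task_rows(10, 4))  # 40
```

If the UI tables show fewer rows than this, some attempts are being dropped from the aggregation.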
[jira] [Created] (SPARK-26099) Verification of the corrupt column in from_csv/from_json
Maxim Gekk created SPARK-26099:
--

Summary: Verification of the corrupt column in from_csv/from_json
Key: SPARK-26099
URL: https://issues.apache.org/jira/browse/SPARK-26099
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk

The corrupt column specified via the JSON/CSV option *columnNameOfCorruptRecord* must be of string type and nullable. This check already exists in DataFrameReader and in JSON/CSVFileFormat; the same check should be added to CsvToStructs and JsonToStructs.
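A minimal sketch of the kind of verification described above, assuming Spark's `StructType`/`StringType` API. The method name and error message are modeled on the existing DataFrameReader check; `IllegalArgumentException` stands in for Spark's internal exception type, and this is an illustration rather than the actual CsvToStructs/JsonToStructs patch:

```scala
import org.apache.spark.sql.types.{StringType, StructType}

// Sketch: reject a schema whose corrupt-record column is not a nullable string.
// Modeled on the check that already exists in DataFrameReader.
def verifyColumnNameOfCorruptRecord(
    schema: StructType,
    columnNameOfCorruptRecord: String): Unit = {
  // Only validate when the schema actually contains the corrupt column.
  schema.getFieldIndex(columnNameOfCorruptRecord).foreach { idx =>
    val f = schema(idx)
    if (f.dataType != StringType || !f.nullable) {
      throw new IllegalArgumentException(
        "The field for corrupt records must be string type and nullable")
    }
  }
}
```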
[jira] [Commented] (SPARK-26056) java api spark streaming spark-avro ui
[ https://issues.apache.org/jira/browse/SPARK-26056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690479#comment-16690479 ]

Gengliang Wang commented on SPARK-26056:
--

Did you use the Databricks spark-avro package or the built-in spark-avro of the 2.4 release? https://spark.apache.org/docs/latest/sql-data-sources-avro.html
If you are using the Databricks one, could you also try the built-in one from the 2.4 release? Otherwise, if you are using the built-in spark-avro, could you provide more details about how you use it? Thanks!

> java api spark streaming spark-avro ui
> ---
>
> Key: SPARK-26056
> URL: https://issues.apache.org/jira/browse/SPARK-26056
> Project: Spark
> Issue Type: Bug
> Components: SQL, Structured Streaming, Web UI
> Affects Versions: 2.3.2
> Reporter: wish
> Priority: Major
> Attachments: sql.jpg
>
> When I use the Java API of Spark Streaming to read from Kafka and save Avro (Databricks spark-avro dependency), the SQL tab in the Spark UI repeats again and again. With the Scala API there is no problem.
>
> A normal UI looks like this:
> * [Jobs|http://ebs-ali-beijing-datalake1:4044/jobs/]
> * [Stages|http://ebs-ali-beijing-datalake1:4044/stages/]
> * [Storage|http://ebs-ali-beijing-datalake1:4044/storage/]
> * [Environment|http://ebs-ali-beijing-datalake1:4044/environment/]
> * [Executors|http://ebs-ali-beijing-datalake1:4044/executors/]
> * [SQL|http://ebs-ali-beijing-datalake1:4044/SQL/]
> * [Streaming|http://ebs-ali-beijing-datalake1:4044/streaming/]
> but the Java API UI looks like this:
> Jobs Stages Storage Environment Executors SQL Streaming SQL SQL SQL SQL SQL SQL ..SQL
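For reference, the built-in Avro support of the 2.4 release is addressed through the `"avro"` format name rather than the Databricks package's `"com.databricks.spark.avro"`. A minimal sketch, assuming a running spark-shell session (`spark` is the SparkSession the shell provides) and with illustrative paths:

```scala
// Built-in spark-avro in Spark 2.4: launch the shell with
//   spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.0
// The input/output paths below are placeholders.
val df = spark.read.format("avro").load("/tmp/events.avro")
df.write.format("avro").save("/tmp/events-out")
```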
[jira] [Commented] (SPARK-26056) java api spark streaming spark-avro ui
[ https://issues.apache.org/jira/browse/SPARK-26056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690475#comment-16690475 ]

Hyukjin Kwon commented on SPARK-26056:
--

Adding [~Gengliang.Wang] FYI.

> java api spark streaming spark-avro ui
> ---
>
> Key: SPARK-26056
> URL: https://issues.apache.org/jira/browse/SPARK-26056
> Project: Spark
> Issue Type: Bug
> Components: SQL, Structured Streaming, Web UI
> Affects Versions: 2.3.2
> Reporter: wish
> Priority: Major
> Attachments: sql.jpg
>
> When I use the Java API of Spark Streaming to read from Kafka and save Avro (Databricks spark-avro dependency), the SQL tab in the Spark UI repeats again and again. With the Scala API there is no problem.
>
> A normal UI looks like this:
> * [Jobs|http://ebs-ali-beijing-datalake1:4044/jobs/]
> * [Stages|http://ebs-ali-beijing-datalake1:4044/stages/]
> * [Storage|http://ebs-ali-beijing-datalake1:4044/storage/]
> * [Environment|http://ebs-ali-beijing-datalake1:4044/environment/]
> * [Executors|http://ebs-ali-beijing-datalake1:4044/executors/]
> * [SQL|http://ebs-ali-beijing-datalake1:4044/SQL/]
> * [Streaming|http://ebs-ali-beijing-datalake1:4044/streaming/]
> but the Java API UI looks like this:
> Jobs Stages Storage Environment Executors SQL Streaming SQL SQL SQL SQL SQL SQL ..SQL
[jira] [Assigned] (SPARK-26098) Show associated SQL query in Job page
[ https://issues.apache.org/jira/browse/SPARK-26098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26098:
--

Assignee: (was: Apache Spark)

> Show associated SQL query in Job page
> -
>
> Key: SPARK-26098
> URL: https://issues.apache.org/jira/browse/SPARK-26098
> Project: Spark
> Issue Type: Improvement
> Components: Web UI
> Affects Versions: 3.0.0
> Reporter: Gengliang Wang
> Priority: Major
>
> For jobs associated with SQL queries, showing the SQL query on the Job detail page would make the context easier to understand.
[jira] [Commented] (SPARK-26098) Show associated SQL query in Job page
[ https://issues.apache.org/jira/browse/SPARK-26098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690453#comment-16690453 ]

Apache Spark commented on SPARK-26098:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/23068

> Show associated SQL query in Job page
> -
>
> Key: SPARK-26098
> URL: https://issues.apache.org/jira/browse/SPARK-26098
> Project: Spark
> Issue Type: Improvement
> Components: Web UI
> Affects Versions: 3.0.0
> Reporter: Gengliang Wang
> Priority: Major
>
> For jobs associated with SQL queries, showing the SQL query on the Job detail page would make the context easier to understand.
[jira] [Assigned] (SPARK-26098) Show associated SQL query in Job page
[ https://issues.apache.org/jira/browse/SPARK-26098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26098:
--

Assignee: Apache Spark

> Show associated SQL query in Job page
> -
>
> Key: SPARK-26098
> URL: https://issues.apache.org/jira/browse/SPARK-26098
> Project: Spark
> Issue Type: Improvement
> Components: Web UI
> Affects Versions: 3.0.0
> Reporter: Gengliang Wang
> Assignee: Apache Spark
> Priority: Major
>
> For jobs associated with SQL queries, showing the SQL query on the Job detail page would make the context easier to understand.
[jira] [Created] (SPARK-26098) Show associated SQL query in Job page
Gengliang Wang created SPARK-26098:
--

Summary: Show associated SQL query in Job page
Key: SPARK-26098
URL: https://issues.apache.org/jira/browse/SPARK-26098
Project: Spark
Issue Type: Improvement
Components: Web UI
Affects Versions: 3.0.0
Reporter: Gengliang Wang

For jobs associated with SQL queries, showing the SQL query on the Job detail page would make the context easier to understand.
[jira] [Commented] (SPARK-24255) Require Java 8 in SparkR description
[ https://issues.apache.org/jira/browse/SPARK-24255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690437#comment-16690437 ]

Kazuaki Ishizaki commented on SPARK-24255:
--

I do not know of an existing library that parses the output of {{java -version}}. You may also want to distinguish OpenJDK from Oracle JDK, as shown [here|https://stackoverflow.com/questions/36445502/bash-command-to-check-if-oracle-or-openjdk-java-version-is-installed-on-linux] and [there|https://qiita.com/mao172/items/42aa841280dc5a4e9924].

Output of OpenJDK 12-ea:
{code}
$ ../OpenJDK-12/java -version
openjdk version "12-ea" 2019-03-19
OpenJDK Runtime Environment (build 12-ea+20)
OpenJDK 64-Bit Server VM (build 12-ea+20, mixed mode, sharing)

$ ../OpenJDK-12/java Version
jave.specification.version=12
jave.version=12-ea
jave.version.split(".")[0]=12-ea
{code}

> Require Java 8 in SparkR description
> -
>
> Key: SPARK-24255
> URL: https://issues.apache.org/jira/browse/SPARK-24255
> Project: Spark
> Issue Type: Bug
> Components: SparkR
> Affects Versions: 2.3.0
> Reporter: Shivaram Venkataraman
> Assignee: Shivaram Venkataraman
> Priority: Major
> Fix For: 2.3.1, 2.4.0
>
> CRAN checks require that the Java version be set both in the package description and checked during runtime.
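The version parsing discussed above can be sketched in a few lines of Scala; `parseMajor` is a hypothetical helper, handling both the pre-9 "1.x" scheme and the newer "12-ea"/"11.0.1" style of the `java.version` system property:

```scala
// Hypothetical helper: extract the major Java version from a version string.
// "1.8.0_192" -> 8 (old "1.x" scheme), "12-ea" -> 12, "11.0.1" -> 11.
def parseMajor(version: String): Int =
  version.stripPrefix("1.").takeWhile(_.isDigit).toInt

Seq("1.8.0_192", "12-ea", "11.0.1").map(parseMajor)  // -> Seq(8, 12, 11)
```

Note this sketch only extracts the number; telling OpenJDK apart from Oracle JDK still requires inspecting the vendor lines of {{java -version}}, as the linked answers show.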