[jira] [Commented] (SPARK-26103) OutOfMemory error with large query plans

2018-11-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690815#comment-16690815
 ] 

Apache Spark commented on SPARK-26103:
--

User 'DaveDeCaprio' has created a pull request for this issue:
https://github.com/apache/spark/pull/23076

> OutOfMemory error with large query plans
> 
>
> Key: SPARK-26103
> URL: https://issues.apache.org/jira/browse/SPARK-26103
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2
> Environment: Amazon EMR 5.19
> 1 c5.4xlarge master instance
> 1 c5.4xlarge core instance
> 2 c5.4xlarge task instances
>Reporter: Dave DeCaprio
>Priority: Major
>
> Large query plans can cause OutOfMemory errors in the Spark driver.
> We are creating data frames that are not extremely large but contain lots of 
> nested joins.  These plans execute efficiently because of caching and 
> partitioning, but the text version of the query plans generated can be 
> hundreds of megabytes.  Running many of these in parallel causes our driver 
> process to fail.
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>   at java.util.Arrays.copyOfRange(Arrays.java:2694)
>   at java.lang.String.<init>(String.java:203)
>   at java.lang.StringBuilder.toString(StringBuilder.java:405)
>   at scala.StringContext.standardInterpolator(StringContext.scala:125)
>   at scala.StringContext.s(StringContext.scala:90)
>   at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:70)
>   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:52)
>   at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
>  
>  
> A similar error is reported in 
> [https://stackoverflow.com/questions/38307258/out-of-memory-error-when-writing-out-spark-dataframes-to-parquet-format]
>  
> Code exists to truncate the string if the number of output columns is larger 
> than 25, but not if the rest of the query plan is huge.
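To make the failure mode concrete, here is a minimal sketch (not from the report; the loop depth and table size are illustrative only) of how a chain of nested joins inflates the plan text that the driver builds via QueryExecution.toString:

{code:scala}
// Minimal sketch (spark-shell style, assuming an existing SparkSession `spark`):
// the data stays tiny, but each self-join doubles the size of the logical plan.
var df = spark.range(100).toDF("id")
for (_ <- 1 to 8) {
  df = df.join(df, Seq("id"))
}

// This is the string the driver materializes for every SQL execution;
// with deeply nested joins it can reach many megabytes.
val planText = df.queryExecution.toString
println(s"Plan string length: ${planText.length} characters")
{code}

Running many such queries concurrently multiplies the memory held by these plan strings on the driver, which is consistent with the OutOfMemoryError above.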






[jira] [Assigned] (SPARK-26103) OutOfMemory error with large query plans

2018-11-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26103:


Assignee: (was: Apache Spark)

> OutOfMemory error with large query plans
> 
>
> Key: SPARK-26103
> URL: https://issues.apache.org/jira/browse/SPARK-26103
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2
> Environment: Amazon EMR 5.19
> 1 c5.4xlarge master instance
> 1 c5.4xlarge core instance
> 2 c5.4xlarge task instances
>Reporter: Dave DeCaprio
>Priority: Major
>
> Large query plans can cause OutOfMemory errors in the Spark driver.
> We are creating data frames that are not extremely large but contain lots of 
> nested joins.  These plans execute efficiently because of caching and 
> partitioning, but the text version of the query plans generated can be 
> hundreds of megabytes.  Running many of these in parallel causes our driver 
> process to fail.
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>   at java.util.Arrays.copyOfRange(Arrays.java:2694)
>   at java.lang.String.<init>(String.java:203)
>   at java.lang.StringBuilder.toString(StringBuilder.java:405)
>   at scala.StringContext.standardInterpolator(StringContext.scala:125)
>   at scala.StringContext.s(StringContext.scala:90)
>   at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:70)
>   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:52)
>   at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
>  
>  
> A similar error is reported in 
> [https://stackoverflow.com/questions/38307258/out-of-memory-error-when-writing-out-spark-dataframes-to-parquet-format]
>  
> Code exists to truncate the string if the number of output columns is larger 
> than 25, but not if the rest of the query plan is huge.






[jira] [Assigned] (SPARK-26103) OutOfMemory error with large query plans

2018-11-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26103:


Assignee: Apache Spark

> OutOfMemory error with large query plans
> 
>
> Key: SPARK-26103
> URL: https://issues.apache.org/jira/browse/SPARK-26103
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2
> Environment: Amazon EMR 5.19
> 1 c5.4xlarge master instance
> 1 c5.4xlarge core instance
> 2 c5.4xlarge task instances
>Reporter: Dave DeCaprio
>Assignee: Apache Spark
>Priority: Major
>
> Large query plans can cause OutOfMemory errors in the Spark driver.
> We are creating data frames that are not extremely large but contain lots of 
> nested joins.  These plans execute efficiently because of caching and 
> partitioning, but the text version of the query plans generated can be 
> hundreds of megabytes.  Running many of these in parallel causes our driver 
> process to fail.
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>   at java.util.Arrays.copyOfRange(Arrays.java:2694)
>   at java.lang.String.<init>(String.java:203)
>   at java.lang.StringBuilder.toString(StringBuilder.java:405)
>   at scala.StringContext.standardInterpolator(StringContext.scala:125)
>   at scala.StringContext.s(StringContext.scala:90)
>   at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:70)
>   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:52)
>   at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
>  
>  
> A similar error is reported in 
> [https://stackoverflow.com/questions/38307258/out-of-memory-error-when-writing-out-spark-dataframes-to-parquet-format]
>  
> Code exists to truncate the string if the number of output columns is larger 
> than 25, but not if the rest of the query plan is huge.






[jira] [Commented] (SPARK-26084) AggregateExpression.references fails on unresolved expression trees

2018-11-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690753#comment-16690753
 ] 

Apache Spark commented on SPARK-26084:
--

User 'ssimeonov' has created a pull request for this issue:
https://github.com/apache/spark/pull/23075

> AggregateExpression.references fails on unresolved expression trees
> ---
>
> Key: SPARK-26084
> URL: https://issues.apache.org/jira/browse/SPARK-26084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Simeon Simeonov
>Priority: Major
>  Labels: aggregate, regression, sql
>
> [SPARK-18394|https://issues.apache.org/jira/browse/SPARK-18394] introduced a 
> stable ordering in {{AttributeSet.toSeq}} using expression IDs 
> ([PR-18959|https://github.com/apache/spark/pull/18959/files#diff-75576f0ec7f9d8b5032000245217d233R128])
>  without noticing that {{AggregateExpression.references}} used 
> {{AttributeSet.toSeq}} as a shortcut 
> ([link|https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala#L132]).
>  The net result is that {{AggregateExpression.references}} fails for 
> unresolved aggregate functions.
> {code:scala}
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression(
>   org.apache.spark.sql.catalyst.expressions.aggregate.Sum(('x + 'y).expr),
>   mode = org.apache.spark.sql.catalyst.expressions.aggregate.Complete,
>   isDistinct = false
> ).references
> {code}
> fails with
> {code:scala}
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> exprId on unresolved object, tree: 'y
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.exprId(unresolved.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.AttributeSet$$anonfun$toSeq$2.apply(AttributeSet.scala:128)
>   at 
> org.apache.spark.sql.catalyst.expressions.AttributeSet$$anonfun$toSeq$2.apply(AttributeSet.scala:128)
>   at scala.math.Ordering$$anon$5.compare(Ordering.scala:122)
>   at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
>   at java.util.TimSort.sort(TimSort.java:220)
>   at java.util.Arrays.sort(Arrays.java:1438)
>   at scala.collection.SeqLike$class.sorted(SeqLike.scala:648)
>   at scala.collection.AbstractSeq.sorted(Seq.scala:41)
>   at scala.collection.SeqLike$class.sortBy(SeqLike.scala:623)
>   at scala.collection.AbstractSeq.sortBy(Seq.scala:41)
>   at 
> org.apache.spark.sql.catalyst.expressions.AttributeSet.toSeq(AttributeSet.scala:128)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.references(interfaces.scala:201)
> {code}
> The solution is to avoid calling {{toSeq}} as ordering is not important in 
> {{references}} and simplify (and speed up) the implementation to something 
> like
> {code:scala}
> mode match {
>   case Partial | Complete => aggregateFunction.references
>   case PartialMerge | Final => 
> AttributeSet(aggregateFunction.aggBufferAttributes)
> }
> {code}






[jira] [Commented] (SPARK-26084) AggregateExpression.references fails on unresolved expression trees

2018-11-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690752#comment-16690752
 ] 

Apache Spark commented on SPARK-26084:
--

User 'ssimeonov' has created a pull request for this issue:
https://github.com/apache/spark/pull/23075

> AggregateExpression.references fails on unresolved expression trees
> ---
>
> Key: SPARK-26084
> URL: https://issues.apache.org/jira/browse/SPARK-26084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Simeon Simeonov
>Priority: Major
>  Labels: aggregate, regression, sql
>
> [SPARK-18394|https://issues.apache.org/jira/browse/SPARK-18394] introduced a 
> stable ordering in {{AttributeSet.toSeq}} using expression IDs 
> ([PR-18959|https://github.com/apache/spark/pull/18959/files#diff-75576f0ec7f9d8b5032000245217d233R128])
>  without noticing that {{AggregateExpression.references}} used 
> {{AttributeSet.toSeq}} as a shortcut 
> ([link|https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala#L132]).
>  The net result is that {{AggregateExpression.references}} fails for 
> unresolved aggregate functions.
> {code:scala}
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression(
>   org.apache.spark.sql.catalyst.expressions.aggregate.Sum(('x + 'y).expr),
>   mode = org.apache.spark.sql.catalyst.expressions.aggregate.Complete,
>   isDistinct = false
> ).references
> {code}
> fails with
> {code:scala}
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> exprId on unresolved object, tree: 'y
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.exprId(unresolved.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.AttributeSet$$anonfun$toSeq$2.apply(AttributeSet.scala:128)
>   at 
> org.apache.spark.sql.catalyst.expressions.AttributeSet$$anonfun$toSeq$2.apply(AttributeSet.scala:128)
>   at scala.math.Ordering$$anon$5.compare(Ordering.scala:122)
>   at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
>   at java.util.TimSort.sort(TimSort.java:220)
>   at java.util.Arrays.sort(Arrays.java:1438)
>   at scala.collection.SeqLike$class.sorted(SeqLike.scala:648)
>   at scala.collection.AbstractSeq.sorted(Seq.scala:41)
>   at scala.collection.SeqLike$class.sortBy(SeqLike.scala:623)
>   at scala.collection.AbstractSeq.sortBy(Seq.scala:41)
>   at 
> org.apache.spark.sql.catalyst.expressions.AttributeSet.toSeq(AttributeSet.scala:128)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.references(interfaces.scala:201)
> {code}
> The solution is to avoid calling {{toSeq}} as ordering is not important in 
> {{references}} and simplify (and speed up) the implementation to something 
> like
> {code:scala}
> mode match {
>   case Partial | Complete => aggregateFunction.references
>   case PartialMerge | Final => 
> AttributeSet(aggregateFunction.aggBufferAttributes)
> }
> {code}






[jira] [Assigned] (SPARK-26084) AggregateExpression.references fails on unresolved expression trees

2018-11-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26084:


Assignee: Apache Spark

> AggregateExpression.references fails on unresolved expression trees
> ---
>
> Key: SPARK-26084
> URL: https://issues.apache.org/jira/browse/SPARK-26084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Simeon Simeonov
>Assignee: Apache Spark
>Priority: Major
>  Labels: aggregate, regression, sql
>
> [SPARK-18394|https://issues.apache.org/jira/browse/SPARK-18394] introduced a 
> stable ordering in {{AttributeSet.toSeq}} using expression IDs 
> ([PR-18959|https://github.com/apache/spark/pull/18959/files#diff-75576f0ec7f9d8b5032000245217d233R128])
>  without noticing that {{AggregateExpression.references}} used 
> {{AttributeSet.toSeq}} as a shortcut 
> ([link|https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala#L132]).
>  The net result is that {{AggregateExpression.references}} fails for 
> unresolved aggregate functions.
> {code:scala}
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression(
>   org.apache.spark.sql.catalyst.expressions.aggregate.Sum(('x + 'y).expr),
>   mode = org.apache.spark.sql.catalyst.expressions.aggregate.Complete,
>   isDistinct = false
> ).references
> {code}
> fails with
> {code:scala}
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> exprId on unresolved object, tree: 'y
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.exprId(unresolved.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.AttributeSet$$anonfun$toSeq$2.apply(AttributeSet.scala:128)
>   at 
> org.apache.spark.sql.catalyst.expressions.AttributeSet$$anonfun$toSeq$2.apply(AttributeSet.scala:128)
>   at scala.math.Ordering$$anon$5.compare(Ordering.scala:122)
>   at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
>   at java.util.TimSort.sort(TimSort.java:220)
>   at java.util.Arrays.sort(Arrays.java:1438)
>   at scala.collection.SeqLike$class.sorted(SeqLike.scala:648)
>   at scala.collection.AbstractSeq.sorted(Seq.scala:41)
>   at scala.collection.SeqLike$class.sortBy(SeqLike.scala:623)
>   at scala.collection.AbstractSeq.sortBy(Seq.scala:41)
>   at 
> org.apache.spark.sql.catalyst.expressions.AttributeSet.toSeq(AttributeSet.scala:128)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.references(interfaces.scala:201)
> {code}
> The solution is to avoid calling {{toSeq}} as ordering is not important in 
> {{references}} and simplify (and speed up) the implementation to something 
> like
> {code:scala}
> mode match {
>   case Partial | Complete => aggregateFunction.references
>   case PartialMerge | Final => 
> AttributeSet(aggregateFunction.aggBufferAttributes)
> }
> {code}






[jira] [Assigned] (SPARK-26084) AggregateExpression.references fails on unresolved expression trees

2018-11-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26084:


Assignee: (was: Apache Spark)

> AggregateExpression.references fails on unresolved expression trees
> ---
>
> Key: SPARK-26084
> URL: https://issues.apache.org/jira/browse/SPARK-26084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Simeon Simeonov
>Priority: Major
>  Labels: aggregate, regression, sql
>
> [SPARK-18394|https://issues.apache.org/jira/browse/SPARK-18394] introduced a 
> stable ordering in {{AttributeSet.toSeq}} using expression IDs 
> ([PR-18959|https://github.com/apache/spark/pull/18959/files#diff-75576f0ec7f9d8b5032000245217d233R128])
>  without noticing that {{AggregateExpression.references}} used 
> {{AttributeSet.toSeq}} as a shortcut 
> ([link|https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala#L132]).
>  The net result is that {{AggregateExpression.references}} fails for 
> unresolved aggregate functions.
> {code:scala}
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression(
>   org.apache.spark.sql.catalyst.expressions.aggregate.Sum(('x + 'y).expr),
>   mode = org.apache.spark.sql.catalyst.expressions.aggregate.Complete,
>   isDistinct = false
> ).references
> {code}
> fails with
> {code:scala}
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> exprId on unresolved object, tree: 'y
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.exprId(unresolved.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.AttributeSet$$anonfun$toSeq$2.apply(AttributeSet.scala:128)
>   at 
> org.apache.spark.sql.catalyst.expressions.AttributeSet$$anonfun$toSeq$2.apply(AttributeSet.scala:128)
>   at scala.math.Ordering$$anon$5.compare(Ordering.scala:122)
>   at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
>   at java.util.TimSort.sort(TimSort.java:220)
>   at java.util.Arrays.sort(Arrays.java:1438)
>   at scala.collection.SeqLike$class.sorted(SeqLike.scala:648)
>   at scala.collection.AbstractSeq.sorted(Seq.scala:41)
>   at scala.collection.SeqLike$class.sortBy(SeqLike.scala:623)
>   at scala.collection.AbstractSeq.sortBy(Seq.scala:41)
>   at 
> org.apache.spark.sql.catalyst.expressions.AttributeSet.toSeq(AttributeSet.scala:128)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.references(interfaces.scala:201)
> {code}
> The solution is to avoid calling {{toSeq}} as ordering is not important in 
> {{references}} and simplify (and speed up) the implementation to something 
> like
> {code:scala}
> mode match {
>   case Partial | Complete => aggregateFunction.references
>   case PartialMerge | Final => 
> AttributeSet(aggregateFunction.aggBufferAttributes)
> }
> {code}






[jira] [Commented] (SPARK-26084) AggregateExpression.references fails on unresolved expression trees

2018-11-17 Thread Simeon Simeonov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690751#comment-16690751
 ] 

Simeon Simeonov commented on SPARK-26084:
-

[~hvanhovell] done [https://github.com/apache/spark/pull/23075]

> AggregateExpression.references fails on unresolved expression trees
> ---
>
> Key: SPARK-26084
> URL: https://issues.apache.org/jira/browse/SPARK-26084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Simeon Simeonov
>Priority: Major
>  Labels: aggregate, regression, sql
>
> [SPARK-18394|https://issues.apache.org/jira/browse/SPARK-18394] introduced a 
> stable ordering in {{AttributeSet.toSeq}} using expression IDs 
> ([PR-18959|https://github.com/apache/spark/pull/18959/files#diff-75576f0ec7f9d8b5032000245217d233R128])
>  without noticing that {{AggregateExpression.references}} used 
> {{AttributeSet.toSeq}} as a shortcut 
> ([link|https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala#L132]).
>  The net result is that {{AggregateExpression.references}} fails for 
> unresolved aggregate functions.
> {code:scala}
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression(
>   org.apache.spark.sql.catalyst.expressions.aggregate.Sum(('x + 'y).expr),
>   mode = org.apache.spark.sql.catalyst.expressions.aggregate.Complete,
>   isDistinct = false
> ).references
> {code}
> fails with
> {code:scala}
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> exprId on unresolved object, tree: 'y
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.exprId(unresolved.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.AttributeSet$$anonfun$toSeq$2.apply(AttributeSet.scala:128)
>   at 
> org.apache.spark.sql.catalyst.expressions.AttributeSet$$anonfun$toSeq$2.apply(AttributeSet.scala:128)
>   at scala.math.Ordering$$anon$5.compare(Ordering.scala:122)
>   at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
>   at java.util.TimSort.sort(TimSort.java:220)
>   at java.util.Arrays.sort(Arrays.java:1438)
>   at scala.collection.SeqLike$class.sorted(SeqLike.scala:648)
>   at scala.collection.AbstractSeq.sorted(Seq.scala:41)
>   at scala.collection.SeqLike$class.sortBy(SeqLike.scala:623)
>   at scala.collection.AbstractSeq.sortBy(Seq.scala:41)
>   at 
> org.apache.spark.sql.catalyst.expressions.AttributeSet.toSeq(AttributeSet.scala:128)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.references(interfaces.scala:201)
> {code}
> The solution is to avoid calling {{toSeq}} as ordering is not important in 
> {{references}} and simplify (and speed up) the implementation to something 
> like
> {code:scala}
> mode match {
>   case Partial | Complete => aggregateFunction.references
>   case PartialMerge | Final => 
> AttributeSet(aggregateFunction.aggBufferAttributes)
> }
> {code}






[jira] [Assigned] (SPARK-19798) Query returns stale results when tables are modified on other sessions

2018-11-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-19798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19798:


Assignee: (was: Apache Spark)

> Query returns stale results when tables are modified on other sessions
> --
>
> Key: SPARK-19798
> URL: https://issues.apache.org/jira/browse/SPARK-19798
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Giambattista
>Priority: Major
>
> I observed the problem on the master branch with the thrift server in 
> multisession mode (the default), but I was also able to reproduce it with 
> spark-shell (see the reproduction sequence below).
> I observed cases where changes made in a session (table insert, table 
> renaming) are not visible to other derived sessions (created with 
> session.newSession).
> The problem seems due to the fact that each session has its own 
> tableRelationCache and it does not get refreshed.
> IMO tableRelationCache should be shared in sharedState, maybe in the 
> cacheManager so that refresh of caches for data that is not session-specific 
> such as temporary tables gets centralized.  
> --- Spark shell script
> val spark2 = spark.newSession
> spark.sql("CREATE TABLE test (a int) using parquet")
> spark2.sql("select * from test").show // OK returns empty
> spark.sql("select * from test").show // OK returns empty
> spark.sql("insert into TABLE test values 1,2,3")
> spark2.sql("select * from test").show // ERROR returns empty
> spark.sql("select * from test").show // OK returns 3,2,1
> spark.sql("create table test2 (a int) using parquet")
> spark.sql("insert into TABLE test2 values 4,5,6")
> spark2.sql("select * from test2").show // OK returns 6,4,5
> spark.sql("select * from test2").show // OK returns 6,4,5
> spark.sql("alter table test rename to test3")
> spark.sql("alter table test2 rename to test")
> spark.sql("alter table test3 rename to test2")
> spark2.sql("select * from test").show // ERROR returns empty
> spark.sql("select * from test").show // OK returns 6,4,5
> spark2.sql("select * from test2").show // ERROR throws 
> java.io.FileNotFoundException
> spark.sql("select * from test2").show // OK returns 3,1,2






[jira] [Assigned] (SPARK-19798) Query returns stale results when tables are modified on other sessions

2018-11-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-19798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19798:


Assignee: Apache Spark

> Query returns stale results when tables are modified on other sessions
> --
>
> Key: SPARK-19798
> URL: https://issues.apache.org/jira/browse/SPARK-19798
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Giambattista
>Assignee: Apache Spark
>Priority: Major
>
> I observed the problem on the master branch with the thrift server in 
> multisession mode (the default), but I was also able to reproduce it with 
> spark-shell (see the reproduction sequence below).
> I observed cases where changes made in a session (table insert, table 
> renaming) are not visible to other derived sessions (created with 
> session.newSession).
> The problem seems due to the fact that each session has its own 
> tableRelationCache and it does not get refreshed.
> IMO tableRelationCache should be shared in sharedState, maybe in the 
> cacheManager so that refresh of caches for data that is not session-specific 
> such as temporary tables gets centralized.  
> --- Spark shell script
> val spark2 = spark.newSession
> spark.sql("CREATE TABLE test (a int) using parquet")
> spark2.sql("select * from test").show // OK returns empty
> spark.sql("select * from test").show // OK returns empty
> spark.sql("insert into TABLE test values 1,2,3")
> spark2.sql("select * from test").show // ERROR returns empty
> spark.sql("select * from test").show // OK returns 3,2,1
> spark.sql("create table test2 (a int) using parquet")
> spark.sql("insert into TABLE test2 values 4,5,6")
> spark2.sql("select * from test2").show // OK returns 6,4,5
> spark.sql("select * from test2").show // OK returns 6,4,5
> spark.sql("alter table test rename to test3")
> spark.sql("alter table test2 rename to test")
> spark.sql("alter table test3 rename to test2")
> spark2.sql("select * from test").show // ERROR returns empty
> spark.sql("select * from test").show // OK returns 6,4,5
> spark2.sql("select * from test2").show // ERROR throws 
> java.io.FileNotFoundException
> spark.sql("select * from test2").show // OK returns 3,1,2






[jira] [Commented] (SPARK-19798) Query returns stale results when tables are modified on other sessions

2018-11-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690731#comment-16690731
 ] 

Apache Spark commented on SPARK-19798:
--

User 'gbloisi' has created a pull request for this issue:
https://github.com/apache/spark/pull/23074

> Query returns stale results when tables are modified on other sessions
> --
>
> Key: SPARK-19798
> URL: https://issues.apache.org/jira/browse/SPARK-19798
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Giambattista
>Priority: Major
>
> I observed the problem on the master branch with the thrift server in 
> multisession mode (the default), but I was also able to reproduce it with 
> spark-shell (see the reproduction sequence below).
> I observed cases where changes made in a session (table insert, table 
> renaming) are not visible to other derived sessions (created with 
> session.newSession).
> The problem seems due to the fact that each session has its own 
> tableRelationCache and it does not get refreshed.
> IMO tableRelationCache should be shared in sharedState, maybe in the 
> cacheManager so that refresh of caches for data that is not session-specific 
> such as temporary tables gets centralized.  
> --- Spark shell script
> val spark2 = spark.newSession
> spark.sql("CREATE TABLE test (a int) using parquet")
> spark2.sql("select * from test").show // OK returns empty
> spark.sql("select * from test").show // OK returns empty
> spark.sql("insert into TABLE test values 1,2,3")
> spark2.sql("select * from test").show // ERROR returns empty
> spark.sql("select * from test").show // OK returns 3,2,1
> spark.sql("create table test2 (a int) using parquet")
> spark.sql("insert into TABLE test2 values 4,5,6")
> spark2.sql("select * from test2").show // OK returns 6,4,5
> spark.sql("select * from test2").show // OK returns 6,4,5
> spark.sql("alter table test rename to test3")
> spark.sql("alter table test2 rename to test")
> spark.sql("alter table test3 rename to test2")
> spark2.sql("select * from test").show // ERROR returns empty
> spark.sql("select * from test").show // OK returns 6,4,5
> spark2.sql("select * from test2").show // ERROR throws 
> java.io.FileNotFoundException
> spark.sql("select * from test2").show // OK returns 3,1,2






[jira] [Assigned] (SPARK-26104) make pci devices visible to task scheduler

2018-11-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26104:


Assignee: Apache Spark

> make pci devices visible to task scheduler
> --
>
> Key: SPARK-26104
> URL: https://issues.apache.org/jira/browse/SPARK-26104
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Chen Qin
>Assignee: Apache Spark
>Priority: Major
>  Labels: Hydrogen
>
> Spark task scheduling has long considered CPU only: depending on how many 
> vcores each executor has at a given moment, tasks are scheduled whenever 
> enough vcores become available.
> Moving to deep learning use cases, the fundamental computation and processing 
> unit switches from CPU to GPU/FPGA + CPU, which moves data in and out of GPU 
> memory.
> Deep learning frameworks built on top of GPU fleets require pinning a task to 
> a fixed number of GPUs, which Spark does not support yet. E.g. a Horovod task 
> requires 2 GPUs running uninterrupted until it finishes, regardless of CPU 
> availability in the executor. In Uber's Peloton executor scheduler, the 
> number of cores available can be more than what the user asked for, because 
> the executor may get over-provisioned.
> Without definitive ownership of PCI devices (/gpu1, /gpu2), such workloads 
> may run into unexpected states.
>  
> Related JIRAs for allocating executor containers with GPU resources, which 
> serve as the bootstrap phase:
> SPARK-19320 (Mesos), SPARK-24491 (K8s), SPARK-20327 (YARN)
> The existing SPIP "Accelerator Aware Task Scheduling For Spark" (SPARK-24615) 
> is compatible with this design; the approach here is a bit different in that 
> it tracks utilization of PCI devices, so a customized task scheduler could 
> either fall back to a "best to have" approach or implement the "must have" 
> approach stated above.






[jira] [Commented] (SPARK-26104) make pci devices visible to task scheduler

2018-11-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690726#comment-16690726
 ] 

Apache Spark commented on SPARK-26104:
--

User 'chenqin' has created a pull request for this issue:
https://github.com/apache/spark/pull/23073

> make pci devices visible to task scheduler
> --
>
> Key: SPARK-26104
> URL: https://issues.apache.org/jira/browse/SPARK-26104
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Chen Qin
>Priority: Major
>  Labels: Hydrogen
>
> Spark task scheduling has long considered CPU only: depending on how many 
> vcores each executor has at a given moment, tasks are scheduled whenever 
> enough vcores become available.
> Moving to deep learning use cases, the fundamental computation and processing 
> unit switches from CPU to GPU/FPGA + CPU, which moves data in and out of GPU 
> memory.
> Deep learning frameworks built on top of GPU fleets require pinning a task to 
> a fixed number of GPUs, which Spark does not support yet. E.g. a Horovod task 
> requires 2 GPUs running uninterrupted until it finishes, regardless of CPU 
> availability in the executor. In Uber's Peloton executor scheduler, the 
> number of cores available can be more than what the user asked for, because 
> the executor may get over-provisioned.
> Without definitive ownership of PCI devices (/gpu1, /gpu2), such workloads 
> may run into unexpected states.
>  
> Related JIRAs for allocating executor containers with GPU resources, which 
> serve as the bootstrap phase:
> SPARK-19320 (Mesos), SPARK-24491 (K8s), SPARK-20327 (YARN)
> The existing SPIP "Accelerator Aware Task Scheduling For Spark" (SPARK-24615) 
> is compatible with this design; the approach here is a bit different in that 
> it tracks utilization of PCI devices, so a customized task scheduler could 
> either fall back to a "best to have" approach or implement the "must have" 
> approach stated above.






[jira] [Assigned] (SPARK-26104) make pci devices visible to task scheduler

2018-11-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26104:


Assignee: (was: Apache Spark)

> make pci devices visible to task scheduler
> --
>
> Key: SPARK-26104
> URL: https://issues.apache.org/jira/browse/SPARK-26104
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Chen Qin
>Priority: Major
>  Labels: Hydrogen
>
> Spark task scheduling has long considered CPU only: depending on how many 
> vcores each executor has at a given moment, tasks are scheduled whenever 
> enough vcores become available.
> Moving to deep learning use cases, the fundamental computation and processing 
> unit switches from CPU to GPU/FPGA + CPU, which moves data in and out of GPU 
> memory.
> Deep learning frameworks built on top of GPU fleets require pinning a task to 
> a fixed number of GPUs, which Spark does not support yet. E.g. a Horovod task 
> requires 2 GPUs running uninterrupted until it finishes, regardless of CPU 
> availability in the executor. In Uber's Peloton executor scheduler, the 
> number of cores available can be more than what the user asked for, because 
> the executor may get over-provisioned.
> Without definitive ownership of PCI devices (/gpu1, /gpu2), such workloads 
> may run into unexpected states.
>  
> Related JIRAs for allocating executor containers with GPU resources, which 
> serve as the bootstrap phase:
> SPARK-19320 (Mesos), SPARK-24491 (K8s), SPARK-20327 (YARN)
> The existing SPIP "Accelerator Aware Task Scheduling For Spark" (SPARK-24615) 
> is compatible with this design; the approach here is a bit different in that 
> it tracks utilization of PCI devices, so a customized task scheduler could 
> either fall back to a "best to have" approach or implement the "must have" 
> approach stated above.






[jira] [Commented] (SPARK-26104) make pci devices visible to task scheduler

2018-11-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690725#comment-16690725
 ] 

Apache Spark commented on SPARK-26104:
--

User 'chenqin' has created a pull request for this issue:
https://github.com/apache/spark/pull/23073

> make pci devices visible to task scheduler
> --
>
> Key: SPARK-26104
> URL: https://issues.apache.org/jira/browse/SPARK-26104
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Chen Qin
>Priority: Major
>  Labels: Hydrogen
>
> Spark task scheduling has long considered CPU only: depending on how many 
> vcores each executor has at a given moment, tasks are scheduled whenever 
> enough vcores become available.
> Moving to deep learning use cases, the fundamental computation and processing 
> unit switches from CPU to GPU/FPGA + CPU, which moves data in and out of GPU 
> memory.
> Deep learning frameworks built on top of GPU fleets require pinning a task to 
> a fixed number of GPUs, which Spark does not support yet. E.g. a Horovod task 
> requires 2 GPUs running uninterrupted until it finishes, regardless of CPU 
> availability in the executor. In Uber's Peloton executor scheduler, the 
> number of cores available can be more than what the user asked for, because 
> the executor may get over-provisioned.
> Without definitive ownership of PCI devices (/gpu1, /gpu2), such workloads 
> may run into unexpected states.
>  
> Related JIRAs for allocating executor containers with GPU resources, which 
> serve as the bootstrap phase:
> SPARK-19320 (Mesos), SPARK-24491 (K8s), SPARK-20327 (YARN)
> The existing SPIP "Accelerator Aware Task Scheduling For Spark" (SPARK-24615) 
> is compatible with this design; the approach here is a bit different in that 
> it tracks utilization of PCI devices, so a customized task scheduler could 
> either fall back to a "best to have" approach or implement the "must have" 
> approach stated above.






[jira] [Updated] (SPARK-26104) make pci devices visible to task scheduler

2018-11-17 Thread Chen Qin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Qin updated SPARK-26104:
-
Labels: Hydrogen  (was: )

> make pci devices visible to task scheduler
> --
>
> Key: SPARK-26104
> URL: https://issues.apache.org/jira/browse/SPARK-26104
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Chen Qin
>Priority: Major
>  Labels: Hydrogen
>
> Spark task scheduling has long considered CPU only: depending on how many 
> vcores each executor has at a given moment, tasks are scheduled whenever 
> enough vcores become available.
> Moving to deep learning use cases, the fundamental computation and processing 
> unit switches from CPU to GPU/FPGA + CPU, which moves data in and out of GPU 
> memory.
> Deep learning frameworks built on top of GPU fleets require pinning a task to 
> a fixed number of GPUs, which Spark does not support yet. E.g. a Horovod task 
> requires 2 GPUs running uninterrupted until it finishes, regardless of CPU 
> availability in the executor. In Uber's Peloton executor scheduler, the 
> number of cores available can be more than what the user asked for, because 
> the executor may get over-provisioned.
> Without definitive ownership of PCI devices (/gpu1, /gpu2), such workloads 
> may run into unexpected states.
>  
> Related JIRAs for allocating executor containers with GPU resources, which 
> serve as the bootstrap phase:
> SPARK-19320 (Mesos), SPARK-24491 (K8s), SPARK-20327 (YARN)
> The existing SPIP "Accelerator Aware Task Scheduling For Spark" (SPARK-24615) 
> is compatible with this design; the approach here is a bit different in that 
> it tracks utilization of PCI devices, so a customized task scheduler could 
> either fall back to a "best to have" approach or implement the "must have" 
> approach stated above.






[jira] [Created] (SPARK-26104) make pci devices visible to task scheduler

2018-11-17 Thread Chen Qin (JIRA)
Chen Qin created SPARK-26104:


 Summary: make pci devices visible to task scheduler
 Key: SPARK-26104
 URL: https://issues.apache.org/jira/browse/SPARK-26104
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Chen Qin


Spark task scheduling has long considered CPU only: depending on how many 
vcores each executor has at a given moment, tasks are scheduled whenever 
enough vcores become available.
Moving to deep learning use cases, the fundamental computation and processing 
unit switches from CPU to GPU/FPGA + CPU, which moves data in and out of GPU 
memory.

Deep learning frameworks built on top of GPU fleets require pinning a task to 
a fixed number of GPUs, which Spark does not support yet. E.g. a Horovod task 
requires 2 GPUs running uninterrupted until it finishes, regardless of CPU 
availability in the executor. In Uber's Peloton executor scheduler, the number 
of cores available can be more than what the user asked for, because the 
executor may get over-provisioned.

Without definitive ownership of PCI devices (/gpu1, /gpu2), such workloads may 
run into unexpected states.

Related JIRAs for allocating executor containers with GPU resources, which 
serve as the bootstrap phase:
SPARK-19320 (Mesos), SPARK-24491 (K8s), SPARK-20327 (YARN)

The existing SPIP "Accelerator Aware Task Scheduling For Spark" (SPARK-24615) 
is compatible with this design; the approach here is a bit different in that 
it tracks utilization of PCI devices, so a customized task scheduler could 
either fall back to a "best to have" approach or implement the "must have" 
approach stated above.






[jira] [Commented] (SPARK-19827) spark.ml R API for PIC

2018-11-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690709#comment-16690709
 ] 

Apache Spark commented on SPARK-19827:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/23072

> spark.ml R API for PIC
> --
>
> Key: SPARK-19827
> URL: https://issues.apache.org/jira/browse/SPARK-19827
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Priority: Major
>







[jira] [Commented] (SPARK-19827) spark.ml R API for PIC

2018-11-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690708#comment-16690708
 ] 

Apache Spark commented on SPARK-19827:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/23072

> spark.ml R API for PIC
> --
>
> Key: SPARK-19827
> URL: https://issues.apache.org/jira/browse/SPARK-19827
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Priority: Major
>







[jira] [Assigned] (SPARK-19827) spark.ml R API for PIC

2018-11-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-19827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19827:


Assignee: (was: Apache Spark)

> spark.ml R API for PIC
> --
>
> Key: SPARK-19827
> URL: https://issues.apache.org/jira/browse/SPARK-19827
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Priority: Major
>







[jira] [Assigned] (SPARK-19827) spark.ml R API for PIC

2018-11-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-19827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19827:


Assignee: Apache Spark

> spark.ml R API for PIC
> --
>
> Key: SPARK-19827
> URL: https://issues.apache.org/jira/browse/SPARK-19827
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Created] (SPARK-26103) OutOfMemory error with large query plans

2018-11-17 Thread Dave DeCaprio (JIRA)
Dave DeCaprio created SPARK-26103:
-

 Summary: OutOfMemory error with large query plans
 Key: SPARK-26103
 URL: https://issues.apache.org/jira/browse/SPARK-26103
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.2, 2.3.1, 2.3.0
 Environment: Amazon EMR 5.19

1 c5.4xlarge master instance

1 c5.4xlarge core instance

2 c5.4xlarge task instances
Reporter: Dave DeCaprio


Large query plans can cause OutOfMemory errors in the Spark driver.

We are creating data frames that are not extremely large but contain lots of 
nested joins.  These plans execute efficiently because of caching and 
partitioning, but the text version of the query plans generated can be hundreds 
of megabytes.  Running many of these in parallel causes our driver process to 
fail.

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
  at java.util.Arrays.copyOfRange(Arrays.java:2694)
  at java.lang.String.<init>(String.java:203)
  at java.lang.StringBuilder.toString(StringBuilder.java:405)
  at scala.StringContext.standardInterpolator(StringContext.scala:125)
  at scala.StringContext.s(StringContext.scala:90)
  at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:70)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:52)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
 

 

A similar error is reported in 
[https://stackoverflow.com/questions/38307258/out-of-memory-error-when-writing-out-spark-dataframes-to-parquet-format]

 

Code exists to truncate the string if the number of output columns is larger 
than 25, but not if the rest of the query plan is huge.






[jira] [Assigned] (SPARK-26102) Common CSV/JSON functions tests

2018-11-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26102:


Assignee: Apache Spark

> Common CSV/JSON functions tests
> ---
>
> Key: SPARK-26102
> URL: https://issues.apache.org/jira/browse/SPARK-26102
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> *CsvFunctionsSuite* and *JsonFunctionsSuite* have similar tests. The common 
> tests should be extracted to a shared place.






[jira] [Assigned] (SPARK-26102) Common CSV/JSON functions tests

2018-11-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26102:


Assignee: (was: Apache Spark)

> Common CSV/JSON functions tests
> ---
>
> Key: SPARK-26102
> URL: https://issues.apache.org/jira/browse/SPARK-26102
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> *CsvFunctionsSuite* and *JsonFunctionsSuite* have similar tests. The common 
> tests should be extracted to a shared place.






[jira] [Commented] (SPARK-26102) Common CSV/JSON functions tests

2018-11-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690653#comment-16690653
 ] 

Apache Spark commented on SPARK-26102:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/23071

> Common CSV/JSON functions tests
> ---
>
> Key: SPARK-26102
> URL: https://issues.apache.org/jira/browse/SPARK-26102
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> *CsvFunctionsSuite* and *JsonFunctionsSuite* have similar tests. The common 
> tests should be extracted to a shared place.






[jira] [Created] (SPARK-26102) Common CSV/JSON functions tests

2018-11-17 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-26102:
--

 Summary: Common CSV/JSON functions tests
 Key: SPARK-26102
 URL: https://issues.apache.org/jira/browse/SPARK-26102
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


*CsvFunctionsSuite* and *JsonFunctionsSuite* have similar tests. The common 
tests should be extracted to a shared place.






[jira] [Updated] (SPARK-26101) Spark Pipe() executes the external app by yarn username not the current username

2018-11-17 Thread Maziyar PANAHI (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maziyar PANAHI updated SPARK-26101:
---
Summary: Spark Pipe() executes the external app by yarn username not the 
current username  (was: Spark Pipe() executes the external app by yarn user not 
the real user)

> Spark Pipe() executes the external app by yarn username not the current 
> username
> 
>
> Key: SPARK-26101
> URL: https://issues.apache.org/jira/browse/SPARK-26101
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Maziyar PANAHI
>Priority: Major
>
> Hello,
> I am using *Spark 2.3.0.cloudera3* on a Cloudera cluster. When I start my 
> Spark session (Zeppelin, shell, or spark-submit), my real username is 
> impersonated successfully. That allows YARN to use the right queue based on 
> the username, and HDFS knows the permissions. (All of this works without any 
> problem, meaning the cluster has been set up and configured for user 
> impersonation.)
> Example (running Spark by user panahi with YARN as a master):
> {code:java}
>  
> 18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls to: panahi
> 18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls to: panahi
> 18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls groups to:
> 18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls groups to:
> 18/11/17 13:55:47 INFO spark.SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(mpanahi); groups 
> with view permissions: Set();
> users with modify permissions: Set(panahi); groups with modify permissions: 
> Set()
> ...
> 18/11/17 13:55:52 INFO yarn.Client:
> client token: N/A
> diagnostics: N/A
> ApplicationMaster host: N/A
> ApplicationMaster RPC port: -1
> queue: root.multivac
> start time: 1542459353040
> final status: UNDEFINED
> tracking URL: 
> http://hadoop-master-1:8088/proxy/application_1542456252041_0006/
> user: panahi
> {code}
>  
> However, when I use *Spark RDD Pipe()*, the external command is executed as 
> the `*yarn*` user. This makes it impossible to use an external app, such as a 
> `c/c++` application, that needs read/write access to HDFS, because the 
> `*yarn*` user does not have permissions on the user's directory. (It also 
> raises other security and resource management issues, since every external 
> app runs under the yarn username.)
> *How to produce this issue:*
> {code:java}
> val test = sc.parallelize(Seq("test user")).repartition(1)
> val piped = test.pipe(Seq("whoami"))
> val c = piped.collect()
> result:
> test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at repartition at <console>:37
> piped: org.apache.spark.rdd.RDD[String] = PipedRDD[27] at pipe at <console>:37
> c: Array[String] = Array(yarn)
> {code}
>  
> I believe that since Spark is the actor that invokes this execution inside 
> the YARN cluster, Spark needs to respect the actual/current username. Or 
> maybe there is another config for impersonation between Spark and YARN in 
> this situation, but I haven't found any.
>  
> Many thanks.






[jira] [Commented] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-11-17 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690616#comment-16690616
 ] 

Maxim Gekk commented on SPARK-23410:


[~x1q1j1] Encodings different from UTF-8 (except UTF-16 and UTF-32 with BOMs) 
are already supported.

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
> Attachments: utf16WithBOM.json
>
>
> Currently the JSON parser is forced to read JSON files in UTF-8. This 
> behavior breaks backward compatibility with Spark 2.2.1 and previous 
> versions, which could read JSON files in UTF-16, UTF-32, and other encodings 
> thanks to the auto-detection mechanism of the Jackson library. We need to 
> give users back the ability to read JSON files in a specified charset and/or 
> detect the charset automatically as before.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26006) mllib Prefixspan

2018-11-17 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-26006:
-

Assignee: shahid

> mllib Prefixspan
> 
>
> Key: SPARK-26006
> URL: https://issues.apache.org/jira/browse/SPARK-26006
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.3.0
> Environment: Unit test running on windows
>Reporter: idan Levi
>Assignee: shahid
>Priority: Major
> Fix For: 3.0.0
>
>
> In MLlib's PrefixSpan, the run method leaves a cached RDD in the cache:
> val dataInternalRepr = toDatabaseInternalRepr(data, itemToInt)
>  .persist(StorageLevel.MEMORY_AND_DISK)
> After run completes, the RDD remains in the cache.
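
In general terms, the expected fix pattern looks like the sketch below (an 
illustration, not the actual patch): unpersist the intermediate RDD once the 
result that needed it has been materialized.
{code:java}
// Illustration of the fix pattern (not the actual patch): release a cached
// intermediate RDD once the result that needed it has been materialized.
import org.apache.spark.storage.StorageLevel

val cached = sc.parallelize(1 to 1000).persist(StorageLevel.MEMORY_AND_DISK)
val result = cached.count()   // stands in for the work PrefixSpan.run() does
cached.unpersist()            // without this, the blocks stay in the cache
println(s"result = $result, storage level now = ${cached.getStorageLevel}")
{code}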



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25959) Difference in featureImportances results on computed vs saved models

2018-11-17 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25959.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22986
[https://github.com/apache/spark/pull/22986]

> Difference in featureImportances results on computed vs saved models
> 
>
> Key: SPARK-25959
> URL: https://issues.apache.org/jira/browse/SPARK-25959
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Suraj Nayak
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 3.0.0
>
>
> I tried to implement GBT and found that the feature importances computed when 
> the model was fit differ from those of the same model after it is saved to 
> storage and loaded back. 
>  
> I also found that once the persisted model is loaded, saved again, and 
> reloaded, the feature importances remain the same. 
>  
> Not sure if it's a bug when storing and reading the model the first time, or 
> whether I am missing some parameter that needs to be set before saving the 
> model (so the model picks up some defaults, causing the feature importances to 
> change).
>  
> *Below is the test code:*
> val testDF = Seq(
> (1, 3, 2, 1, 1),
> (3, 2, 1, 2, 0),
> (2, 2, 1, 1, 0),
> (3, 4, 2, 2, 0),
> (2, 2, 1, 3, 1)
> ).toDF("a", "b", "c", "d", "e")
> val featureColumns = testDF.columns.filter(_ != "e")
> // Assemble the features into a vector
> val assembler = new 
> VectorAssembler().setInputCols(featureColumns).setOutputCol("features")
> // Transform the data to get the feature data set
> val featureDF = assembler.transform(testDF)
> // Train a GBT model.
> val gbt = new GBTClassifier()
> .setLabelCol("e")
> .setFeaturesCol("features")
> .setMaxDepth(2)
> .setMaxBins(5)
> .setMaxIter(10)
> .setSeed(10)
> .fit(featureDF)
> gbt.transform(featureDF).show(false)
> // Write out the model
> featureColumns.zip(gbt.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /* Prints
> (d,0.5931875075767403)
> (a,0.3747184548362353)
> (b,0.03209403758702444)
> (c,0.0)
> */
> gbt.write.overwrite().save("file:///tmp/test123")
> println("Reading model again")
> val gbtload = GBTClassificationModel.load("file:///tmp/test123")
> featureColumns.zip(gbtload.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /*
> Prints
> (d,0.6455841215290767)
> (a,0.3316126797964181)
> (b,0.022803198674505094)
> (c,0.0)
> */
> gbtload.write.overwrite().save("file:///tmp/test123_rewrite")
> val gbtload2 = GBTClassificationModel.load("file:///tmp/test123_rewrite")
> featureColumns.zip(gbtload2.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /* prints
> (d,0.6455841215290767)
> (a,0.3316126797964181)
> (b,0.022803198674505094)
> (c,0.0)
> */



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25959) Difference in featureImportances results on computed vs saved models

2018-11-17 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-25959:
-

Assignee: Marco Gaido

> Difference in featureImportances results on computed vs saved models
> 
>
> Key: SPARK-25959
> URL: https://issues.apache.org/jira/browse/SPARK-25959
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Suraj Nayak
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 3.0.0
>
>
> I tried to implement GBT and found that the feature importances computed when 
> the model was fit differ from those of the same model after it is saved to 
> storage and loaded back. 
>  
> I also found that once the persisted model is loaded, saved again, and 
> reloaded, the feature importances remain the same. 
>  
> Not sure if it's a bug when storing and reading the model the first time, or 
> whether I am missing some parameter that needs to be set before saving the 
> model (so the model picks up some defaults, causing the feature importances to 
> change).
>  
> *Below is the test code:*
> val testDF = Seq(
> (1, 3, 2, 1, 1),
> (3, 2, 1, 2, 0),
> (2, 2, 1, 1, 0),
> (3, 4, 2, 2, 0),
> (2, 2, 1, 3, 1)
> ).toDF("a", "b", "c", "d", "e")
> val featureColumns = testDF.columns.filter(_ != "e")
> // Assemble the features into a vector
> val assembler = new 
> VectorAssembler().setInputCols(featureColumns).setOutputCol("features")
> // Transform the data to get the feature data set
> val featureDF = assembler.transform(testDF)
> // Train a GBT model.
> val gbt = new GBTClassifier()
> .setLabelCol("e")
> .setFeaturesCol("features")
> .setMaxDepth(2)
> .setMaxBins(5)
> .setMaxIter(10)
> .setSeed(10)
> .fit(featureDF)
> gbt.transform(featureDF).show(false)
> // Write out the model
> featureColumns.zip(gbt.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /* Prints
> (d,0.5931875075767403)
> (a,0.3747184548362353)
> (b,0.03209403758702444)
> (c,0.0)
> */
> gbt.write.overwrite().save("file:///tmp/test123")
> println("Reading model again")
> val gbtload = GBTClassificationModel.load("file:///tmp/test123")
> featureColumns.zip(gbtload.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /*
> Prints
> (d,0.6455841215290767)
> (a,0.3316126797964181)
> (b,0.022803198674505094)
> (c,0.0)
> */
> gbtload.write.overwrite().save("file:///tmp/test123_rewrite")
> val gbtload2 = GBTClassificationModel.load("file:///tmp/test123_rewrite")
> featureColumns.zip(gbtload2.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /* prints
> (d,0.6455841215290767)
> (a,0.3316126797964181)
> (b,0.022803198674505094)
> (c,0.0)
> */



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26006) mllib Prefixspan

2018-11-17 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26006:
--
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

> mllib Prefixspan
> 
>
> Key: SPARK-26006
> URL: https://issues.apache.org/jira/browse/SPARK-26006
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.3.0
> Environment: Unit test running on windows
>Reporter: idan Levi
>Assignee: shahid
>Priority: Minor
> Fix For: 3.0.0
>
>
> In MLlib's PrefixSpan, the run method leaves a cached RDD in the cache:
> val dataInternalRepr = toDatabaseInternalRepr(data, itemToInt)
>  .persist(StorageLevel.MEMORY_AND_DISK)
> After run completes, the RDD remains in the cache.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26006) mllib Prefixspan

2018-11-17 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26006.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23016
[https://github.com/apache/spark/pull/23016]

> mllib Prefixspan
> 
>
> Key: SPARK-26006
> URL: https://issues.apache.org/jira/browse/SPARK-26006
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.3.0
> Environment: Unit test running on windows
>Reporter: idan Levi
>Assignee: shahid
>Priority: Major
> Fix For: 3.0.0
>
>
> In MLlib's PrefixSpan, the run method leaves a cached RDD in the cache:
> val dataInternalRepr = toDatabaseInternalRepr(data, itemToInt)
>  .persist(StorageLevel.MEMORY_AND_DISK)
> After run completes, the RDD remains in the cache.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-11-17 Thread xuqianjin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690601#comment-16690601
 ] 

xuqianjin commented on SPARK-23410:
---

I want to ask whether this bug is still being worked on; I would like to try to fix it.

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
> Attachments: utf16WithBOM.json
>
>
> Currently the JSON parser is forced to read JSON files in UTF-8. Such 
> behavior breaks backward compatibility with Spark 2.2.1 and previous versions, 
> which could read JSON files in UTF-16, UTF-32, and other encodings thanks to 
> the auto-detection mechanism of the Jackson library. We need to give users 
> back the ability to read JSON files in a specified charset and/or to detect 
> the charset automatically, as before.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14492) Spark SQL 1.6.0 does not work with external Hive metastore version lower than 1.2.0; its not backwards compatible with earlier version

2018-11-17 Thread Sunil Rangwani (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690595#comment-16690595
 ] 

Sunil Rangwani commented on SPARK-14492:


[~srowen] This is not that at all. It is not about building against varying 
versions of Hive. Please refer to the discussion above.

The {{java.lang.NoSuchFieldError}} is a runtime error.

Either the documentation should be updated to say that the minimum supported 
Hive version is 1.2.x, or this bug should be fixed so that different versions of 
Hive are supported, as the documentation states.
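
For context, the configuration the documentation describes looks roughly like the 
sketch below (Spark 2.x-style session API for brevity; values are illustrative). 
Per this issue, pointing the version below 1.2.0 still fails at runtime with the 
NoSuchFieldError above.
{code:java}
// Sketch of the documented knobs (illustrative values; Spark 2.x session API).
// spark.sql.hive.metastore.version picks the metastore client version and
// spark.sql.hive.metastore.jars tells Spark where to get the matching jars.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("metastore-version-demo")
  .config("spark.sql.hive.metastore.version", "0.14.0")  // hypothetical pre-1.2.0 metastore
  .config("spark.sql.hive.metastore.jars", "maven")
  .enableHiveSupport()
  .getOrCreate()
{code}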



> Spark SQL 1.6.0 does not work with external Hive metastore version lower than 
> 1.2.0; its not backwards compatible with earlier version
> --
>
> Key: SPARK-14492
> URL: https://issues.apache.org/jira/browse/SPARK-14492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Sunil Rangwani
>Priority: Critical
>
> Spark SQL, when configured with a Hive version lower than 1.2.0, throws a 
> java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME 
> because this field was introduced in Hive 1.2.0, so it is not possible to use 
> a Hive metastore version lower than 1.2.0 with Spark. The details of the Hive 
> changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 
> {code:java}
> Exception in thread "main" java.lang.NoSuchFieldError: 
> METASTORE_CLIENT_SOCKET_LIFETIME
>   at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
>   at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at org.apache.spark.sql.SQLContext.(SQLContext.scala:271)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14492) Spark SQL 1.6.0 does not work with external Hive metastore version lower than 1.2.0; its not backwards compatible with earlier version

2018-11-17 Thread Sunil Rangwani (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690595#comment-16690595
 ] 

Sunil Rangwani edited comment on SPARK-14492 at 11/17/18 3:27 PM:
--

[~srowen] This is not that at all. It is not about building against varying 
versions of Hive. Please refer to the discussion above.

The {{java.lang.NoSuchFieldError}} is a runtime error. The version used at 
runtime does not have this field!

Either the documentation should be updated to say that the minimum supported 
Hive version is 1.2.x, or this bug should be fixed so that different versions of 
Hive are supported, as the documentation states.


was (Author: sunil.rangwani):
[~srowen] This is not that at all. It is not about building against varying 
versions of Hive. Please refer to the discussion above.

The {{java.lang.NoSuchFieldError}} is a runtime error.

Either the documentation should be updated to say that the minimum supported 
Hive version is 1.2.x, or this bug should be fixed so that different versions of 
Hive are supported, as the documentation states.



> Spark SQL 1.6.0 does not work with external Hive metastore version lower than 
> 1.2.0; its not backwards compatible with earlier version
> --
>
> Key: SPARK-14492
> URL: https://issues.apache.org/jira/browse/SPARK-14492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Sunil Rangwani
>Priority: Critical
>
> Spark SQL, when configured with a Hive version lower than 1.2.0, throws a 
> java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME 
> because this field was introduced in Hive 1.2.0, so it is not possible to use 
> a Hive metastore version lower than 1.2.0 with Spark. The details of the Hive 
> changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 
> {code:java}
> Exception in thread "main" java.lang.NoSuchFieldError: 
> METASTORE_CLIENT_SOCKET_LIFETIME
>   at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
>   at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at org.apache.spark.sql.SQLContext.(SQLContext.scala:271)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25067) Active tasks does not match the total cores of an executor in WebUI

2018-11-17 Thread shahid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690560#comment-16690560
 ] 

shahid commented on SPARK-25067:


Hi [~stanzhai], could you please provide a reproducible test case?

> Active tasks does not match the total cores of an executor in WebUI
> ---
>
> Key: SPARK-25067
> URL: https://issues.apache.org/jira/browse/SPARK-25067
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.2, 2.3.0, 2.3.1
>Reporter: StanZhai
>Priority: Major
> Attachments: WX20180810-144212.png, WechatIMG1.jpeg
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26100) [History server ]Jobs table and Aggregate metrics table are showing lesser number of tasks

2018-11-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26100:


Assignee: (was: Apache Spark)

> [History server ]Jobs table and Aggregate metrics table are showing lesser 
> number of tasks 
> ---
>
> Key: SPARK-26100
> URL: https://issues.apache.org/jira/browse/SPARK-26100
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: Screenshot from 2018-11-17 16-54-42.png, Screenshot from 
> 2018-11-17 16-55-09.png
>
>
> Test step to reproduce:
> 1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}}
> 2)\{{sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad 
> executor")}.collect() }}
>  
> 3) Open Application from the history server UI
> Jobs table and Aggregated metrics are showing lesser number of tasks.
>  !Screenshot from 2018-11-17 16-55-09.png! 
>  
>   !Screenshot from 2018-11-17 16-54-42.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26100) [History server ]Jobs table and Aggregate metrics table are showing lesser number of tasks

2018-11-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690553#comment-16690553
 ] 

Apache Spark commented on SPARK-26100:
--

User 'shahidki31' has created a pull request for this issue:
https://github.com/apache/spark/pull/23038

> [History server ]Jobs table and Aggregate metrics table are showing lesser 
> number of tasks 
> ---
>
> Key: SPARK-26100
> URL: https://issues.apache.org/jira/browse/SPARK-26100
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: Screenshot from 2018-11-17 16-54-42.png, Screenshot from 
> 2018-11-17 16-55-09.png
>
>
> Test step to reproduce:
> 1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}}
> 2)\{{sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad 
> executor")}.collect() }}
>  
> 3) Open Application from the history server UI
> Jobs table and Aggregated metrics are showing lesser number of tasks.
>  !Screenshot from 2018-11-17 16-55-09.png! 
>  
>   !Screenshot from 2018-11-17 16-54-42.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26100) [History server ]Jobs table and Aggregate metrics table are showing lesser number of tasks

2018-11-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26100:


Assignee: Apache Spark

> [History server ]Jobs table and Aggregate metrics table are showing lesser 
> number of tasks 
> ---
>
> Key: SPARK-26100
> URL: https://issues.apache.org/jira/browse/SPARK-26100
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: Apache Spark
>Priority: Major
> Attachments: Screenshot from 2018-11-17 16-54-42.png, Screenshot from 
> 2018-11-17 16-55-09.png
>
>
> Test step to reproduce:
> 1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}}
> 2)\{{sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad 
> executor")}.collect() }}
>  
> 3) Open Application from the history server UI
> Jobs table and Aggregated metrics are showing lesser number of tasks.
>  !Screenshot from 2018-11-17 16-55-09.png! 
>  
>   !Screenshot from 2018-11-17 16-54-42.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26101) Spark Pipe() executes the external app by yarn user not the real user

2018-11-17 Thread Maziyar PANAHI (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maziyar PANAHI updated SPARK-26101:
---
Description: 
Hello,

 

I am using *Spark 2.3.0.cloudera3* on Cloudera cluster. When I start my Spark 
session (Zeppelin, Shell, or spark-submit) my real username is being 
impersonated successfully. That allows YARN to use the right queue based on the 
username, also HDFS knows the permissions. (These all work perfectly without 
any problem. Meaning the cluster has been set up and configured for user 
impersonation)

Example (running Spark by user `panahi` with YARN as a master):
{code:java}
 
18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls to: panahi
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls to: panahi
18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(mpanahi); groups 
with view permissions: Set();
users with modify permissions: Set(panahi); groups with modify permissions: 
Set()
...
18/11/17 13:55:52 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.multivac
start time: 1542459353040
final status: UNDEFINED
tracking URL: http://hadoop-master-1:8088/proxy/application_1542456252041_0006/
user: panahi
{code}
 

However, when I use *Spark RDD Pipe()* it is being executed as `*yarn*` user. 
This makes it impossible to use an external app such as `c/c++` application 
that needs read/write access to HDFS because the user `*yarn*` does not have 
permissions on the user's directory. (also other security and resource 
management issues by executing all the external apps as yarn username)

*How to produce this issue:*
{code:java}
val test = sc.parallelize(Seq("test user")).repartition(1)
val piped = test.pipe(Seq("whoami"))
val c = piped.collect()
result:
test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at repartition at 
:37 piped: org.apache.spark.rdd.RDD[String] = PipedRDD[27] at pipe at 
:37 c: Array[String] = Array(yarn) 
{code}
 

I believe since Spark is the key actor to invoke this execution inside YARN 
cluster, Spark needs to respect the actual/current username. Or maybe there is 
another config for impersonation between Spark and YARN in this situation, but 
I haven't found any.

 

Many thanks.

  was:
Hello,

 

I am using *Spark 2.3.0.cloudera3* on Cloudera cluster. When I start my Spark 
session (Zeppelin, Shell, or spark-submit) my real username is being 
impersonated successfully. That allows YARN to use the right queue based on the 
username, also HDFS knows the permissions.

Example (running Spark by user `panahi`):

 
{code:java}
 
18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls to: panahi
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls to: panahi
18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(mpanahi); groups 
with view permissions: Set();
users with modify permissions: Set(panahi); groups with modify permissions: 
Set()
...
18/11/17 13:55:52 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.multivac
start time: 1542459353040
final status: UNDEFINED
tracking URL: http://hadoop-master-1:8088/proxy/application_1542456252041_0006/
user: panahi
{code}
 

However, when I use Spark RDD Pipe() it is being executed as `yarn` user. This 
makes it impossible to use a `c/c++` application that needs read/write access 
to HDFS because the user `yarn` does not have permissions on the user's 
directory.

How to produce this issue:
{code:java}
val test = sc.parallelize(Seq("test user")).repartition(1)
val piped = test.pipe(Seq("whoami"))
val c = piped.collect()
result:
test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at repartition at 
:37 piped: org.apache.spark.rdd.RDD[String] = PipedRDD[27] at pipe at 
:37 c: Array[String] = Array(yarn) 
{code}
 

I believe since Spark is the key actor to invoke this execution inside YARN 
cluster, Spark needs to respect the actual/current username. Or maybe there is 
another config for impersonation between Spark and YARN in this situation, but 
I haven't found any.

 

Many thanks.


> Spark Pipe() executes the external app by yarn user not the real user
> -
>
> Key: SPARK-26101
> URL: https://issues.apache.org/jira/browse/SPARK-26101
> 

[jira] [Updated] (SPARK-26101) Spark Pipe() executes the external app by yarn user not the real user

2018-11-17 Thread Maziyar PANAHI (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maziyar PANAHI updated SPARK-26101:
---
Description: 
Hello,

 

I am using *Spark 2.3.0.cloudera3* on Cloudera cluster. When I start my Spark 
session (Zeppelin, Shell, or spark-submit) my real username is being 
impersonated successfully. That allows YARN to use the right queue based on the 
username, also HDFS knows the permissions.

Example (running Spark by user `panahi`):

 
{code:java}
 
18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls to: panahi
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls to: panahi
18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(mpanahi); groups 
with view permissions: Set();
users with modify permissions: Set(panahi); groups with modify permissions: 
Set()
...
18/11/17 13:55:52 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.multivac
start time: 1542459353040
final status: UNDEFINED
tracking URL: http://hadoop-master-1:8088/proxy/application_1542456252041_0006/
user: panahi
{code}
 

However, when I use Spark RDD Pipe() it is being executed as `yarn` user. This 
makes it impossible to use a `c/c++` application that needs read/write access 
to HDFS because the user `yarn` does not have permissions on the user's 
directory.

How to produce this issue:
{code:java}
val test = sc.parallelize(Seq("test user")).repartition(1)
val piped = test.pipe(Seq("whoami"))
val c = piped.collect()
result:
test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at repartition at 
:37 piped: org.apache.spark.rdd.RDD[String] = PipedRDD[27] at pipe at 
:37 c: Array[String] = Array(yarn) 
{code}
 

I believe since Spark is the key actor to invoke this execution inside YARN 
cluster, Spark needs to respect the actual/current username. Or maybe there is 
another config for impersonation between Spark and YARN in this situation, but 
I haven't found any.

 

Many thanks.

  was:
Hello,

 

I am using *Spark 2.3.0.cloudera3* on Cloudera cluster. When I start my Spark 
session (Zeppelin, Shell, or spark-submit) my real username is being 
impersonated successfully. That allows YARN to use the right queue based on the 
username, also HDFS knows the permissions.

Example (running Spark by user `panahi`):

```

18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls to: *panahi*
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls to: *panahi*
18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(mpanahi); groups 
with view permissions: Set();
users with modify permissions: Set(*panahi*); groups with modify permissions: 
Set()

...

18/11/17 13:55:52 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.multivac
start time: 1542459353040
final status: UNDEFINED
tracking URL: 
[http://hadoop-master-1:8088/proxy/application_1542456252041_0006/]
user: *panahi*

```

However, when I use Spark RDD Pipe() it is being executed as `yarn` user. This 
makes it impossible to use a `c/c++` application that needs read/write access 
to HDFS because the user `yarn` does not have permissions on the user's 
directory.

How to produce this issue:

```

val test = sc.parallelize(Seq("test user")).repartition(1)

val piped = test.pipe(Seq("whoami"))

val c = piped.collect()

*result:*

test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at repartition at 
:37 piped: org.apache.spark.rdd.RDD[String] = PipedRDD[27] at pipe at 
:37 c: Array[String] = Array(*yarn*)

```

I believe since Spark is the key actor to invoke this execution inside YARN 
cluster, Spark needs to respect the actual/current username. Or maybe there is 
another config for impersonation between Spark and YARN in this situation, but 
I haven't found any.

 

Many thanks.


> Spark Pipe() executes the external app by yarn user not the real user
> -
>
> Key: SPARK-26101
> URL: https://issues.apache.org/jira/browse/SPARK-26101
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Maziyar PANAHI
>Priority: Major
>
> Hello,
>  
> I am using *Spark 2.3.0.cloudera3* on Cloudera cluster. When I start my Spark 
> session 

[jira] [Updated] (SPARK-26101) Spark Pipe() executes the external app by yarn user not the real user

2018-11-17 Thread Maziyar PANAHI (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maziyar PANAHI updated SPARK-26101:
---
Description: 
Hello,

I am using *Spark 2.3.0.cloudera3* on Cloudera cluster. When I start my Spark 
session (Zeppelin, Shell, or spark-submit) my real username is being 
impersonated successfully. That allows YARN to use the right queue based on the 
username, also HDFS knows the permissions. (These all work perfectly without 
any problem. Meaning the cluster has been set up and configured for user 
impersonation)

Example (running Spark by user panahi with YARN as a master):
{code:java}
 
18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls to: panahi
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls to: panahi
18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(mpanahi); groups 
with view permissions: Set();
users with modify permissions: Set(panahi); groups with modify permissions: 
Set()
...
18/11/17 13:55:52 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.multivac
start time: 1542459353040
final status: UNDEFINED
tracking URL: http://hadoop-master-1:8088/proxy/application_1542456252041_0006/
user: panahi
{code}
 

However, when I use *Spark RDD Pipe()* it is being executed as `*yarn*` user. 
This makes it impossible to use an external app such as `c/c++` application 
that needs read/write access to HDFS because the user `*yarn*` does not have 
permissions on the user's directory. (also other security and resource 
management issues by executing all the external apps as yarn username)

*How to produce this issue:*
{code:java}
val test = sc.parallelize(Seq("test user")).repartition(1)
val piped = test.pipe(Seq("whoami"))
val c = piped.collect()
result:
test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at repartition at 
:37 piped: org.apache.spark.rdd.RDD[String] = PipedRDD[27] at pipe at 
:37 c: Array[String] = Array(yarn) 
{code}
 

I believe since Spark is the key actor to invoke this execution inside YARN 
cluster, Spark needs to respect the actual/current username. Or maybe there is 
another config for impersonation between Spark and YARN in this situation, but 
I haven't found any.

 

Many thanks.

  was:
Hello,

 

I am using *Spark 2.3.0.cloudera3* on Cloudera cluster. When I start my Spark 
session (Zeppelin, Shell, or spark-submit) my real username is being 
impersonated successfully. That allows YARN to use the right queue based on the 
username, also HDFS knows the permissions. (These all work perfectly without 
any problem. Meaning the cluster has been set up and configured for user 
impersonation)

Example (running Spark by user `panahi` with YARN as a master):
{code:java}
 
18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls to: panahi
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls to: panahi
18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(mpanahi); groups 
with view permissions: Set();
users with modify permissions: Set(panahi); groups with modify permissions: 
Set()
...
18/11/17 13:55:52 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.multivac
start time: 1542459353040
final status: UNDEFINED
tracking URL: http://hadoop-master-1:8088/proxy/application_1542456252041_0006/
user: panahi
{code}
 

However, when I use *Spark RDD Pipe()* it is being executed as `*yarn*` user. 
This makes it impossible to use an external app such as `c/c++` application 
that needs read/write access to HDFS because the user `*yarn*` does not have 
permissions on the user's directory. (also other security and resource 
management issues by executing all the external apps as yarn username)

*How to produce this issue:*
{code:java}
val test = sc.parallelize(Seq("test user")).repartition(1)
val piped = test.pipe(Seq("whoami"))
val c = piped.collect()
result:
test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at repartition at 
:37 piped: org.apache.spark.rdd.RDD[String] = PipedRDD[27] at pipe at 
:37 c: Array[String] = Array(yarn) 
{code}
 

I believe since Spark is the key actor to invoke this execution inside YARN 
cluster, Spark needs to respect the actual/current username. Or maybe there is 
another config for impersonation between Spark and YARN in this situation, but 
I haven't found any.

 


[jira] [Updated] (SPARK-26100) [History server ]Jobs table and Aggregate metrics table are showing lesser number of tasks

2018-11-17 Thread shahid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shahid updated SPARK-26100:
---
Summary: [History server ]Jobs table and Aggregate metrics table are 
showing lesser number of tasks   (was: Jobs table and Aggregate metrics table 
are showing lesser number of tasks )

> [History server ]Jobs table and Aggregate metrics table are showing lesser 
> number of tasks 
> ---
>
> Key: SPARK-26100
> URL: https://issues.apache.org/jira/browse/SPARK-26100
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: Screenshot from 2018-11-17 16-54-42.png, Screenshot from 
> 2018-11-17 16-55-09.png
>
>
> Test step to reproduce:
> 1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}}
> 2)\{{sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad 
> executor")}.collect() }}
>  
> 3) Open Application from the history server UI
> Jobs table and Aggregated metrics are showing lesser number of tasks.
>  !Screenshot from 2018-11-17 16-55-09.png! 
>  
>   !Screenshot from 2018-11-17 16-54-42.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26101) Spark Pipe() executes the external app by yarn user not the real user

2018-11-17 Thread Maziyar PANAHI (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maziyar PANAHI updated SPARK-26101:
---
Description: 
Hello,

 

I am using *Spark 2.3.0.cloudera3* on Cloudera cluster. When I start my Spark 
session (Zeppelin, Shell, or spark-submit) my real username is being 
impersonated successfully. That allows YARN to use the right queue based on the 
username, also HDFS knows the permissions.

Example (running Spark by user `panahi`):

```

18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls to: *panahi*
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls to: *panahi*
18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(mpanahi); groups 
with view permissions: Set();
users with modify permissions: Set(*panahi*); groups with modify permissions: 
Set()

...

18/11/17 13:55:52 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.multivac
start time: 1542459353040
final status: UNDEFINED
tracking URL: 
[http://hadoop-master-1:8088/proxy/application_1542456252041_0006/]
user: *panahi*

```

However, when I use Spark RDD Pipe() it is being executed as `yarn` user. This 
makes it impossible to use a `c/c++` application that needs read/write access 
to HDFS because the user `yarn` does not have permissions on the user's 
directory.

How to produce this issue:

```

val test = sc.parallelize(Seq("test user")).repartition(1)

val piped = test.pipe(Seq("whoami"))

val c = piped.collect()

*result:*

test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at repartition at 
:37 piped: org.apache.spark.rdd.RDD[String] = PipedRDD[27] at pipe at 
:37 c: Array[String] = Array(*yarn*)

```

I believe since Spark is the key actor to invoke this execution inside YARN 
cluster, Spark needs to respect the actual/current username. Or maybe there is 
another config for impersonation between Spark and YARN in this situation, but 
I haven't found any.

 

Many thanks.

  was:
Hello,

 

I am using `Spark 2.3.0.cloudera3` on Cloudera cluster. When I start my Spark 
session (Zeppelin, Shell, or spark-submit) my real username is being 
impersonated successfully. That allows YARN to use the right queue based on the 
username, also HDFS knows the permissions.

Example (running Spark by user `panahi`):

```

18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls to: *panahi*
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls to: *panahi*
18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(mpanahi); groups 
with view permissions: Set();
users with modify permissions: Set(*panahi*); groups with modify permissions: 
Set()

...

18/11/17 13:55:52 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.multivac
start time: 1542459353040
final status: UNDEFINED
tracking URL: http://hadoop-master-1:8088/proxy/application_1542456252041_0006/
user: *panahi*

```

However, when I use Spark RDD Pipe() it is being executed as `yarn` user. This 
makes it impossible to use a `c/c++` application that needs read/write access 
to HDFS because the user `yarn` does not have permissions on the user's 
directory.

How to produce this issue:

```scala

val test = sc.parallelize(Seq("test user")).repartition(1)

val piped = test.pipe(Seq("whoami"))

val c = piped.collect()

*result:*

test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at repartition at 
:37 piped: org.apache.spark.rdd.RDD[String] = PipedRDD[27] at pipe at 
:37 c: Array[String] = Array(*yarn*)

```

I believe since Spark is the key actor to invoke this execution inside YARN 
cluster, Spark needs to respect the actual/current username. Or maybe there is 
another config for impersonation between Spark and YARN in this situation, but 
I haven't found any.

 

Many thanks.


> Spark Pipe() executes the external app by yarn user not the real user
> -
>
> Key: SPARK-26101
> URL: https://issues.apache.org/jira/browse/SPARK-26101
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Maziyar PANAHI
>Priority: Major
>
> Hello,
>  
> I am using *Spark 2.3.0.cloudera3* on Cloudera cluster. When I start my Spark 
> session 

[jira] [Created] (SPARK-26101) Spark Pipe() executes the external app by yarn user not the real user

2018-11-17 Thread Maziyar PANAHI (JIRA)
Maziyar PANAHI created SPARK-26101:
--

 Summary: Spark Pipe() executes the external app by yarn user not 
the real user
 Key: SPARK-26101
 URL: https://issues.apache.org/jira/browse/SPARK-26101
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.3.0
Reporter: Maziyar PANAHI


Hello,

 

I am using `Spark 2.3.0.cloudera3` on Cloudera cluster. When I start my Spark 
session (Zeppelin, Shell, or spark-submit) my real username is being 
impersonated successfully. That allows YARN to use the right queue based on the 
username, also HDFS knows the permissions.

Example (running Spark by user `panahi`):

```

18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls to: *panahi*
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls to: *panahi*
18/11/17 13:55:47 INFO spark.SecurityManager: Changing view acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: Changing modify acls groups to:
18/11/17 13:55:47 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(mpanahi); groups 
with view permissions: Set();
users with modify permissions: Set(*panahi*); groups with modify permissions: 
Set()

...

18/11/17 13:55:52 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.multivac
start time: 1542459353040
final status: UNDEFINED
tracking URL: http://hadoop-master-1:8088/proxy/application_1542456252041_0006/
user: *panahi*

```

However, when I use Spark RDD Pipe() it is being executed as `yarn` user. This 
makes it impossible to use a `c/c++` application that needs read/write access 
to HDFS because the user `yarn` does not have permissions on the user's 
directory.

How to produce this issue:

```scala

val test = sc.parallelize(Seq("test user")).repartition(1)

val piped = test.pipe(Seq("whoami"))

val c = piped.collect()

*result:*

test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[26] at repartition at 
:37 piped: org.apache.spark.rdd.RDD[String] = PipedRDD[27] at pipe at 
:37 c: Array[String] = Array(*yarn*)

```

I believe since Spark is the key actor to invoke this execution inside YARN 
cluster, Spark needs to respect the actual/current username. Or maybe there is 
another config for impersonation between Spark and YARN in this situation, but 
I haven't found any.

 

Many thanks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26100) Jobs table and Aggregate metrics table are showing lesser number of tasks

2018-11-17 Thread shahid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shahid updated SPARK-26100:
---
Description: 
Test step to reproduce:

1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}}

2)\{{sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad 
executor")}.collect() }}

 

3) Open Application from the history server UI

Jobs table and Aggregated metrics are showing lesser number of tasks.
 !Screenshot from 2018-11-17 16-55-09.png! 
 

 

 

 

  !Screenshot from 2018-11-17 16-54-42.png! 

  was:
Test step to reproduce:

1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}}

2)\{{sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad 
executor")}.collect() }}

 

3) Open Application from the history server UI

Jobs table and Aggregated metrics are showing lesser number of tasks.

 

 

 

 

  !Screenshot from 2018-11-17 16-54-42.png! 


> Jobs table and Aggregate metrics table are showing lesser number of tasks 
> --
>
> Key: SPARK-26100
> URL: https://issues.apache.org/jira/browse/SPARK-26100
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: Screenshot from 2018-11-17 16-54-42.png, Screenshot from 
> 2018-11-17 16-54-42.png, Screenshot from 2018-11-17 16-55-09.png
>
>
> Test step to reproduce:
> 1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}}
> 2)\{{sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad 
> executor")}.collect() }}
>  
> 3) Open Application from the history server UI
> Jobs table and Aggregated metrics are showing lesser number of tasks.
>  !Screenshot from 2018-11-17 16-55-09.png! 
>  
>  
>  
>  
>   !Screenshot from 2018-11-17 16-54-42.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26099) Verification of the corrupt column in from_csv/from_json

2018-11-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26099:


Assignee: (was: Apache Spark)

> Verification of the corrupt column in from_csv/from_json
> 
>
> Key: SPARK-26099
> URL: https://issues.apache.org/jira/browse/SPARK-26099
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> The corrupt column specified via the JSON/CSV option *columnNameOfCorruptRecord* 
> must be of string type and not nullable. This check already exists in 
> DataFrameReader and in the JSON/CSV FileFormats; the same check should be added 
> to CsvToStructs and to JsonToStructs.
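
For reference, a sketch of how such a corrupt column is typically declared with 
from_json (column names are illustrative; assumes a spark-shell session, and 
whether malformed text actually lands in the column depends on the parse mode and 
Spark version):
{code:java}
// Sketch (illustrative names): declare a string-typed corrupt-record field in the
// schema and point columnNameOfCorruptRecord at it, mirroring the DataFrameReader usage.
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import spark.implicits._

val schema = new StructType()
  .add("id", IntegerType)
  .add("_corrupt_record", StringType)

val parsed = Seq("""{"id": 1}""", """{"id": oops}""").toDF("value")
  .select(from_json($"value", schema,
    Map("columnNameOfCorruptRecord" -> "_corrupt_record")).as("parsed"))
parsed.select("parsed.*").show(false)
{code}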



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26099) Verification of the corrupt column in from_csv/from_json

2018-11-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26099:


Assignee: Apache Spark

> Verification of the corrupt column in from_csv/from_json
> 
>
> Key: SPARK-26099
> URL: https://issues.apache.org/jira/browse/SPARK-26099
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> The corrupt column specified via the JSON/CSV option *columnNameOfCorruptRecord* 
> must be of string type and not nullable. This check already exists in 
> DataFrameReader and in the JSON/CSV FileFormats; the same check should be added 
> to CsvToStructs and to JsonToStructs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26091) Upgrade to 2.3.4 for Hive Metastore Client 2.3

2018-11-17 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-26091.
---
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/23059

> Upgrade to 2.3.4 for Hive Metastore Client 2.3
> --
>
> Key: SPARK-26091
> URL: https://issues.apache.org/jira/browse/SPARK-26091
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26100) Jobs table and Aggregate metrics table are showing lesser number of tasks

2018-11-17 Thread shahid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shahid updated SPARK-26100:
---
Attachment: (was: Screenshot from 2018-11-17 16-54-42.png)

> Jobs table and Aggregate metrics table are showing lesser number of tasks 
> --
>
> Key: SPARK-26100
> URL: https://issues.apache.org/jira/browse/SPARK-26100
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: Screenshot from 2018-11-17 16-54-42.png, Screenshot from 
> 2018-11-17 16-55-09.png
>
>
> Test step to reproduce:
> 1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}}
> 2)\{{sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad 
> executor")}.collect() }}
>  
> 3) Open Application from the history server UI
> Jobs table and Aggregated metrics are showing lesser number of tasks.
>  !Screenshot from 2018-11-17 16-55-09.png! 
>  
>   !Screenshot from 2018-11-17 16-54-42.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26099) Verification of the corrupt column in from_csv/from_json

2018-11-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690498#comment-16690498
 ] 

Apache Spark commented on SPARK-26099:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/23070

> Verification of the corrupt column in from_csv/from_json
> 
>
> Key: SPARK-26099
> URL: https://issues.apache.org/jira/browse/SPARK-26099
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> The corrupt column specified via the JSON/CSV option *columnNameOfCorruptRecord* 
> must be of string type and not nullable. This check already exists in 
> DataFrameReader and in the JSON/CSV FileFormats; the same check should be added 
> to CsvToStructs and to JsonToStructs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26026) Published Scaladoc jars missing from Maven Central

2018-11-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690496#comment-16690496
 ] 

Apache Spark commented on SPARK-26026:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/23069

> Published Scaladoc jars missing from Maven Central
> --
>
> Key: SPARK-26026
> URL: https://issues.apache.org/jira/browse/SPARK-26026
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Long Cao
>Priority: Minor
>
> For 2.3.x and beyond, it appears that published *-javadoc.jars are missing.
> For concrete examples:
>  * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.1/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.2/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.4.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.12/2.4.0/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
> After some searching, I'm venturing a guess that [this 
> commit|https://github.com/apache/spark/commit/12ab7f7e89ec9e102859ab3b710815d3058a2e8d#diff-600376dffeb79835ede4a0b285078036L2033]
>  removed packaging Scaladoc with the rest of the distribution.
> I don't think it's a huge problem since the versioned Scaladocs are hosted on 
> apache.org, but I use an external documentation/search tool 
> ([Dash|https://kapeli.com/dash]) that operates by looking up published 
> javadoc jars and it'd be nice to have these available.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26100) Jobs table and Aggregate metrics table are showing lesser number of tasks

2018-11-17 Thread shahid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shahid updated SPARK-26100:
---
Attachment: (was: Screenshot from 2018-11-17 16-55-09.png)

> Jobs table and Aggregate metrics table are showing lesser number of tasks 
> --
>
> Key: SPARK-26100
> URL: https://issues.apache.org/jira/browse/SPARK-26100
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: Screenshot from 2018-11-17 16-54-42.png, Screenshot from 
> 2018-11-17 16-54-42.png, Screenshot from 2018-11-17 16-55-09.png
>
>
> Test step to reproduce:
> 1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}}
> 2) {{sc.parallelize(1 to 1, 10).map { x => throw new RuntimeException("Bad
> executor") }.collect()}}
>  
> 3) Open the application from the history server UI.
> The Jobs table and the Aggregated metrics table show fewer tasks than were actually run.
>  !Screenshot from 2018-11-17 16-55-09.png! 
>  
>   !Screenshot from 2018-11-17 16-54-42.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26026) Published Scaladoc jars missing from Maven Central

2018-11-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690495#comment-16690495
 ] 

Apache Spark commented on SPARK-26026:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/23069

> Published Scaladoc jars missing from Maven Central
> --
>
> Key: SPARK-26026
> URL: https://issues.apache.org/jira/browse/SPARK-26026
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Long Cao
>Priority: Minor
>
> For 2.3.x and beyond, it appears that published *-javadoc.jars are missing.
> For concrete examples:
>  * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.1/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.2/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.4.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.12/2.4.0/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
> After some searching, I'm venturing a guess that [this 
> commit|https://github.com/apache/spark/commit/12ab7f7e89ec9e102859ab3b710815d3058a2e8d#diff-600376dffeb79835ede4a0b285078036L2033]
>  removed packaging Scaladoc with the rest of the distribution.
> I don't think it's a huge problem since the versioned Scaladocs are hosted on 
> apache.org, but I use an external documentation/search tool 
> ([Dash|https://kapeli.com/dash]) that operates by looking up published 
> javadoc jars and it'd be nice to have these available.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26100) Jobs table and Aggregate metrics table are showing lesser number of tasks

2018-11-17 Thread shahid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shahid updated SPARK-26100:
---
Attachment: Screenshot from 2018-11-17 16-55-09.png

> Jobs table and Aggregate metrics table are showing lesser number of tasks 
> --
>
> Key: SPARK-26100
> URL: https://issues.apache.org/jira/browse/SPARK-26100
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: Screenshot from 2018-11-17 16-54-42.png, Screenshot from 
> 2018-11-17 16-54-42.png, Screenshot from 2018-11-17 16-55-09.png
>
>
> Test step to reproduce:
> 1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}}
> 2) {{sc.parallelize(1 to 1, 10).map { x => throw new RuntimeException("Bad
> executor") }.collect()}}
>  
> 3) Open the application from the history server UI.
> The Jobs table and the Aggregated metrics table show fewer tasks than were actually run.
>  
>  
>  
>  
>   !Screenshot from 2018-11-17 16-54-42.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26026) Published Scaladoc jars missing from Maven Central

2018-11-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26026:


Assignee: (was: Apache Spark)

> Published Scaladoc jars missing from Maven Central
> --
>
> Key: SPARK-26026
> URL: https://issues.apache.org/jira/browse/SPARK-26026
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Long Cao
>Priority: Minor
>
> For 2.3.x and beyond, it appears that published *-javadoc.jars are missing.
> For concrete examples:
>  * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.1/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.2/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.4.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.12/2.4.0/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
> After some searching, I'm venturing a guess that [this 
> commit|https://github.com/apache/spark/commit/12ab7f7e89ec9e102859ab3b710815d3058a2e8d#diff-600376dffeb79835ede4a0b285078036L2033]
>  removed packaging Scaladoc with the rest of the distribution.
> I don't think it's a huge problem since the versioned Scaladocs are hosted on 
> apache.org, but I use an external documentation/search tool 
> ([Dash|https://kapeli.com/dash]) that operates by looking up published 
> javadoc jars and it'd be nice to have these available.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26100) Jobs table and Aggregate metrics table are showing lesser number of tasks

2018-11-17 Thread shahid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shahid updated SPARK-26100:
---
Description: 
Test step to reproduce:

1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}}

2) {{sc.parallelize(1 to 1, 10).map { x => throw new RuntimeException("Bad
executor") }.collect()}}

3) Open the application from the history server UI.

The Jobs table and the Aggregated metrics table show fewer tasks than were actually run.
 !Screenshot from 2018-11-17 16-55-09.png! 
 

  !Screenshot from 2018-11-17 16-54-42.png! 

  was:
Test step to reproduce:

1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}}

2) {{sc.parallelize(1 to 1, 10).map { x => throw new RuntimeException("Bad
executor") }.collect()}}

3) Open the application from the history server UI.

The Jobs table and the Aggregated metrics table show fewer tasks than were actually run.
 !Screenshot from 2018-11-17 16-55-09.png! 
 

 

 

 

  !Screenshot from 2018-11-17 16-54-42.png! 


> Jobs table and Aggregate metrics table are showing lesser number of tasks 
> --
>
> Key: SPARK-26100
> URL: https://issues.apache.org/jira/browse/SPARK-26100
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: Screenshot from 2018-11-17 16-54-42.png, Screenshot from 
> 2018-11-17 16-54-42.png, Screenshot from 2018-11-17 16-55-09.png
>
>
> Test step to reproduce:
> 1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}}
> 2) {{sc.parallelize(1 to 1, 10).map { x => throw new RuntimeException("Bad
> executor") }.collect()}}
>  
> 3) Open the application from the history server UI.
> The Jobs table and the Aggregated metrics table show fewer tasks than were actually run.
>  !Screenshot from 2018-11-17 16-55-09.png! 
>  
>   !Screenshot from 2018-11-17 16-54-42.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26026) Published Scaladoc jars missing from Maven Central

2018-11-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26026:


Assignee: Apache Spark

> Published Scaladoc jars missing from Maven Central
> --
>
> Key: SPARK-26026
> URL: https://issues.apache.org/jira/browse/SPARK-26026
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Long Cao
>Assignee: Apache Spark
>Priority: Minor
>
> For 2.3.x and beyond, it appears that published *-javadoc.jars are missing.
> For concrete examples:
>  * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.1/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.2/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.4.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.12/2.4.0/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
> After some searching, I'm venturing a guess that [this 
> commit|https://github.com/apache/spark/commit/12ab7f7e89ec9e102859ab3b710815d3058a2e8d#diff-600376dffeb79835ede4a0b285078036L2033]
>  removed packaging Scaladoc with the rest of the distribution.
> I don't think it's a huge problem since the versioned Scaladocs are hosted on 
> apache.org, but I use an external documentation/search tool 
> ([Dash|https://kapeli.com/dash]) that operates by looking up published 
> javadoc jars and it'd be nice to have these available.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26100) Jobs table and Aggregate metrics table are showing lesser number of tasks

2018-11-17 Thread shahid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690494#comment-16690494
 ] 

shahid commented on SPARK-26100:


Thanks. I am working on it.

> Jobs table and Aggregate metrics table are showing lesser number of tasks 
> --
>
> Key: SPARK-26100
> URL: https://issues.apache.org/jira/browse/SPARK-26100
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: Screenshot from 2018-11-17 16-54-42.png, Screenshot from 
> 2018-11-17 16-54-42.png, Screenshot from 2018-11-17 16-55-09.png
>
>
> Test step to reproduce:
> 1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}}
> 2) {{sc.parallelize(1 to 1, 10).map { x => throw new RuntimeException("Bad
> executor") }.collect()}}
>  
> 3) Open the application from the history server UI.
> The Jobs table and the Aggregated metrics table show fewer tasks than were actually run.
>  
> !image-2018-11-17-16-55-37-226.png!
>  
> !image-2018-11-17-16-55-58-934.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26100) Jobs table and Aggregate metrics table are showing lesser number of tasks

2018-11-17 Thread shahid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shahid updated SPARK-26100:
---
Attachment: Screenshot from 2018-11-17 16-54-42.png

> Jobs table and Aggregate metrics table are showing lesser number of tasks 
> --
>
> Key: SPARK-26100
> URL: https://issues.apache.org/jira/browse/SPARK-26100
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: Screenshot from 2018-11-17 16-54-42.png, Screenshot from 
> 2018-11-17 16-54-42.png, Screenshot from 2018-11-17 16-55-09.png
>
>
> Test step to reproduce:
> 1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}}
> 2) {{sc.parallelize(1 to 1, 10).map { x => throw new RuntimeException("Bad
> executor") }.collect()}}
>  
> 3) Open the application from the history server UI.
> The Jobs table and the Aggregated metrics table show fewer tasks than were actually run.
>  
> !image-2018-11-17-16-55-37-226.png!
>  
> !image-2018-11-17-16-55-58-934.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26100) Jobs table and Aggregate metrics table are showing lesser number of tasks

2018-11-17 Thread shahid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shahid updated SPARK-26100:
---
Description: 
Test step to reproduce:

1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}}

2) {{sc.parallelize(1 to 1, 10).map { x => throw new RuntimeException("Bad
executor") }.collect()}}

3) Open the application from the history server UI.

The Jobs table and the Aggregated metrics table show fewer tasks than were actually run.

 

 

 

 

  !Screenshot from 2018-11-17 16-54-42.png! 

  was:
Test step to reproduce:

1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}}

2) {{sc.parallelize(1 to 1, 10).map { x => throw new RuntimeException("Bad
executor") }.collect()}}

3) Open the application from the history server UI.

The Jobs table and the Aggregated metrics table show fewer tasks than were actually run.

 

!image-2018-11-17-16-55-37-226.png!

 

!image-2018-11-17-16-55-58-934.png!

 


> Jobs table and Aggregate metrics table are showing lesser number of tasks 
> --
>
> Key: SPARK-26100
> URL: https://issues.apache.org/jira/browse/SPARK-26100
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: Screenshot from 2018-11-17 16-54-42.png, Screenshot from 
> 2018-11-17 16-54-42.png, Screenshot from 2018-11-17 16-55-09.png
>
>
> Test step to reproduce:
> 1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}}
> 2) {{sc.parallelize(1 to 1, 10).map { x => throw new RuntimeException("Bad
> executor") }.collect()}}
>  
> 3) Open the application from the history server UI.
> The Jobs table and the Aggregated metrics table show fewer tasks than were actually run.
>  
>  
>  
>  
>   !Screenshot from 2018-11-17 16-54-42.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26100) Jobs table and Aggregate metrics table are showing lesser number of tasks

2018-11-17 Thread shahid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shahid updated SPARK-26100:
---
Attachment: Screenshot from 2018-11-17 16-55-09.png

> Jobs table and Aggregate metrics table are showing lesser number of tasks 
> --
>
> Key: SPARK-26100
> URL: https://issues.apache.org/jira/browse/SPARK-26100
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: Screenshot from 2018-11-17 16-54-42.png, Screenshot from 
> 2018-11-17 16-54-42.png, Screenshot from 2018-11-17 16-55-09.png
>
>
> Test step to reproduce:
> 1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}}
> 2) {{sc.parallelize(1 to 1, 10).map { x => throw new RuntimeException("Bad
> executor") }.collect()}}
>  
> 3) Open the application from the history server UI.
> The Jobs table and the Aggregated metrics table show fewer tasks than were actually run.
>  
> !image-2018-11-17-16-55-37-226.png!
>  
> !image-2018-11-17-16-55-58-934.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26100) Jobs table and Aggregate metrics table are showing lesser number of tasks

2018-11-17 Thread shahid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shahid updated SPARK-26100:
---
Attachment: Screenshot from 2018-11-17 16-54-42.png

> Jobs table and Aggregate metrics table are showing lesser number of tasks 
> --
>
> Key: SPARK-26100
> URL: https://issues.apache.org/jira/browse/SPARK-26100
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: Screenshot from 2018-11-17 16-54-42.png, Screenshot from 
> 2018-11-17 16-54-42.png, Screenshot from 2018-11-17 16-55-09.png
>
>
> Test step to reproduce:
> 1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}}
> 2) {{sc.parallelize(1 to 1, 10).map { x => throw new RuntimeException("Bad
> executor") }.collect()}}
>  
> 3) Open the application from the history server UI.
> The Jobs table and the Aggregated metrics table show fewer tasks than were actually run.
>  
> !image-2018-11-17-16-55-37-226.png!
>  
> !image-2018-11-17-16-55-58-934.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26100) Jobs table and Aggregate metrics table are showing lesser number of tasks

2018-11-17 Thread ABHISHEK KUMAR GUPTA (JIRA)
ABHISHEK KUMAR GUPTA created SPARK-26100:


 Summary: Jobs table and Aggregate metrics table are showing lesser 
number of tasks 
 Key: SPARK-26100
 URL: https://issues.apache.org/jira/browse/SPARK-26100
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Web UI
Affects Versions: 2.3.2
Reporter: ABHISHEK KUMAR GUPTA


Test step to reproduce:

1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}}

2) {{sc.parallelize(1 to 1, 10).map { x => throw new RuntimeException("Bad
executor") }.collect()}}

3) Open the application from the history server UI.

The Jobs table and the Aggregated metrics table show fewer tasks than were actually run.

 

!image-2018-11-17-16-55-37-226.png!

 

!image-2018-11-17-16-55-58-934.png!

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26099) Verification of the corrupt column in from_csv/from_json

2018-11-17 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-26099:
--

 Summary: Verification of the corrupt column in from_csv/from_json
 Key: SPARK-26099
 URL: https://issues.apache.org/jira/browse/SPARK-26099
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


The corrupt column specified via the JSON/CSV option *columnNameOfCorruptRecord* 
must be of string type and not nullable. The check already exists in 
DataFrameReader and JSON/CSVFileFormat; the same check should be added to 
CsvToStructs and JsonToStructs.
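
For illustration, a minimal Scala sketch of the kind of verification described above, written against the public StructType API (the helper name and exact error message are assumptions, not Spark's actual code):

{code}
import org.apache.spark.sql.types.{StringType, StructType}

// Hypothetical helper mirroring the check described above: if the schema contains
// the column named by columnNameOfCorruptRecord, it must be string-typed.
// (The description above also constrains the column's nullability.)
def verifyColumnNameOfCorruptRecord(schema: StructType, corruptColumn: String): Unit = {
  schema.find(_.name == corruptColumn).foreach { field =>
    require(field.dataType == StringType,
      s"The field for corrupt records ('$corruptColumn') must be of string type, " +
        s"but ${field.dataType.simpleString} was found")
  }
}
{code}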



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26056) java api spark streaming spark-avro ui

2018-11-17 Thread Gengliang Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690479#comment-16690479
 ] 

Gengliang Wang commented on SPARK-26056:


Did you use the Databricks spark-avro package or the built-in spark-avro from the 2.4 release?
https://spark.apache.org/docs/latest/sql-data-sources-avro.html

If you are using the Databricks one, could you also try the built-in one from the 2.4 
release?
Otherwise, if you are already using the built-in spark-avro, could you provide more details 
about how you use it?
Thanks!
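
For reference, here is a minimal Scala sketch (the paths are placeholders, not taken from this report) of writing Avro through the built-in module that ships alongside the 2.4 release instead of the Databricks package:

{code}
// Sketch only: submit with the built-in Avro module on the classpath, e.g.
//   spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.0 ...
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("builtin-avro-check").getOrCreate()
val df = spark.read.json("/tmp/some-input")        // any source DataFrame
df.write.format("avro").save("/tmp/avro-output")   // "avro" resolves to the built-in data source
{code}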


> java api spark streaming spark-avro ui 
> ---
>
> Key: SPARK-26056
> URL: https://issues.apache.org/jira/browse/SPARK-26056
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming, Web UI
>Affects Versions: 2.3.2
>Reporter: wish
>Priority: Major
> Attachments: sql.jpg
>
>
> When I use the Java API with Spark Streaming to read from Kafka and save Avro (Databricks 
> spark-avro dependency),
> the Spark UI shows the SQL tab repeated again and again.
>  
> With the Scala API there is no problem.
>  
> A normal UI looks like this:
>  * [Jobs|http://ebs-ali-beijing-datalake1:4044/jobs/]
>  * [Stages|http://ebs-ali-beijing-datalake1:4044/stages/]
>  * [Storage|http://ebs-ali-beijing-datalake1:4044/storage/]
>  * [Environment|http://ebs-ali-beijing-datalake1:4044/environment/]
>  * [Executors|http://ebs-ali-beijing-datalake1:4044/executors/]
>  * [SQL|http://ebs-ali-beijing-datalake1:4044/SQL/]
>  * [Streaming|http://ebs-ali-beijing-datalake1:4044/streaming/]
> But with the Java API the UI looks like this:
> Jobs  Stages Storage Environment Executors SQL Streaming SQL SQL SQL SQL SQL 
> SQL  ..SQL



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26056) java api spark streaming spark-avro ui

2018-11-17 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690475#comment-16690475
 ] 

Hyukjin Kwon commented on SPARK-26056:
--

adding [~Gengliang.Wang] FYI

> java api spark streaming spark-avro ui 
> ---
>
> Key: SPARK-26056
> URL: https://issues.apache.org/jira/browse/SPARK-26056
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming, Web UI
>Affects Versions: 2.3.2
>Reporter: wish
>Priority: Major
> Attachments: sql.jpg
>
>
> When I use the Java API with Spark Streaming to read from Kafka and save Avro (Databricks 
> spark-avro dependency),
> the Spark UI shows the SQL tab repeated again and again.
>  
> With the Scala API there is no problem.
>  
> A normal UI looks like this:
>  * [Jobs|http://ebs-ali-beijing-datalake1:4044/jobs/]
>  * [Stages|http://ebs-ali-beijing-datalake1:4044/stages/]
>  * [Storage|http://ebs-ali-beijing-datalake1:4044/storage/]
>  * [Environment|http://ebs-ali-beijing-datalake1:4044/environment/]
>  * [Executors|http://ebs-ali-beijing-datalake1:4044/executors/]
>  * [SQL|http://ebs-ali-beijing-datalake1:4044/SQL/]
>  * [Streaming|http://ebs-ali-beijing-datalake1:4044/streaming/]
> But with the Java API the UI looks like this:
> Jobs  Stages Storage Environment Executors SQL Streaming SQL SQL SQL SQL SQL 
> SQL  ..SQL



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26098) Show associated SQL query in Job page

2018-11-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26098:


Assignee: (was: Apache Spark)

> Show associated SQL query in Job page
> -
>
> Key: SPARK-26098
> URL: https://issues.apache.org/jira/browse/SPARK-26098
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> For jobs associated with SQL queries, it would be easier to understand the 
> context if the SQL query were shown on the Job detail page.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26098) Show associated SQL query in Job page

2018-11-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690453#comment-16690453
 ] 

Apache Spark commented on SPARK-26098:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/23068

> Show associated SQL query in Job page
> -
>
> Key: SPARK-26098
> URL: https://issues.apache.org/jira/browse/SPARK-26098
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> For jobs associated with SQL queries, it would be easier to understand the 
> context if the SQL query were shown on the Job detail page.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26098) Show associated SQL query in Job page

2018-11-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26098:


Assignee: Apache Spark

> Show associated SQL query in Job page
> -
>
> Key: SPARK-26098
> URL: https://issues.apache.org/jira/browse/SPARK-26098
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> For jobs associated with SQL queries, it would be easier to understand the 
> context if the SQL query were shown on the Job detail page.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26098) Show associated SQL query in Job page

2018-11-17 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-26098:
--

 Summary: Show associated SQL query in Job page
 Key: SPARK-26098
 URL: https://issues.apache.org/jira/browse/SPARK-26098
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 3.0.0
Reporter: Gengliang Wang


For jobs associated with SQL queries, it would be easier to understand the 
context if the SQL query were shown on the Job detail page.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24255) Require Java 8 in SparkR description

2018-11-17 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690437#comment-16690437
 ] 

Kazuaki Ishizaki commented on SPARK-24255:
--

I do not know of an existing library that parses the output of {{java -version}}.
You may also need to account for the difference between OpenJDK and Oracle JDK, as shown 
[here|https://stackoverflow.com/questions/36445502/bash-command-to-check-if-oracle-or-openjdk-java-version-is-installed-on-linux]
 and [there|https://qiita.com/mao172/items/42aa841280dc5a4e9924].

Output of OpenJDK 12-ea.
{code}
$ ../OpenJDK-12/java -version
openjdk version "12-ea" 2019-03-19
OpenJDK Runtime Environment (build 12-ea+20)
OpenJDK 64-Bit Server VM (build 12-ea+20, mixed mode, sharing)

$ ../OpenJDK-12/java Version
jave.specification.version=12
jave.version=12-ea
jave.version.split(".")[0]=12-ea
{code}
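
As an illustration only (the SparkR check itself is written in R; this is just the parsing idea sketched in Scala), the major version can be derived from the JVM's own properties, covering both the pre-9 "1.x" scheme and the newer single-number scheme:

{code}
// Sketch: derive the Java major version from java.specification.version,
// which is "1.8" on Java 8 and "11", "12", ... on later releases.
val spec = sys.props("java.specification.version")
val major =
  if (spec.startsWith("1.")) spec.stripPrefix("1.").takeWhile(_.isDigit).toInt
  else spec.takeWhile(_.isDigit).toInt
require(major >= 8, s"Java 8 or later is required; found specification version $spec")
{code}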

> Require Java 8 in SparkR description
> 
>
> Key: SPARK-24255
> URL: https://issues.apache.org/jira/browse/SPARK-24255
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Shivaram Venkataraman
>Assignee: Shivaram Venkataraman
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> CRAN checks require that the Java version be set both in package description 
> and checked during runtime.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org