date:20180511

[jira] [Assigned] (SPARK-24256) ExpressionEncoder should support user-defined types as fields of Scala case class and tuple

2018-05-11 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24256:


Assignee: Apache Spark

> ExpressionEncoder should support user-defined types as fields of Scala case 
> class and tuple
> ---
>
> Key: SPARK-24256
> URL: https://issues.apache.org/jira/browse/SPARK-24256
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Fangshi Li
>Assignee: Apache Spark
>Priority: Major
>
> Right now, ExpressionEncoder supports ser/de of primitive types, as well as 
> scala case class, tuple and java bean class. Spark's Dataset natively 
> supports these mentioned types, but we find Dataset is not flexible for other 
> user-defined types and encoders.
> For example, spark-avro has an AvroEncoder for ser/de Avro types in Dataset. 
> Although we can use AvroEncoder to define Dataset with types being the Avro 
> Generic or Specific Record, using such Avro typed Dataset has many 
> limitations: 
>  1. We can not use joinWith on this Dataset since the result is a tuple, but 
> Avro types cannot be the field of this tuple.
>  2. We can not use some type-safe aggregation methods on this Dataset, such 
> as KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
>  3. We cannot augment an Avro SpecificRecord with additional primitive fields 
> together in a case class, which we find is a very common use case.
> The limitation that Spark does not support define a Scala case class/tuple 
> with subfields being any other user-defined type, is because 
> ExpressionEncoder does not discover the implicit Encoder for the user-defined 
> field types, thus can not use any Encoder to serde the user-defined fields in 
> case class/tuple.
> To address this issue, we propose a trait as a contract(between 
> ExpressionEncoder and any other user-defined Encoder) to enable case 
> class/tuple/java bean's ExpressionEncoder to discover the 
> serializer/deserializer/schema from the Encoder of the user-defined type.
> With this proposed patch and our minor modification in AvroEncoder, we remove 
> these limitations with cluster-default conf 
> spark.expressionencoder.org.apache.avro.specific.SpecificRecord = 
> com.databricks.spark.avro.AvroEncoder$
> This is a patch we have implemented internally and has been used for a few 
> quarters. We want to propose to upstream as we think this is a useful feature 
> to make Dataset more flexible to user types.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-24256) ExpressionEncoder should support user-defined types as fields of Scala case class and tuple

2018-05-11 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472933#comment-16472933
 ] 

Apache Spark commented on SPARK-24256:
--

User 'fangshil' has created a pull request for this issue:
https://github.com/apache/spark/pull/21310

> ExpressionEncoder should support user-defined types as fields of Scala case 
> class and tuple
> ---
>
> Key: SPARK-24256
> URL: https://issues.apache.org/jira/browse/SPARK-24256
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Fangshi Li
>Priority: Major
>
> Right now, ExpressionEncoder supports ser/de of primitive types, as well as 
> scala case class, tuple and java bean class. Spark's Dataset natively 
> supports these mentioned types, but we find Dataset is not flexible for other 
> user-defined types and encoders.
> For example, spark-avro has an AvroEncoder for ser/de Avro types in Dataset. 
> Although we can use AvroEncoder to define Dataset with types being the Avro 
> Generic or Specific Record, using such Avro typed Dataset has many 
> limitations: 
>  1. We can not use joinWith on this Dataset since the result is a tuple, but 
> Avro types cannot be the field of this tuple.
>  2. We can not use some type-safe aggregation methods on this Dataset, such 
> as KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
>  3. We cannot augment an Avro SpecificRecord with additional primitive fields 
> together in a case class, which we find is a very common use case.
> The limitation that Spark does not support define a Scala case class/tuple 
> with subfields being any other user-defined type, is because 
> ExpressionEncoder does not discover the implicit Encoder for the user-defined 
> field types, thus can not use any Encoder to serde the user-defined fields in 
> case class/tuple.
> To address this issue, we propose a trait as a contract(between 
> ExpressionEncoder and any other user-defined Encoder) to enable case 
> class/tuple/java bean's ExpressionEncoder to discover the 
> serializer/deserializer/schema from the Encoder of the user-defined type.
> With this proposed patch and our minor modification in AvroEncoder, we remove 
> these limitations with cluster-default conf 
> spark.expressionencoder.org.apache.avro.specific.SpecificRecord = 
> com.databricks.spark.avro.AvroEncoder$
> This is a patch we have implemented internally and has been used for a few 
> quarters. We want to propose to upstream as we think this is a useful feature 
> to make Dataset more flexible to user types.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-24256) ExpressionEncoder should support user-defined types as fields of Scala case class and tuple

2018-05-11 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24256:


Assignee: (was: Apache Spark)

> ExpressionEncoder should support user-defined types as fields of Scala case 
> class and tuple
> ---
>
> Key: SPARK-24256
> URL: https://issues.apache.org/jira/browse/SPARK-24256
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Fangshi Li
>Priority: Major
>
> Right now, ExpressionEncoder supports ser/de of primitive types, as well as 
> scala case class, tuple and java bean class. Spark's Dataset natively 
> supports these mentioned types, but we find Dataset is not flexible for other 
> user-defined types and encoders.
> For example, spark-avro has an AvroEncoder for ser/de Avro types in Dataset. 
> Although we can use AvroEncoder to define Dataset with types being the Avro 
> Generic or Specific Record, using such Avro typed Dataset has many 
> limitations: 
>  1. We can not use joinWith on this Dataset since the result is a tuple, but 
> Avro types cannot be the field of this tuple.
>  2. We can not use some type-safe aggregation methods on this Dataset, such 
> as KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
>  3. We cannot augment an Avro SpecificRecord with additional primitive fields 
> together in a case class, which we find is a very common use case.
> The limitation that Spark does not support define a Scala case class/tuple 
> with subfields being any other user-defined type, is because 
> ExpressionEncoder does not discover the implicit Encoder for the user-defined 
> field types, thus can not use any Encoder to serde the user-defined fields in 
> case class/tuple.
> To address this issue, we propose a trait as a contract(between 
> ExpressionEncoder and any other user-defined Encoder) to enable case 
> class/tuple/java bean's ExpressionEncoder to discover the 
> serializer/deserializer/schema from the Encoder of the user-defined type.
> With this proposed patch and our minor modification in AvroEncoder, we remove 
> these limitations with cluster-default conf 
> spark.expressionencoder.org.apache.avro.specific.SpecificRecord = 
> com.databricks.spark.avro.AvroEncoder$
> This is a patch we have implemented internally and has been used for a few 
> quarters. We want to propose to upstream as we think this is a useful feature 
> to make Dataset more flexible to user types.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-24256) ExpressionEncoder should support user-defined types as fields of Scala case class and tuple

2018-05-11 Thread Fangshi Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fangshi Li updated SPARK-24256:
---
Description: 
Right now, ExpressionEncoder supports ser/de of primitive types, as well as 
scala case class, tuple and java bean class. Spark's Dataset natively supports 
these mentioned types, but we find Dataset is not flexible for other 
user-defined types and encoders.

For example, spark-avro has an AvroEncoder for ser/de Avro types in Dataset. 
Although we can use AvroEncoder to define Dataset with types being the Avro 
Generic or Specific Record, using such Avro typed Dataset has many limitations: 
 1. We can not use joinWith on this Dataset since the result is a tuple, but 
Avro types cannot be the field of this tuple.
 2. We can not use some type-safe aggregation methods on this Dataset, such as 
KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
 3. We cannot augment an Avro SpecificRecord with additional primitive fields 
together in a case class, which we find is a very common use case.

The limitation that Spark does not support define a Scala case class/tuple with 
subfields being any other user-defined type, is because ExpressionEncoder does 
not discover the implicit Encoder for the user-defined field types, thus can 
not use any Encoder to serde the user-defined fields in case class/tuple.

To address this issue, we propose a trait as a contract(between 
ExpressionEncoder and any other user-defined Encoder) to enable case 
class/tuple/java bean's ExpressionEncoder to discover the 
serializer/deserializer/schema from the Encoder of the user-defined type.

With this proposed patch and our minor modification in AvroEncoder, we remove 
these limitations with cluster-default conf 
spark.expressionencoder.org.apache.avro.specific.SpecificRecord = 
com.databricks.spark.avro.AvroEncoder$

This is a patch we have implemented internally and has been used for a few 
quarters. We want to propose to upstream as we think this is a useful feature 
to make Dataset more flexible to user types.

 

  was:
Right now, ExpressionEncoder supports ser/de of primitive types, as well as 
scala case class, tuple and java bean class. Spark's Dataset natively supports 
these mentioned types, but we find it is not flexible for other user-defined 
types and encoders.

For example, spark-avro has an AvroEncoder for ser/de Avro types in Dataset. 
Although we can use AvroEncoder to define Dataset with types being the Avro 
Generic or Specific Record, using such Avro typed Dataset has many limitations: 
 1. We can not use joinWith on this Dataset since the result is a tuple, but 
Avro types cannot be the field of this tuple.
 2. We can not use some type-safe aggregation methods on this Dataset, such as 
KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
 3. We cannot augment an Avro SpecificRecord with additional primitive fields 
together in a case class, which we find is a very common use case.

The limitation that Spark does not support define a Scala case class/tuple with 
subfields being any other user-defined type, is because ExpressionEncoder does 
not discover the implicit Encoder for the user-defined field types, thus can 
not use any Encoder to serde the user-defined fields in case class/tuple.

To address this issue, we propose a trait as a contract(between 
ExpressionEncoder and any other user-defined Encoder) to enable case 
class/tuple/java bean's ExpressionEncoder to discover the 
serializer/deserializer/schema from the Encoder of the user-defined type.

With this proposed patch and our minor modification in AvroEncoder, we remove 
these limitations with cluster-default conf 
spark.expressionencoder.org.apache.avro.specific.SpecificRecord = 
com.databricks.spark.avro.AvroEncoder$

This is a patch we have implemented internally and has been used for a few 
quarters. We want to propose to upstream as we think this is a useful feature 
to make Dataset more flexible to user types.

 


> ExpressionEncoder should support user-defined types as fields of Scala case 
> class and tuple
> ---
>
> Key: SPARK-24256
> URL: https://issues.apache.org/jira/browse/SPARK-24256
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Fangshi Li
>Priority: Major
>
> Right now, ExpressionEncoder supports ser/de of primitive types, as well as 
> scala case class, tuple and java bean class. Spark's Dataset natively 
> supports these mentioned types, but we find Dataset is not flexible for other 
> user-defined types and encoders.
> For example, spark-avro has an AvroEncoder for ser/de Avro types in Dataset. 
> Although we can use AvroEncoder to define Dataset with types being the

[jira] [Updated] (SPARK-24256) ExpressionEncoder should support user-defined types as fields of Scala case class and tuple

2018-05-11 Thread Fangshi Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fangshi Li updated SPARK-24256:
---
Description: 
Right now, ExpressionEncoder supports ser/de of primitive types, as well as 
scala case class, tuple and java bean class. Spark's Dataset natively supports 
these mentioned types, but we find it is not flexible for other user-defined 
types and encoders.

For example, spark-avro has an AvroEncoder for ser/de Avro types in Dataset. 
Although we can use AvroEncoder to define Dataset with types being the Avro 
Generic or Specific Record, using such Avro typed Dataset has many limitations: 
 1. We can not use joinWith on this Dataset since the result is a tuple, but 
Avro types cannot be the field of this tuple.
 2. We can not use some type-safe aggregation methods on this Dataset, such as 
KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
 3. We cannot augment an Avro SpecificRecord with additional primitive fields 
together in a case class, which we find is a very common use case.

The limitation that Spark does not support define a Scala case class/tuple with 
subfields being any other user-defined type, is because ExpressionEncoder does 
not discover the implicit Encoder for the user-defined field types, thus can 
not use any Encoder to serde the user-defined fields in case class/tuple.

To address this issue, we propose a trait as a contract(between 
ExpressionEncoder and any other user-defined Encoder) to enable case 
class/tuple/java bean's ExpressionEncoder to discover the 
serializer/deserializer/schema from the Encoder of the user-defined type.

With this proposed patch and our minor modification in AvroEncoder, we remove 
these limitations with cluster-default conf 
spark.expressionencoder.org.apache.avro.specific.SpecificRecord = 
com.databricks.spark.avro.AvroEncoder$

This is a patch we have implemented internally and has been used for a few 
quarters. We want to propose to upstream as we think this is a useful feature 
to make Dataset more flexible to user types.

 

  was:
Right now, ExpressionEncoder supports ser/de of primitive types, as well as 
scala case class, tuple and java bean class. Spark's Dataset natively supports 
these mentioned types, but we find it is not flexible for other user-defined 
types and encoders.

For example, spark-avro has an AvroEncoder for ser/de Avro types in Dataset. 
Although we can use AvroEncoder to define Dataset with types being the Avro 
Generic or Specific Record, using such Avro typed Dataset has many limitations: 
1. We can not use joinWith on this Dataset since the result is a tuple, but 
Avro types cannot be the field of this tuple.
2. We can not use some type-safe aggregation methods on this Dataset, such as 
KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
3. We cannot augment an Avro SpecificRecord with additional primitive fields 
together in a case class, which we find is a very common use case.

The root cause for these limitations is that Spark does not support simply 
define a Scala case class/tuple with subfields being any other user-defined 
type, since ExpressionEncoder does not discover the implicit Encoder for the 
user-defined fields, thus can not use any Encoder to serde the user-defined 
fields in case class/tuple.

To address this issue, we propose a trait as a contract(between 
ExpressionEncoder and any other user-defined Encoder) to enable case 
class/tuple/java bean's ExpressionEncoder to discover the 
serializer/deserializer/schema from the Encoder of the user-defined type.

With this proposed patch and our minor modification in AvroEncoder, we make it 
work with conf 
spark.expressionencoder.org.apache.avro.specific.SpecificRecord = 
com.databricks.spark.avro.AvroEncoder$

This is a patch we have implemented internally and has been used for a few 
quarters. We want to propose to upstream as we think this is a useful feature 
to make Dataset more flexible to user types.

 


> ExpressionEncoder should support user-defined types as fields of Scala case 
> class and tuple
> ---
>
> Key: SPARK-24256
> URL: https://issues.apache.org/jira/browse/SPARK-24256
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Fangshi Li
>Priority: Major
>
> Right now, ExpressionEncoder supports ser/de of primitive types, as well as 
> scala case class, tuple and java bean class. Spark's Dataset natively 
> supports these mentioned types, but we find it is not flexible for other 
> user-defined types and encoders.
> For example, spark-avro has an AvroEncoder for ser/de Avro types in Dataset. 
> Although we can use AvroEncoder to define Dataset with types being the Avro 
> Generic or

[jira] [Created] (SPARK-24256) ExpressionEncoder should support user-defined types as fields of Scala case class and tuple

2018-05-11 Thread Fangshi Li (JIRA)

Fangshi Li created SPARK-24256:
--

 Summary: ExpressionEncoder should support user-defined types as 
fields of Scala case class and tuple
 Key: SPARK-24256
 URL: https://issues.apache.org/jira/browse/SPARK-24256
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Fangshi Li


Right now, ExpressionEncoder supports ser/de of primitive types, as well as 
scala case class, tuple and java bean class. Spark's Dataset natively supports 
these mentioned types, but we find it is not flexible for other user-defined 
types and encoders.

For example, spark-avro has an AvroEncoder for ser/de Avro types in Dataset. 
Although we can use AvroEncoder to define Dataset with types being the Avro 
Generic or Specific Record, using such Avro typed Dataset has many limitations: 
1. We can not use joinWith on this Dataset since the result is a tuple, but 
Avro types cannot be the field of this tuple.
2. We can not use some type-safe aggregation methods on this Dataset, such as 
KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
3. We cannot augment an Avro SpecificRecord with additional primitive fields 
together in a case class, which we find is a very common use case.

The root cause for these limitations is that Spark does not support simply 
define a Scala case class/tuple with subfields being any other user-defined 
type, since ExpressionEncoder does not discover the implicit Encoder for the 
user-defined fields, thus can not use any Encoder to serde the user-defined 
fields in case class/tuple.

To address this issue, we propose a trait as a contract(between 
ExpressionEncoder and any other user-defined Encoder) to enable case 
class/tuple/java bean's ExpressionEncoder to discover the 
serializer/deserializer/schema from the Encoder of the user-defined type.

With this proposed patch and our minor modification in AvroEncoder, we make it 
work with conf 
spark.expressionencoder.org.apache.avro.specific.SpecificRecord = 
com.databricks.spark.avro.AvroEncoder$

This is a patch we have implemented internally and has been used for a few 
quarters. We want to propose to upstream as we think this is a useful feature 
to make Dataset more flexible to user types.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-24174) Expose Hadoop config as part of /environment API

2018-05-11 Thread Nikolay Sokolov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikolay Sokolov updated SPARK-24174:

Description: 
Currently, UI or /environment API call of HistoryServer or WebUI exposes only 
system properties and SparkConf. However, in some cases when Spark is used in 
conjunction with Hadoop, it is useful to know Hadoop configuration properties. 
For example, HDFS or GS buffer sizes, hive metastore settings, and so on.

So it would be good to have hadoop properties being exposed in /environment 
API, for example:
{code:none}
GET .../application_1525395994996_5/environment
{
   "runtime": {"javaVersion": "1.8.0_131 (Oracle Corporation)", ...}
   "sparkProperties": ["java.io.tmpdir","/tmp", ...],
   "systemProperties": [["spark.yarn.jars", "local:/usr/lib/spark/jars/*"], 
...],
   "classpathEntries": [["/usr/lib/hadoop/hadoop-annotations.jar","System 
Classpath"], ...],
   "hadoopProperties": [["dfs.stream-buffer-size", 4096], ...],
}
{code}

  was:
Currently, /environment API call exposes only system properties and SparkConf. 
However, in some cases when Spark is used in conjunction with Hadoop, it is 
useful to know Hadoop configuration properties. For example, HDFS or GS buffer 
sizes, hive metastore settings, and so on.

So it would be good to have hadoop properties being exposed in /environment 
API, for example:
{code:none}
GET .../application_1525395994996_5/environment
{
   "runtime": {"javaVersion": "1.8.0_131 (Oracle Corporation)", ...}
   "sparkProperties": ["java.io.tmpdir","/tmp", ...],
   "systemProperties": [["spark.yarn.jars", "local:/usr/lib/spark/jars/*"], 
...],
   "classpathEntries": [["/usr/lib/hadoop/hadoop-annotations.jar","System 
Classpath"], ...],
   "hadoopProperties": [["dfs.stream-buffer-size", 4096], ...],
}
{code}


> Expose Hadoop config as part of /environment API
> 
>
> Key: SPARK-24174
> URL: https://issues.apache.org/jira/browse/SPARK-24174
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Nikolay Sokolov
>Priority: Minor
>  Labels: features, usability
>
> Currently, UI or /environment API call of HistoryServer or WebUI exposes only 
> system properties and SparkConf. However, in some cases when Spark is used in 
> conjunction with Hadoop, it is useful to know Hadoop configuration 
> properties. For example, HDFS or GS buffer sizes, hive metastore settings, 
> and so on.
> So it would be good to have hadoop properties being exposed in /environment 
> API, for example:
> {code:none}
> GET .../application_1525395994996_5/environment
> {
>"runtime": {"javaVersion": "1.8.0_131 (Oracle Corporation)", ...}
>"sparkProperties": ["java.io.tmpdir","/tmp", ...],
>"systemProperties": [["spark.yarn.jars", "local:/usr/lib/spark/jars/*"], 
> ...],
>"classpathEntries": [["/usr/lib/hadoop/hadoop-annotations.jar","System 
> Classpath"], ...],
>"hadoopProperties": [["dfs.stream-buffer-size", 4096], ...],
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-24174) Expose Hadoop config as part of /environment API

2018-05-11 Thread Nikolay Sokolov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikolay Sokolov updated SPARK-24174:

Description: 
Currently, /environment API call exposes only system properties and SparkConf. 
However, in some cases when Spark is used in conjunction with Hadoop, it is 
useful to know Hadoop configuration properties. For example, HDFS or GS buffer 
sizes, hive metastore settings, and so on.

So it would be good to have hadoop properties being exposed in /environment 
API, for example:
{code:none}
GET .../application_1525395994996_5/environment
{
   "runtime": {"javaVersion": "1.8.0_131 (Oracle Corporation)", ...}
   "sparkProperties": ["java.io.tmpdir","/tmp", ...],
   "systemProperties": [["spark.yarn.jars", "local:/usr/lib/spark/jars/*"], 
...],
   "classpathEntries": [["/usr/lib/hadoop/hadoop-annotations.jar","System 
Classpath"], ...],
   "hadoopProperties": [["dfs.stream-buffer-size", 4096], ...],
}
{code}

  was:
Currently, /environment API call exposes only system properties and SparkConf. 
However, in some cases when Spark is used in conjunction with Hadoop, it is 
useful to know Hadoop configuration properties. For example, HDFS or GS buffer 
sizes, hive metastore settings, and so on.

So it would be good to have hadoop properties being exposed in /environment 
API, for example:
{code:none}
GET .../application_1525395994996_5/environment
{
   "runtime": {"javaVersion": "1.8.0_131 (Oracle Corporation)", ...}
   "sparkProperties": ["java.io.tmpdir","/tmp", ...],
   "systemProperties": [["spark.yarn.jars", "local:/usr/lib/spark/jars/*"], 
...],
   "classpathEntries": [["/usr/lib/hadoop/hadoop-annotations.jar","System 
Classpath"], ...],
   "hadoopProperties": [["dfs.stream-buffer-size": 4096], ...],
}
{code}


> Expose Hadoop config as part of /environment API
> 
>
> Key: SPARK-24174
> URL: https://issues.apache.org/jira/browse/SPARK-24174
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Nikolay Sokolov
>Priority: Minor
>  Labels: features, usability
>
> Currently, /environment API call exposes only system properties and 
> SparkConf. However, in some cases when Spark is used in conjunction with 
> Hadoop, it is useful to know Hadoop configuration properties. For example, 
> HDFS or GS buffer sizes, hive metastore settings, and so on.
> So it would be good to have hadoop properties being exposed in /environment 
> API, for example:
> {code:none}
> GET .../application_1525395994996_5/environment
> {
>"runtime": {"javaVersion": "1.8.0_131 (Oracle Corporation)", ...}
>"sparkProperties": ["java.io.tmpdir","/tmp", ...],
>"systemProperties": [["spark.yarn.jars", "local:/usr/lib/spark/jars/*"], 
> ...],
>"classpathEntries": [["/usr/lib/hadoop/hadoop-annotations.jar","System 
> Classpath"], ...],
>"hadoopProperties": [["dfs.stream-buffer-size", 4096], ...],
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-24174) Expose Hadoop config as part of /environment API

2018-05-11 Thread Nikolay Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472884#comment-16472884
 ] 

Nikolay Sokolov commented on SPARK-24174:
-

[~jerryshao] as far a I understand, YARN exposes setting which are effective 
for RM server, but not effective settings for specific job. For example it 
would not include hive settings (as yarn does not load hive-site.xml), also it 
would be current values, not point-in-time for the job (which may not be 
convenient, if cluster is being tuned over time). Also, client may have 
different contents of HADOOP_CONF_DIR compared to YARN RM (consider desktop 
clients, multi-cluster edge nodes, programmatic clients working outside of 
cluster).

> Expose Hadoop config as part of /environment API
> 
>
> Key: SPARK-24174
> URL: https://issues.apache.org/jira/browse/SPARK-24174
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Nikolay Sokolov
>Priority: Minor
>  Labels: features, usability
>
> Currently, /environment API call exposes only system properties and 
> SparkConf. However, in some cases when Spark is used in conjunction with 
> Hadoop, it is useful to know Hadoop configuration properties. For example, 
> HDFS or GS buffer sizes, hive metastore settings, and so on.
> So it would be good to have hadoop properties being exposed in /environment 
> API, for example:
> {code:none}
> GET .../application_1525395994996_5/environment
> {
>"runtime": {"javaVersion": "1.8.0_131 (Oracle Corporation)", ...}
>"sparkProperties": ["java.io.tmpdir","/tmp", ...],
>"systemProperties": [["spark.yarn.jars", "local:/usr/lib/spark/jars/*"], 
> ...],
>"classpathEntries": [["/usr/lib/hadoop/hadoop-annotations.jar","System 
> Classpath"], ...],
>"hadoopProperties": [["dfs.stream-buffer-size": 4096], ...],
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-24255) Require Java 8 in SparkR description

2018-05-11 Thread Shivaram Venkataraman (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472817#comment-16472817
 ] 

Shivaram Venkataraman commented on SPARK-24255:
---

Resolved by https://github.com/apache/spark/pull/21278

> Require Java 8 in SparkR description
> 
>
> Key: SPARK-24255
> URL: https://issues.apache.org/jira/browse/SPARK-24255
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Shivaram Venkataraman
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> CRAN checks require that the Java version be set both in package description 
> and checked during runtime.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-24255) Require Java 8 in SparkR description

2018-05-11 Thread Shivaram Venkataraman (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-24255.
---
   Resolution: Fixed
 Assignee: Shivaram Venkataraman
Fix Version/s: 2.4.0
   2.3.1

> Require Java 8 in SparkR description
> 
>
> Key: SPARK-24255
> URL: https://issues.apache.org/jira/browse/SPARK-24255
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Shivaram Venkataraman
>Assignee: Shivaram Venkataraman
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> CRAN checks require that the Java version be set both in package description 
> and checked during runtime.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-24255) Require Java 8 in SparkR description

2018-05-11 Thread Shivaram Venkataraman (JIRA)

Shivaram Venkataraman created SPARK-24255:
-

 Summary: Require Java 8 in SparkR description
 Key: SPARK-24255
 URL: https://issues.apache.org/jira/browse/SPARK-24255
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.3.0
Reporter: Shivaram Venkataraman


CRAN checks require that the Java version be set both in package description 
and checked during runtime.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-23907) Support regr_* functions

2018-05-11 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472797#comment-16472797
 ] 

Apache Spark commented on SPARK-23907:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/21309

> Support regr_* functions
> 
>
> Key: SPARK-23907
> URL: https://issues.apache.org/jira/browse/SPARK-23907
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.4.0
>
>
> https://issues.apache.org/jira/browse/HIVE-15978
> {noformat}
> Support the standard regr_* functions, regr_slope, regr_intercept, regr_r2, 
> regr_sxx, regr_syy, regr_sxy, regr_avgx, regr_avgy, regr_count. SQL reference 
> section 10.9
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-24254) Eagerly evaluate some subqueries over LocalRelation

2018-05-11 Thread Henry Robinson (JIRA)

Henry Robinson created SPARK-24254:
--

 Summary: Eagerly evaluate some subqueries over LocalRelation
 Key: SPARK-24254
 URL: https://issues.apache.org/jira/browse/SPARK-24254
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Henry Robinson


Some queries would benefit from evaluating subqueries over {{LocalRelations}} 
eagerly. For example:

{code}
SELECT t1.part_col FROM t1 JOIN (SELECT max(part_col) m FROM t2) foo WHERE 
t1.part_col = foo.m
{code}

If {{max(part_col)}} could be evaluated during planning, there's an opportunity 
to prune all but at most one partitions from the scan of {{t1}}. 

Similarly, a near-identical query with a non-scalar subquery in the {{WHERE}} 
clause:

{code}
SELECT * FROM t1 WHERE part_col IN (SELECT part_col FROM t2)
{code}

could be partially evaluated to eliminate some partitions, and remove the join 
from the plan. 

Obviously all subqueries over local relations can't be evaluated during 
planning, but certain whitelisted aggregates could be if the input cardinality 
isn't too high. 




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-22594) Handling spark-submit and master version mismatch

2018-05-11 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-22594:
--

Assignee: (was: Marcelo Vanzin)

> Handling spark-submit and master version mismatch
> -
>
> Key: SPARK-22594
> URL: https://issues.apache.org/jira/browse/SPARK-22594
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell, Spark Submit
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Jiri Kremser
>Priority: Minor
>
> When using spark-submit in different version than the remote Spark master, 
> the execution fails on during the message deserialization with this log entry 
> / exception:
> {code}
> Error while invoking RpcHandler#receive() for one-way message.
> java.io.InvalidClassException: org.apache.spark.rpc.RpcEndpointRef; local 
> class incompatible: stream classdesc serialVersionUID = 1835832137613908542, 
> local class serialVersionUID = -1329125091869941550
>   at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1843)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1713)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1843)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1713)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2000)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:108)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1$$anonfun$apply$1.apply(NettyRpcEnv.scala:271)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:320)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1.apply(NettyRpcEnv.scala:270)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:269)
>   at 
> org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:604)
>   at 
> org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:655)
>   at 
> org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:647)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:209)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:114)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
> ...
> {code}
> This is quite ok and it can be read as version mismatch between the client 
> and server, however there is no such a message on the client (spark-submit) 
> side, so if the submitter doesn't have an access to the spark master or spark 
> UI, there is no way to figure out what is wrong. 
> I propose sending a {{RpcFailure}} message back from server to client with 
> some more informative error. I'd use the {{OneWayMessage}} instead of 
> {{RpcFailure}}, because there was no counterpart {{RpcRequest}}, but I had no 
> luck sending it using the {{reverseClient.send()}}. I think some internal 
> protocol is assumed when sending messages server2client.
> I have a patch prepared.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-22594) Handling spark-submit and master version mismatch

2018-05-11 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-22594:
--

Assignee: Marcelo Vanzin

> Handling spark-submit and master version mismatch
> -
>
> Key: SPARK-22594
> URL: https://issues.apache.org/jira/browse/SPARK-22594
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell, Spark Submit
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Jiri Kremser
>Assignee: Marcelo Vanzin
>Priority: Minor
>
> When using spark-submit in different version than the remote Spark master, 
> the execution fails on during the message deserialization with this log entry 
> / exception:
> {code}
> Error while invoking RpcHandler#receive() for one-way message.
> java.io.InvalidClassException: org.apache.spark.rpc.RpcEndpointRef; local 
> class incompatible: stream classdesc serialVersionUID = 1835832137613908542, 
> local class serialVersionUID = -1329125091869941550
>   at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1843)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1713)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1843)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1713)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2000)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:108)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1$$anonfun$apply$1.apply(NettyRpcEnv.scala:271)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:320)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1.apply(NettyRpcEnv.scala:270)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:269)
>   at 
> org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:604)
>   at 
> org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:655)
>   at 
> org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:647)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:209)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:114)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
> ...
> {code}
> This is quite ok and it can be read as version mismatch between the client 
> and server, however there is no such a message on the client (spark-submit) 
> side, so if the submitter doesn't have an access to the spark master or spark 
> UI, there is no way to figure out what is wrong. 
> I propose sending a {{RpcFailure}} message back from server to client with 
> some more informative error. I'd use the {{OneWayMessage}} instead of 
> {{RpcFailure}}, because there was no counterpart {{RpcRequest}}, but I had no 
> luck sending it using the {{reverseClient.send()}}. I think some internal 
> protocol is assumed when sending messages server2client.
> I have a patch prepared.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-24253) DataSourceV2: Add DeleteSupport for delete and overwrite operations

2018-05-11 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24253:


Assignee: Apache Spark

> DataSourceV2: Add DeleteSupport for delete and overwrite operations
> ---
>
> Key: SPARK-24253
> URL: https://issues.apache.org/jira/browse/SPARK-24253
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Assignee: Apache Spark
>Priority: Major
>
> Implementing delete and overwrite logical plans requires an API to delete 
> data from a data source.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-24253) DataSourceV2: Add DeleteSupport for delete and overwrite operations

2018-05-11 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472741#comment-16472741
 ] 

Apache Spark commented on SPARK-24253:
--

User 'rdblue' has created a pull request for this issue:
https://github.com/apache/spark/pull/21308

> DataSourceV2: Add DeleteSupport for delete and overwrite operations
> ---
>
> Key: SPARK-24253
> URL: https://issues.apache.org/jira/browse/SPARK-24253
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> Implementing delete and overwrite logical plans requires an API to delete 
> data from a data source.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-24253) DataSourceV2: Add DeleteSupport for delete and overwrite operations

2018-05-11 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24253:


Assignee: (was: Apache Spark)

> DataSourceV2: Add DeleteSupport for delete and overwrite operations
> ---
>
> Key: SPARK-24253
> URL: https://issues.apache.org/jira/browse/SPARK-24253
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> Implementing delete and overwrite logical plans requires an API to delete 
> data from a data source.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-24253) DataSourceV2: Add DeleteSupport for delete and overwrite operations

2018-05-11 Thread Ryan Blue (JIRA)

Ryan Blue created SPARK-24253:
-

 Summary: DataSourceV2: Add DeleteSupport for delete and overwrite 
operations
 Key: SPARK-24253
 URL: https://issues.apache.org/jira/browse/SPARK-24253
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Ryan Blue


Implementing delete and overwrite logical plans requires an API to delete data 
from a data source.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-24186) add array reverse and concat

2018-05-11 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472718#comment-16472718
 ] 

Apache Spark commented on SPARK-24186:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/21307

> add array reverse and concat 
> -
>
> Key: SPARK-24186
> URL: https://issues.apache.org/jira/browse/SPARK-24186
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add R versions of https://issues.apache.org/jira/browse/SPARK-23736 and 
> https://issues.apache.org/jira/browse/SPARK-23926
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-22232) Row objects in pyspark created using the `Row(**kwars)` syntax do not get serialized/deserialized properly

2018-05-11 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472714#comment-16472714
 ] 

Bryan Cutler commented on SPARK-22232:
--

I'm closing the PR for now, will reopen for Spark 3.0.0.  The fix makes the 
behavior consistent but might cause a breaking change for some users.  It's not 
necessary to put the fix in and safeguard with a config because there are 
workarounds.  For example, this is a workaround for the code in the description:

{code}
from pyspark.sql.types import *
from pyspark.sql import *

UnsortedRow = Row("a", "c", "b")

def toRow(i):
return UnsortedRow("1", 3.0, 2)

schema = StructType([
  StructField("a", StringType(),  False),
  StructField("c", FloatType(), False),
  StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

print rdd.repartition(3).toDF(schema).take(2)
# [Row(a=u'1', c=3.0, b=2), Row(a=u'1', c=3.0, b=2)]
{code}

> Row objects in pyspark created using the `Row(**kwars)` syntax do not get 
> serialized/deserialized properly
> --
>
> Key: SPARK-22232
> URL: https://issues.apache.org/jira/browse/SPARK-22232
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>Priority: Major
>
> The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should 
> be accessed by field name, not by position because {{Row.__new__}} sorts the 
> fields alphabetically by name. It seems like this promise is not being 
> honored when these Row objects are shuffled. I've included an example to help 
> reproduce the issue.
> {code:none}
> from pyspark.sql.types import *
> from pyspark.sql import *
> def toRow(i):
>   return Row(a="a", c=3.0, b=2)
> schema = StructType([
>   # Putting fields in alphabetical order masks the issue
>   StructField("a", StringType(),  False),
>   StructField("c", FloatType(), False),
>   StructField("b", IntegerType(), False),
> ])
> rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))
> # As long as we don't shuffle things work fine.
> print rdd.toDF(schema).take(2)
> # If we introduce a shuffle we have issues
> print rdd.repartition(3).toDF(schema).take(2)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-24252) DataSourceV2: Add catalog support

2018-05-11 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472686#comment-16472686
 ] 

Apache Spark commented on SPARK-24252:
--

User 'rdblue' has created a pull request for this issue:
https://github.com/apache/spark/pull/21306

> DataSourceV2: Add catalog support
> -
>
> Key: SPARK-24252
> URL: https://issues.apache.org/jira/browse/SPARK-24252
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
> Fix For: 2.4.0
>
>
> DataSourceV2 needs to support create and drop catalog operations in order to 
> support logical plans like CTAS.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-24252) DataSourceV2: Add catalog support

2018-05-11 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24252:


Assignee: Apache Spark

> DataSourceV2: Add catalog support
> -
>
> Key: SPARK-24252
> URL: https://issues.apache.org/jira/browse/SPARK-24252
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Assignee: Apache Spark
>Priority: Major
> Fix For: 2.4.0
>
>
> DataSourceV2 needs to support create and drop catalog operations in order to 
> support logical plans like CTAS.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-24252) DataSourceV2: Add catalog support

2018-05-11 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24252:


Assignee: (was: Apache Spark)

> DataSourceV2: Add catalog support
> -
>
> Key: SPARK-24252
> URL: https://issues.apache.org/jira/browse/SPARK-24252
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
> Fix For: 2.4.0
>
>
> DataSourceV2 needs to support create and drop catalog operations in order to 
> support logical plans like CTAS.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-24252) DataSourceV2: Add catalog support

2018-05-11 Thread Ryan Blue (JIRA)

Ryan Blue created SPARK-24252:
-

 Summary: DataSourceV2: Add catalog support
 Key: SPARK-24252
 URL: https://issues.apache.org/jira/browse/SPARK-24252
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Ryan Blue
 Fix For: 2.4.0


DataSourceV2 needs to support create and drop catalog operations in order to 
support logical plans like CTAS.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-23321) DataSourceV2 should apply some validation when writing.

2018-05-11 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472675#comment-16472675
 ] 

Ryan Blue commented on SPARK-23321:
---

I've closed the PR associated with this because validating writes is going to 
get rolled into the new logical plans. I'm linking the new logical plan issues 
as dependencies of this one.

> DataSourceV2 should apply some validation when writing.
> ---
>
> Key: SPARK-23321
> URL: https://issues.apache.org/jira/browse/SPARK-23321
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> DataSourceV2 writes are not validated. These writes should be validated using 
> the standard preprocess rules that are used for Hive and DataSource tables.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-24251) DataSourceV2: Add AppendData logical operation

2018-05-11 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472655#comment-16472655
 ] 

Apache Spark commented on SPARK-24251:
--

User 'rdblue' has created a pull request for this issue:
https://github.com/apache/spark/pull/21305

> DataSourceV2: Add AppendData logical operation
> --
>
> Key: SPARK-24251
> URL: https://issues.apache.org/jira/browse/SPARK-24251
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
> Fix For: 2.4.0
>
>
> The SPIP to standardize SQL logical plans (SPARK-23521) proposes AppendData 
> for inserting data in append mode. This is the simplest plan to implement 
> first.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-24251) DataSourceV2: Add AppendData logical operation

2018-05-11 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24251:


Assignee: Apache Spark

> DataSourceV2: Add AppendData logical operation
> --
>
> Key: SPARK-24251
> URL: https://issues.apache.org/jira/browse/SPARK-24251
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Assignee: Apache Spark
>Priority: Major
> Fix For: 2.4.0
>
>
> The SPIP to standardize SQL logical plans (SPARK-23521) proposes AppendData 
> for inserting data in append mode. This is the simplest plan to implement 
> first.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-24251) DataSourceV2: Add AppendData logical operation

2018-05-11 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24251:


Assignee: (was: Apache Spark)

> DataSourceV2: Add AppendData logical operation
> --
>
> Key: SPARK-24251
> URL: https://issues.apache.org/jira/browse/SPARK-24251
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
> Fix For: 2.4.0
>
>
> The SPIP to standardize SQL logical plans (SPARK-23521) proposes AppendData 
> for inserting data in append mode. This is the simplest plan to implement 
> first.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-24251) DataSourceV2: Add AppendData logical operation

2018-05-11 Thread Ryan Blue (JIRA)

Ryan Blue created SPARK-24251:
-

 Summary: DataSourceV2: Add AppendData logical operation
 Key: SPARK-24251
 URL: https://issues.apache.org/jira/browse/SPARK-24251
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Ryan Blue
 Fix For: 2.4.0


The SPIP to standardize SQL logical plans (SPARK-23521) proposes AppendData for 
inserting data in append mode. This is the simplest plan to implement first.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-24232) Allow referring to kubernetes secrets as env variable

2018-05-11 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472581#comment-16472581
 ] 

Stavros Kontopoulos commented on SPARK-24232:
-

Cool makes sense.

> Allow referring to kubernetes secrets as env variable
> -
>
> Key: SPARK-24232
> URL: https://issues.apache.org/jira/browse/SPARK-24232
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Dharmesh Kakadia
>Priority: Major
>
> Allow referring to kubernetes secrets in the driver process via environment 
> variables. This will allow developers to use secretes without leaking them in 
> the code and at the same time secrets can be decoupled and managed 
> separately. This can be used to refer to passwords, certificates etc while 
> talking to other service (jdbc passwords, storage keys etc).
> So, at the deployment time, something like 
> ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified 
> which will make [EnvName].[key] available as an environment variable and in 
> the code its always referred as env variable [key].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-24232) Allow referring to kubernetes secrets as env variable

2018-05-11 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472572#comment-16472572
 ] 

Stavros Kontopoulos edited comment on SPARK-24232 at 5/11/18 8:12 PM:
--

Ok I understand that need for users not to be surprised or we shouldnt break 
UX. A name though like spark.kubernetes.driver.secretKeyRef.SomeName is not 
self-explanatory as it does not add much from a semantics perspective compared 
to spark.kubernetes.driver.secrets.SomeName. I know secretKeyRef is used also 
in the Pods spec for env secrets, but personally I dont see it really that 
readable, but ok for people used to k8s might ring a bell fast ;) . Btw yeah 
executor secrets need fix as well. Will proceed with a new property, thanx.


was (Author: skonto):
Ok I understand that need for users not to be surprised or we shouldnt break 
UX. A name though like spark.kubernetes.driver.secretKeyRef.SomeName is not 
self-explanatory as it does not add much from a semantics perspective compared 
to spark.kubernetes.driver.secrets.SomeName. I know secretKeyRef is used also 
in the Pods spec for env secrets, but personally I dont see it really that 
readable, but ok for people used to k8s might ring a bell fast ;) . Btw yeah 
executor secrets need fix as well. Will proceed with a new property.

> Allow referring to kubernetes secrets as env variable
> -
>
> Key: SPARK-24232
> URL: https://issues.apache.org/jira/browse/SPARK-24232
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Dharmesh Kakadia
>Priority: Major
>
> Allow referring to kubernetes secrets in the driver process via environment 
> variables. This will allow developers to use secretes without leaking them in 
> the code and at the same time secrets can be decoupled and managed 
> separately. This can be used to refer to passwords, certificates etc while 
> talking to other service (jdbc passwords, storage keys etc).
> So, at the deployment time, something like 
> ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified 
> which will make [EnvName].[key] available as an environment variable and in 
> the code its always referred as env variable [key].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-24232) Allow referring to kubernetes secrets as env variable

2018-05-11 Thread Yinan Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472574#comment-16472574
 ] 

Yinan Li commented on SPARK-24232:
--

As long as we document it clearly what is for, I think it's OK, particularly 
given that `secretKeyRef` is a well-known field name used by k8s.

> Allow referring to kubernetes secrets as env variable
> -
>
> Key: SPARK-24232
> URL: https://issues.apache.org/jira/browse/SPARK-24232
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Dharmesh Kakadia
>Priority: Major
>
> Allow referring to kubernetes secrets in the driver process via environment 
> variables. This will allow developers to use secretes without leaking them in 
> the code and at the same time secrets can be decoupled and managed 
> separately. This can be used to refer to passwords, certificates etc while 
> talking to other service (jdbc passwords, storage keys etc).
> So, at the deployment time, something like 
> ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified 
> which will make [EnvName].[key] available as an environment variable and in 
> the code its always referred as env variable [key].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-24232) Allow referring to kubernetes secrets as env variable

2018-05-11 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472572#comment-16472572
 ] 

Stavros Kontopoulos edited comment on SPARK-24232 at 5/11/18 8:08 PM:
--

Ok I understand that need for users not to be surprised or we shouldnt break 
UX. A name though like spark.kubernetes.driver.secretKeyRef.SomeName is not 
self-explanatory as it does not add much from a semantics perspective compared 
to spark.kubernetes.driver.secrets.SomeName. I know secretKeyRef is used also 
in the Pods spec for env secrets, but personally I dont see it really that 
readable, but ok for people used to k8s might ring a bell fast ;) . Btw yeah 
executor secrets need fix as well. Will proceed with a new property.


was (Author: skonto):
Ok I understand that need for users not to be surprised or we shouldnt break 
UX. A name though like spark.kubernetes.driver.secretKeyRef.SomeName is not 
self-explanatory as it does not add much from a semantics perspective compared 
to spark.kubernetes.driver.secrets.SomeName. I know secretKeyRef is used also 
in the Pods spec for env secrets, but personally I dont see it really that 
readable, but ok for people used to k8s might ring a bell fast ;) . Btw yeah 
executor secrets need fix as well.

> Allow referring to kubernetes secrets as env variable
> -
>
> Key: SPARK-24232
> URL: https://issues.apache.org/jira/browse/SPARK-24232
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Dharmesh Kakadia
>Priority: Major
>
> Allow referring to kubernetes secrets in the driver process via environment 
> variables. This will allow developers to use secretes without leaking them in 
> the code and at the same time secrets can be decoupled and managed 
> separately. This can be used to refer to passwords, certificates etc while 
> talking to other service (jdbc passwords, storage keys etc).
> So, at the deployment time, something like 
> ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified 
> which will make [EnvName].[key] available as an environment variable and in 
> the code its always referred as env variable [key].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-24232) Allow referring to kubernetes secrets as env variable

2018-05-11 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472572#comment-16472572
 ] 

Stavros Kontopoulos commented on SPARK-24232:
-

Ok I understand that need for users not to be surprised or break something. A 
name though like spark.kubernetes.driver.secretKeyRef.SomeName is not 
self-explanatory as it does not add much from a semantics perspective to 
spark.kubernetes.driver.secrets.SomeName. I know secretKeyRef is used also in 
the Pods spec for env secrets, but personally I dont see it really that 
readable, but ok for people used to k8s might ring a bell fast ;) . Btw yeah 
executor secrets need fix as well.

> Allow referring to kubernetes secrets as env variable
> -
>
> Key: SPARK-24232
> URL: https://issues.apache.org/jira/browse/SPARK-24232
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Dharmesh Kakadia
>Priority: Major
>
> Allow referring to kubernetes secrets in the driver process via environment 
> variables. This will allow developers to use secretes without leaking them in 
> the code and at the same time secrets can be decoupled and managed 
> separately. This can be used to refer to passwords, certificates etc while 
> talking to other service (jdbc passwords, storage keys etc).
> So, at the deployment time, something like 
> ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified 
> which will make [EnvName].[key] available as an environment variable and in 
> the code its always referred as env variable [key].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-24232) Allow referring to kubernetes secrets as env variable

2018-05-11 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472572#comment-16472572
 ] 

Stavros Kontopoulos edited comment on SPARK-24232 at 5/11/18 8:07 PM:
--

Ok I understand that need for users not to be surprised or we shouldnt break 
UX. A name though like spark.kubernetes.driver.secretKeyRef.SomeName is not 
self-explanatory as it does not add much from a semantics perspective compared 
to spark.kubernetes.driver.secrets.SomeName. I know secretKeyRef is used also 
in the Pods spec for env secrets, but personally I dont see it really that 
readable, but ok for people used to k8s might ring a bell fast ;) . Btw yeah 
executor secrets need fix as well.


was (Author: skonto):
Ok I understand that need for users not to be surprised or break something. A 
name though like spark.kubernetes.driver.secretKeyRef.SomeName is not 
self-explanatory as it does not add much from a semantics perspective to 
spark.kubernetes.driver.secrets.SomeName. I know secretKeyRef is used also in 
the Pods spec for env secrets, but personally I dont see it really that 
readable, but ok for people used to k8s might ring a bell fast ;) . Btw yeah 
executor secrets need fix as well.

> Allow referring to kubernetes secrets as env variable
> -
>
> Key: SPARK-24232
> URL: https://issues.apache.org/jira/browse/SPARK-24232
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Dharmesh Kakadia
>Priority: Major
>
> Allow referring to kubernetes secrets in the driver process via environment 
> variables. This will allow developers to use secretes without leaking them in 
> the code and at the same time secrets can be decoupled and managed 
> separately. This can be used to refer to passwords, certificates etc while 
> talking to other service (jdbc passwords, storage keys etc).
> So, at the deployment time, something like 
> ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified 
> which will make [EnvName].[key] available as an environment variable and in 
> the code its always referred as env variable [key].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-24232) Allow referring to kubernetes secrets as env variable

2018-05-11 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472551#comment-16472551
 ] 

Stavros Kontopoulos edited comment on SPARK-24232 at 5/11/18 7:57 PM:
--

[~dharmesh.kakadia] I am working on adding something like the following:

spark.kubernetes.\{driver,executor}.secrets.envKey.[EnvSecretName]=

spark.kubernetes.\{driver, 
executor}.secrets.mountKey.[mountSecretName]=

By default everything starting with prefix 
`spark.kubernetes.\{driver,executor}.secrets.` is now  considered a mount 
secret.

[~liyinan926] sounds ok?


was (Author: skonto):
[~dharmesh.kakadia] I am working on adding something like the following:

spark.kubernetes.\{driver,executor}.secrets.envKey.[EnvSecretName]=

spark.kubernetes.\{driver, 
executor}.secrets.mountKey.[mountSecretName]=

By default everything starting with prefix `spark.kubernetes.driver.secrets.` 
is now  considered a mount secret.

[~liyinan926] sounds ok?

> Allow referring to kubernetes secrets as env variable
> -
>
> Key: SPARK-24232
> URL: https://issues.apache.org/jira/browse/SPARK-24232
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Dharmesh Kakadia
>Priority: Major
>
> Allow referring to kubernetes secrets in the driver process via environment 
> variables. This will allow developers to use secretes without leaking them in 
> the code and at the same time secrets can be decoupled and managed 
> separately. This can be used to refer to passwords, certificates etc while 
> talking to other service (jdbc passwords, storage keys etc).
> So, at the deployment time, something like 
> ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified 
> which will make [EnvName].[key] available as an environment variable and in 
> the code its always referred as env variable [key].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-24232) Allow referring to kubernetes secrets as env variable

2018-05-11 Thread Yinan Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472561#comment-16472561
 ] 

Yinan Li edited comment on SPARK-24232 at 5/11/18 7:55 PM:
---

We should keep the current semantics of 
`spark.kubernetes.driver.secrets.=`. The proposal you have 
above is likely confusing to existing users who already use 
`spark.kubernetes.driver.secrets.=`. It also makes the code 
unnecessarily complicated. Like what I said on Slack, it's better to do this 
through a new property prefix, e.g., `spark.kubernetes.driver.secretKeyRef.`. 
We also need the same for executors. See 
[http://spark.apache.org/docs/latest/running-on-kubernetes.html#secret-management].


was (Author: liyinan926):
We should keep the current semantics of 
`spark.kubernetes.driver.secrets.=`. The proposal you have 
above is a breaking change for existing users who already use 
`spark.kubernetes.driver.secrets.=`. Like what I said on 
Slack, it's better to do this through a new property prefix, e.g., 
`spark.kubernetes.driver.secretKeyRef.`. We also need the same for executors. 
See 
http://spark.apache.org/docs/latest/running-on-kubernetes.html#secret-management.

> Allow referring to kubernetes secrets as env variable
> -
>
> Key: SPARK-24232
> URL: https://issues.apache.org/jira/browse/SPARK-24232
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Dharmesh Kakadia
>Priority: Major
>
> Allow referring to kubernetes secrets in the driver process via environment 
> variables. This will allow developers to use secretes without leaking them in 
> the code and at the same time secrets can be decoupled and managed 
> separately. This can be used to refer to passwords, certificates etc while 
> talking to other service (jdbc passwords, storage keys etc).
> So, at the deployment time, something like 
> ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified 
> which will make [EnvName].[key] available as an environment variable and in 
> the code its always referred as env variable [key].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-24232) Allow referring to kubernetes secrets as env variable

2018-05-11 Thread Yinan Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472561#comment-16472561
 ] 

Yinan Li commented on SPARK-24232:
--

We should keep the current semantics of 
`spark.kubernetes.driver.secrets.=`. The proposal you have 
above is a breaking change for existing users who already use 
`spark.kubernetes.driver.secrets.=`. Like what I said on 
Slack, it's better to do this through a new property prefix, e.g., 
`spark.kubernetes.driver.secretKeyRef.`. We also need the same for executors. 
See 
http://spark.apache.org/docs/latest/running-on-kubernetes.html#secret-management.

> Allow referring to kubernetes secrets as env variable
> -
>
> Key: SPARK-24232
> URL: https://issues.apache.org/jira/browse/SPARK-24232
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Dharmesh Kakadia
>Priority: Major
>
> Allow referring to kubernetes secrets in the driver process via environment 
> variables. This will allow developers to use secretes without leaking them in 
> the code and at the same time secrets can be decoupled and managed 
> separately. This can be used to refer to passwords, certificates etc while 
> talking to other service (jdbc passwords, storage keys etc).
> So, at the deployment time, something like 
> ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified 
> which will make [EnvName].[key] available as an environment variable and in 
> the code its always referred as env variable [key].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-24232) Allow referring to kubernetes secrets as env variable

2018-05-11 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472551#comment-16472551
 ] 

Stavros Kontopoulos edited comment on SPARK-24232 at 5/11/18 7:52 PM:
--

[~dharmesh.kakadia] I am working on adding something like the following:

spark.kubernetes.\{driver,executor}.secrets.envKey.[EnvSecretName]=

spark.kubernetes.\{driver, 
executor}.secrets.mountKey.[mountSecretName]=

By default everything starting with prefix `spark.kubernetes.driver.secrets.` 
is now  considered a mount secret.

[~liyinan926] sounds ok?


was (Author: skonto):
[~dharmesh.kakadia] I am working on adding something like the following:

spark.kubernetes.driver.secrets.envKey.[EnvSecretName]=

spark.kubernetes.driver.secrets.mountKey.[mountSecretName]=

By default everything starting with prefix `spark.kubernetes.driver.secrets.` 
is now  considered a mount secret.

[~liyinan926] sounds ok?

> Allow referring to kubernetes secrets as env variable
> -
>
> Key: SPARK-24232
> URL: https://issues.apache.org/jira/browse/SPARK-24232
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Dharmesh Kakadia
>Priority: Major
>
> Allow referring to kubernetes secrets in the driver process via environment 
> variables. This will allow developers to use secretes without leaking them in 
> the code and at the same time secrets can be decoupled and managed 
> separately. This can be used to refer to passwords, certificates etc while 
> talking to other service (jdbc passwords, storage keys etc).
> So, at the deployment time, something like 
> ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified 
> which will make [EnvName].[key] available as an environment variable and in 
> the code its always referred as env variable [key].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-24232) Allow referring to kubernetes secrets as env variable

2018-05-11 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472551#comment-16472551
 ] 

Stavros Kontopoulos edited comment on SPARK-24232 at 5/11/18 7:50 PM:
--

[~dharmesh.kakadia] I am working on adding something like the following:

spark.kubernetes.driver.secrets.envKey.[EnvSecretName]=

spark.kubernetes.driver.secrets.mountKey.[mountSecretName]=

By default everything starting with prefix `spark.kubernetes.driver.secrets.` 
is now  considered a mount secret.

[~liyinan926] sounds ok?


was (Author: skonto):
[~dharmesh.kakadia] I am working on adding something like the following:

spark.kubernetes.driver.secrets.envKey.[EnvSecretName]=

spark.kubernetes.driver.secrets.mountKey.[mountSecretName]=

By default everything starting with prefix `spark.kubernetes.driver.secrets.` 
is now  considered a mount secret.

[~liyinan926] From a first glance I don't see any executor secrets, no need for 
them? 

> Allow referring to kubernetes secrets as env variable
> -
>
> Key: SPARK-24232
> URL: https://issues.apache.org/jira/browse/SPARK-24232
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Dharmesh Kakadia
>Priority: Major
>
> Allow referring to kubernetes secrets in the driver process via environment 
> variables. This will allow developers to use secretes without leaking them in 
> the code and at the same time secrets can be decoupled and managed 
> separately. This can be used to refer to passwords, certificates etc while 
> talking to other service (jdbc passwords, storage keys etc).
> So, at the deployment time, something like 
> ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified 
> which will make [EnvName].[key] available as an environment variable and in 
> the code its always referred as env variable [key].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-24232) Allow referring to kubernetes secrets as env variable

2018-05-11 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472551#comment-16472551
 ] 

Stavros Kontopoulos edited comment on SPARK-24232 at 5/11/18 7:48 PM:
--

[~dharmesh.kakadia] I am working on adding something like the following:

spark.kubernetes.driver.secrets.envKey.[EnvSecretName]=

spark.kubernetes.driver.secrets.mountKey.[mountSecretName]=

By default everything starting with prefix `spark.kubernetes.driver.secrets.` 
is now  considered a mount secret.

[~liyinan926] From a first glance I don't see any executor secrets, no need for 
them? 


was (Author: skonto):
[~dharmesh.kakadia] I am working on adding something like the following:

spark.kubernetes.driver.secrets.envKey.[EnvSecretName]=

spark.kubernetes.driver.secrets.mountKey.[mountSecretName]=

By default everything under prefix spark.kubernetes.driver.secrets. is now 
mount secrets.

[~liyinan926] From a first glance I don't see any executor secrets, no need for 
them? 

> Allow referring to kubernetes secrets as env variable
> -
>
> Key: SPARK-24232
> URL: https://issues.apache.org/jira/browse/SPARK-24232
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Dharmesh Kakadia
>Priority: Major
>
> Allow referring to kubernetes secrets in the driver process via environment 
> variables. This will allow developers to use secretes without leaking them in 
> the code and at the same time secrets can be decoupled and managed 
> separately. This can be used to refer to passwords, certificates etc while 
> talking to other service (jdbc passwords, storage keys etc).
> So, at the deployment time, something like 
> ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified 
> which will make [EnvName].[key] available as an environment variable and in 
> the code its always referred as env variable [key].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-24232) Allow referring to kubernetes secrets as env variable

2018-05-11 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472551#comment-16472551
 ] 

Stavros Kontopoulos edited comment on SPARK-24232 at 5/11/18 7:47 PM:
--

[~dharmesh.kakadia] I am working on adding something like the following:

spark.kubernetes.driver.secrets.envKey.[EnvSecretName]=

spark.kubernetes.driver.secrets.mountKey.[mountSecretName]=

By default everything under prefix spark.kubernetes.driver.secrets. is now 
mount secrets.

[~liyinan926] From a first glance I don't see any executor secrets, no need for 
them? 


was (Author: skonto):
[~dharmesh.kakadia] I am working on adding something like the following:

spark.kubernetes.driver.secrets.envKey.[EnvSecretName]=

spark.kubernetes.driver.secrets.mountKey.[mountSecretName]=

By default everything under prefix spark.kubernetes.driver.secrets. is now 
mount secrets.

> Allow referring to kubernetes secrets as env variable
> -
>
> Key: SPARK-24232
> URL: https://issues.apache.org/jira/browse/SPARK-24232
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Dharmesh Kakadia
>Priority: Major
>
> Allow referring to kubernetes secrets in the driver process via environment 
> variables. This will allow developers to use secretes without leaking them in 
> the code and at the same time secrets can be decoupled and managed 
> separately. This can be used to refer to passwords, certificates etc while 
> talking to other service (jdbc passwords, storage keys etc).
> So, at the deployment time, something like 
> ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified 
> which will make [EnvName].[key] available as an environment variable and in 
> the code its always referred as env variable [key].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-24232) Allow referring to kubernetes secrets as env variable

2018-05-11 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472551#comment-16472551
 ] 

Stavros Kontopoulos edited comment on SPARK-24232 at 5/11/18 7:44 PM:
--

[~dharmesh.kakadia] I am working on adding something like the following:

spark.kubernetes.driver.secrets.envKey.[EnvSecretName]=

spark.kubernetes.driver.secrets.mountKey.[mountSecretName]=

By default everything under prefix spark.kubernetes.driver.secrets. is now 
mount secrets.


was (Author: skonto):
[~dharmesh.kakadia] I am working on adding something like the following:

spark.kubernetes.driver.secrets.envKey.[EnvSecretName]=

spark.kubernetes.driver.secrets.mountKey.[mountSecretName]=

> Allow referring to kubernetes secrets as env variable
> -
>
> Key: SPARK-24232
> URL: https://issues.apache.org/jira/browse/SPARK-24232
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Dharmesh Kakadia
>Priority: Major
>
> Allow referring to kubernetes secrets in the driver process via environment 
> variables. This will allow developers to use secretes without leaking them in 
> the code and at the same time secrets can be decoupled and managed 
> separately. This can be used to refer to passwords, certificates etc while 
> talking to other service (jdbc passwords, storage keys etc).
> So, at the deployment time, something like 
> ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified 
> which will make [EnvName].[key] available as an environment variable and in 
> the code its always referred as env variable [key].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-24232) Allow referring to kubernetes secrets as env variable

2018-05-11 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472551#comment-16472551
 ] 

Stavros Kontopoulos commented on SPARK-24232:
-

[~dharmesh.kakadia] I am working on adding something like the following:

spark.kubernetes.driver.secrets.envKey.[EnvSecretName]=

spark.kubernetes.driver.secrets.mountKey.[mountSecretName]=

> Allow referring to kubernetes secrets as env variable
> -
>
> Key: SPARK-24232
> URL: https://issues.apache.org/jira/browse/SPARK-24232
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Dharmesh Kakadia
>Priority: Major
>
> Allow referring to kubernetes secrets in the driver process via environment 
> variables. This will allow developers to use secretes without leaking them in 
> the code and at the same time secrets can be decoupled and managed 
> separately. This can be used to refer to passwords, certificates etc while 
> talking to other service (jdbc passwords, storage keys etc).
> So, at the deployment time, something like 
> ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified 
> which will make [EnvName].[key] available as an environment variable and in 
> the code its always referred as env variable [key].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-10145) Executor exit without useful messages when spark runs in spark-streaming

2018-05-11 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-10145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-10145.

Resolution: Unresolved

I'm closing this since a lot in this area has changed since this bug was 
reported. If issues like this still happen they're bound to look a lot 
different, so we'd need new (and more complete) logs to debug them.

> Executor exit without useful messages when spark runs in spark-streaming
> 
>
> Key: SPARK-10145
> URL: https://issues.apache.org/jira/browse/SPARK-10145
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, YARN
> Environment: spark 1.3.1, hadoop 2.6.0, 6 nodes, each node has 32 
> cores and 32g memory  
>Reporter: Baogang Wang
>Priority: Critical
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Each node is allocated 30g memory by Yarn.
> My application receives messages from Kafka by directstream. Each application 
> consists of 4 dstream window
> Spark application is submitted by this command:
> spark-submit --class spark_security.safe.SafeSockPuppet  --driver-memory 3g 
> --executor-memory 3g --num-executors 3 --executor-cores 4  --name 
> safeSparkDealerUser --master yarn  --deploy-mode cluster  
> spark_Security-1.0-SNAPSHOT.jar.nocalse 
> hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/spark_properties/safedealer.properties
> After about 1 hours, some executor exits. There is no more yarn logs after 
> the executor exits and there is no stack when the executor exits.
> When I see the yarn node manager log, it shows as follows :
> 2015-08-17 17:25:41,550 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Start request for container_1439803298368_0005_01_01 by user root
> 2015-08-17 17:25:41,551 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Creating a new application reference for app application_1439803298368_0005
> 2015-08-17 17:25:41,551 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root   
> IP=172.19.160.102   OPERATION=Start Container Request   
> TARGET=ContainerManageImpl  RESULT=SUCCESS  
> APPID=application_1439803298368_0005
> CONTAINERID=container_1439803298368_0005_01_01
> 2015-08-17 17:25:41,551 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
>  Application application_1439803298368_0005 transitioned from NEW to INITING
> 2015-08-17 17:25:41,552 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
>  Adding container_1439803298368_0005_01_01 to application 
> application_1439803298368_0005
> 2015-08-17 17:25:41,557 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
>  rollingMonitorInterval is set as -1. The log rolling mornitoring interval is 
> disabled. The logs will be aggregated after this application is finished.
> 2015-08-17 17:25:41,663 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
>  Application application_1439803298368_0005 transitioned from INITING to 
> RUNNING
> 2015-08-17 17:25:41,664 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1439803298368_0005_01_01 transitioned from NEW to 
> LOCALIZING
> 2015-08-17 17:25:41,664 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got 
> event CONTAINER_INIT for appId application_1439803298368_0005
> 2015-08-17 17:25:41,664 INFO 
> org.apache.spark.network.yarn.YarnShuffleService: Initializing container 
> container_1439803298368_0005_01_01
> 2015-08-17 17:25:41,665 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource:
>  Resource 
> hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/user/root/.sparkStaging/application_1439803298368_0005/spark-assembly-1.3.1-hadoop2.6.0.jar
>  transitioned from INIT to DOWNLOADING
> 2015-08-17 17:25:41,665 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource:
>  Resource 
> hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/user/root/.sparkStaging/application_1439803298368_0005/spark_Security-1.0-SNAPSHOT.jar
>  transitioned from INIT to DOWNLOADING
> 2015-08-17 17:25:41,665 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Created localizer for container_1439803298368_0005_01_01
> 2015-08-17 17:25:41,668 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Writing credentials to the nmPrivate file 
> /export/servers/hadoop2.6.0/tmp/nm-local-dir/nmPrivate/container_1439803298368_0005_01_01.t

[jira] [Commented] (SPARK-23931) High-order function: zip(array1, array2[, ...]) → array

2018-05-11 Thread Dylan Guedes (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472514#comment-16472514
 ] 

Dylan Guedes commented on SPARK-23931:
--

[~mn-mikke] I updated with a working version! Would you mind in giving a 
feedback/suggestion? I've decided to use an array of structs since Java doesn't 
handle well Scala Tuple2's, but to be fair I'm not sure if it is the best 
choice.

> High-order function: zip(array1, array2[, ...]) → array
> 
>
> Key: SPARK-23931
> URL: https://issues.apache.org/jira/browse/SPARK-23931
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Merges the given arrays, element-wise, into a single array of rows. The M-th 
> element of the N-th argument will be the N-th field of the M-th output 
> element. If the arguments have an uneven length, missing values are filled 
> with NULL.
> {noformat}
> SELECT zip(ARRAY[1, 2], ARRAY['1b', null, '3b']); -- [ROW(1, '1b'), ROW(2, 
> null), ROW(null, '3b')]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-23852) Parquet MR bug can lead to incorrect SQL results

2018-05-11 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472494#comment-16472494
 ] 

Apache Spark commented on SPARK-23852:
--

User 'henryr' has created a pull request for this issue:
https://github.com/apache/spark/pull/21302

> Parquet MR bug can lead to incorrect SQL results
> 
>
> Key: SPARK-23852
> URL: https://issues.apache.org/jira/browse/SPARK-23852
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Henry Robinson
>Assignee: Ryan Blue
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.4.0
>
>
> Parquet MR 1.9.0 and 1.8.2 both have a bug, PARQUET-1217, that means that 
> pushing certain predicates to Parquet scanners can return fewer results than 
> they should.
> The bug triggers in Spark when:
>  * The Parquet file being scanner has stats for the null count, but not the 
> max or min on the column with the predicate (Apache Impala writes files like 
> this).
>  * The vectorized Parquet reader path is not taken, and the parquet-mr reader 
> is used.
>  * A suitable <, <=, > or >= predicate is pushed down to Parquet.
> The bug is that the parquet-mr interprets the max and min of a row-group's 
> column as 0 in the absence of stats. So {{col > 0}} will filter all results, 
> even if some are > 0.
> There is no upstream release of Parquet that contains the fix for 
> PARQUET-1217, although a 1.10 release is planned.
> The least impactful workaround is to set the Parquet configuration 
> {{parquet.filter.stats.enabled}} to {{false}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-13007) Document where configuration / properties are read and applied

2018-05-11 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-13007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-13007:
---
Priority: Major  (was: Critical)

> Document where configuration / properties are read and applied
> --
>
> Key: SPARK-13007
> URL: https://issues.apache.org/jira/browse/SPARK-13007
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.6.0
>Reporter: Alan Braithwaite
>Priority: Major
>
> While spark is well documented for the most part, often times I have trouble 
> determining where a configuration applies.
> For example, when setting spark.dynamicAllocation.enabled , does it 
> always apply to the entire cluster manager, or is it possible to configure it 
> on a per-job level?
> Different levels I can think of:
> Application
> Driver
> Executor
> Worker
> Cluster
> And I'm sure there are more.  This could be just another column in the 
> configuration page.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-9139) Add backwards-compatibility tests for DataType.fromJson()

2018-05-11 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-9139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-9139:
--
Priority: Major  (was: Critical)

> Add backwards-compatibility tests for DataType.fromJson()
> -
>
> Key: SPARK-9139
> URL: https://issues.apache.org/jira/browse/SPARK-9139
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Josh Rosen
>Priority: Major
>
> SQL's DataType.fromJson is a public API and thus must be 
> backwards-compatible; there are also backwards-compatibility concerns related 
> to persistence of DataType JSON in metastores.
> Unfortunately, we do not have any backwards-compatibility tests which attempt 
> to read old JSON values that were written by earlier versions of Spark.  
> DataTypeSuite has "roundtrip" tests that test fromJson(toJson(foo)), but this 
> doesn't ensure compatibility.
> I think that we should address this by capuring the JSON strings produced in 
> Spark 1.3's DataFrameSuite and adding test cases that try to create DataTypes 
> from those strings.
> This might be a good starter task for someone who wants to contribute to SQL 
> tests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-8487) Update reduceByKeyAndWindow docs to highlight that filtering Function must be used

2018-05-11 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-8487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-8487:
--
Priority: Major  (was: Critical)

> Update reduceByKeyAndWindow docs to highlight that filtering Function must be 
> used
> --
>
> Key: SPARK-8487
> URL: https://issues.apache.org/jira/browse/SPARK-8487
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-3528) Reading data from file:/// should be called NODE_LOCAL not PROCESS_LOCAL

2018-05-11 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-3528:
--
Priority: Major  (was: Critical)

> Reading data from file:/// should be called NODE_LOCAL not PROCESS_LOCAL
> 
>
> Key: SPARK-3528
> URL: https://issues.apache.org/jira/browse/SPARK-3528
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Ash
>Priority: Major
>
> Note that reading from {{file:///.../pom.xml}} is called a PROCESS_LOCAL task
> {noformat}
> spark> sc.textFile("pom.xml").count
> ...
> 14/09/15 00:59:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
> localhost, PROCESS_LOCAL, 1191 bytes)
> 14/09/15 00:59:13 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 
> localhost, PROCESS_LOCAL, 1191 bytes)
> 14/09/15 00:59:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
> 14/09/15 00:59:13 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
> 14/09/15 00:59:13 INFO HadoopRDD: Input split: 
> file:/Users/aash/git/spark/pom.xml:20862+20863
> 14/09/15 00:59:13 INFO HadoopRDD: Input split: 
> file:/Users/aash/git/spark/pom.xml:0+20862
> {noformat}
> There is an outstanding TODO in {{HadoopRDD.scala}} that may be related:
> {noformat}
>   override def getPreferredLocations(split: Partition): Seq[String] = {
> // TODO: Filtering out "localhost" in case of file:// URLs
> val hadoopSplit = split.asInstanceOf[HadoopPartition]
> hadoopSplit.inputSplit.value.getLocations.filter(_ != "localhost")
>   }
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-21758) `SHOW TBLPROPERTIES` can not get properties start with spark.sql.*

2018-05-11 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-21758:
---
Priority: Major  (was: Critical)

> `SHOW TBLPROPERTIES` can not get properties start with spark.sql.*
> --
>
> Key: SPARK-21758
> URL: https://issues.apache.org/jira/browse/SPARK-21758
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: StanZhai
>Priority: Major
>
> SQL: SHOW TBLPROPERTIES test_tb("spark.sql.sources.schema.numParts")
> Exception: Table test_db.test.tb does not have property: 
> spark.sql.sources.schema.numParts
> The `spark.sql.sources.schema.numParts` property exactly exists in 
> HiveMetastore.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-24067) Backport SPARK-17147 to 2.3 (Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction))

2018-05-11 Thread Cody Koeninger (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger resolved SPARK-24067.

   Resolution: Fixed
Fix Version/s: 2.3.1

Issue resolved by pull request 21300
[https://github.com/apache/spark/pull/21300]

> Backport SPARK-17147 to 2.3 (Spark Streaming Kafka 0.10 Consumer Can't Handle 
> Non-consecutive Offsets (i.e. Log Compaction))
> 
>
> Key: SPARK-24067
> URL: https://issues.apache.org/jira/browse/SPARK-24067
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.3.0
>Reporter: Joachim Hereth
>Assignee: Cody Koeninger
>Priority: Major
> Fix For: 2.3.1
>
>
> SPARK-17147 fixes a problem with non-consecutive Kafka Offsets. The  [PR 
> w|https://github.com/apache/spark/pull/20572]as merged to {{master}}. This 
> should be backported to 2.3.
>  
> Original Description from SPARK-17147 :
>  
> When Kafka does log compaction offsets often end up with gaps, meaning the 
> next requested offset will be frequently not be offset+1. The logic in 
> KafkaRDD & CachedKafkaConsumer has a baked in assumption that the next offset 
> will always be just an increment of 1 above the previous offset.
> I have worked around this problem by changing CachedKafkaConsumer to use the 
> returned record's offset, from:
>  {{nextOffset = offset + 1}}
>  to:
>  {{nextOffset = record.offset + 1}}
> and changed KafkaRDD from:
>  {{requestOffset += 1}}
>  to:
>  {{requestOffset = r.offset() + 1}}
> (I also had to change some assert logic in CachedKafkaConsumer).
> There's a strong possibility that I have misconstrued how to use the 
> streaming kafka consumer, and I'm happy to close this out if that's the case. 
> If, however, it is supposed to support non-consecutive offsets (e.g. due to 
> log compaction) I am also happy to contribute a PR.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-23771) Uneven Rowgroup size after repartition

2018-05-11 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-23771:
---
Priority: Major  (was: Critical)

> Uneven Rowgroup size after repartition
> --
>
> Key: SPARK-23771
> URL: https://issues.apache.org/jira/browse/SPARK-23771
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, Shuffle, SQL
>Affects Versions: 1.6.0
> Environment: Cloudera CDH 5.13.1
>Reporter: Johannes Mayer
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> I have a Hive table on AVRO files, that i want to read and store as a 
> partitioned parquet files (one file per partition).
> What i do is:
> {code:java}
> // read the AVRO table and distribute by the partition column
> val data = sql("select * from avro_table distribute by part_col")
>  
> // write data as partitioned parquet files
> data.write.partitionBy(part_col).parquet("output/path/")
> {code}
>  
> I get one file per partition as expected. But often I run into OutOfMemory 
> Errors. Investigating the issue I found out, that some row groups are very 
> big and since all data of a row group is held in memory before it is flushed 
> to disk, i think this causes the OutOfMemory. Other row groups are very 
> small, containing almost no data. See the output from parquet-tools meta:
>  
> {code:java}
> row group 1: RC:5740100 TS:566954562 OFFSET:4 
> row group 2: RC:33769 TS:2904145 OFFSET:117971092 
> row group 3: RC:31822 TS:2772650 OFFSET:118905225 
> row group 4: RC:29854 TS:2704127 OFFSET:119793188 
> row group 5: RC:28050 TS:2356729 OFFSET:120660675 
> row group 6: RC:26507 TS:2111983 OFFSET:121406541 
> row group 7: RC:25143 TS:1967731 OFFSET:122069351 
> row group 8: RC:23876 TS:1991238 OFFSET:122682160 
> row group 9: RC:22584 TS:2069463 OFFSET:123303246 
> row group 10: RC:21225 TS:1955748 OFFSET:123960700 
> row group 11: RC:19960 TS:1931889 OFFSET:124575333 
> row group 12: RC:18806 TS:1725871 OFFSET:125132862 
> row group 13: RC:17719 TS:1653309 OFFSET:125668057 
> row group 14: RC:1617743 TS:157973949 OFFSET:134217728{code}
>  
> One thing to notice is, that this file was written in a Spark application 
> running on 13 executors. Is it possible, that local data is in the big row 
> group and the remote reads go into seperate (small) row groups? The shuffle 
> is involved, because data is read with distribute by clause.
>  
> Is this a known bug? Is there a workaround to get even row group sizes? I 
> want to decrease the row group size using 
> sc.hadoopConfiguration.setInt("parquet.block.size", 64 * 1024 * 1024)
>  
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-23606) Flakey FileBasedDataSourceSuite

2018-05-11 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-23606:
---
Priority: Major  (was: Critical)

> Flakey FileBasedDataSourceSuite
> ---
>
> Key: SPARK-23606
> URL: https://issues.apache.org/jira/browse/SPARK-23606
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Henry Robinson
>Priority: Major
>
> I've seen the following exception twice today in PR builds (one example: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87978/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/).
>  It's not deterministic, as I've had one PR build pass in the same span.
> {code:java}
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 10.016101897 
> seconds. Last failure message: There are 1 possibly leaked file streams..
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:30)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:308)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:30)
>   at 
> org.apache.spark.sql.test.SharedSparkSession$class.afterEach(SharedSparkSession.scala:114)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.afterEach(FileBasedDataSourceSuite.scala:30)
>   at 
> org.scalatest.BeforeAndAfterEach$$anonfun$1.apply$mcV$sp(BeforeAndAfterEach.scala:234)
>   at 
> org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:379)
>   at 
> org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:375)
>   at org.scalatest.SucceededStatus$.whenCompleted(Status.scala:454)
>   at org.scalatest.Status$class.withAfterEffect(Status.scala:375)
>   at org.scalatest.SucceededStatus$.withAfterEffect(Status.scala:426)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:232)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.runTest(FileBasedDataSourceSuite.scala:30)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
>   at org.scalatest.Suite$class.run(Suite.scala:1147)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:52)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: sbt.ForkMain$ForkError: java.lang.IllegalStateException: There are 
> 1 possibly leaked file streams.
>   at 
> org.apache.spark.DebugFilesy

[jira] [Resolved] (SPARK-24229) Upgrade to the latest Apache Thrift 0.10.0 release

2018-05-11 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-24229.

Resolution: Not A Problem

That affects the "Apache Thrift Go client library", which is not used by Spark.

> Upgrade to the latest Apache Thrift 0.10.0 release
> --
>
> Key: SPARK-24229
> URL: https://issues.apache.org/jira/browse/SPARK-24229
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.3.0
>Reporter: Ray Donnelly
>Priority: Critical
>
> According to [https://www.cvedetails.com/cve/CVE-2016-5397/]
>  
> .. there are critical vulnerabilities in libthrift 0.9.3 currently vendored 
> in Apache Spark (and then, for us, into PySpark).
>  
> Can anyone help to assess the seriousness of this and what should be done 
> about it?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics

2018-05-11 Thread Edwina Lu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472405#comment-16472405
 ] 

Edwina Lu commented on SPARK-23206:
---

The design discussion for SPARK-23206 is scheduled for Monday, May 15 at 11am 
PDT (6pm UTC). 

https://linkedin.bluejeans.com/1886759322/

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: ExecutorsTab.png, ExecutorsTab2.png, 
> MemoryTuningMetricsDesignDoc.pdf, SPARK-23206 Design Doc.pdf, StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-20922) Unsafe deserialization in Spark LauncherConnection

2018-05-11 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472393#comment-16472393
 ] 

Marcelo Vanzin commented on SPARK-20922:


You should also be able to use just the spark-launcher library from a 2.x 
version to launch Spark 1.6 jobs, without having to update the rest of the 
Spark dependencies.

> Unsafe deserialization in Spark LauncherConnection
> --
>
> Key: SPARK-20922
> URL: https://issues.apache.org/jira/browse/SPARK-20922
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Aditya Sharad
>Assignee: Marcelo Vanzin
>Priority: Major
>  Labels: security
> Fix For: 2.0.3, 2.1.2, 2.2.0, 2.3.0
>
> Attachments: spark-deserialize-master.zip
>
>
> The {{run()}} method of the class 
> {{org.apache.spark.launcher.LauncherConnection}} performs unsafe 
> deserialization of data received by its socket. This makes Spark applications 
> launched programmatically using the {{SparkLauncher}} framework potentially 
> vulnerable to remote code execution by an attacker with access to any user 
> account on the local machine. Such an attacker could send a malicious 
> serialized Java object to multiple ports on the local machine, and if this 
> port matches the one (randomly) chosen by the Spark launcher, the malicious 
> object will be deserialized. By making use of gadget chains in code present 
> on the Spark application classpath, the deserialization process can lead to 
> RCE or privilege escalation.
> This vulnerability is identified by the “Unsafe deserialization” rule on 
> lgtm.com:
> https://lgtm.com/projects/g/apache/spark/snapshot/80fdc2c9d1693f5b3402a79ca4ec76f6e422ff13/files/launcher/src/main/java/org/apache/spark/launcher/LauncherConnection.java#V58
>  
> Attached is a proof-of-concept exploit involving a simple 
> {{SparkLauncher}}-based application and a known gadget chain in the Apache 
> Commons Beanutils library referenced by Spark.
> See the readme file for demonstration instructions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-24233) union operation on read of dataframe does nor produce correct result

2018-05-11 Thread smohr003 (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472326#comment-16472326
 ] 

smohr003 commented on SPARK-24233:
--

added

> union operation on read of dataframe does nor produce correct result 
> -
>
> Key: SPARK-24233
> URL: https://issues.apache.org/jira/browse/SPARK-24233
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: smohr003
>Priority: Major
>
> I know that I can use wild card * to read all subfolders. But, I am trying to 
> use .par and .schema to speed up the read process. 
> val absolutePath = "adl://datalakename.azuredatalakestore.net/testU/"
> Seq((1, "one"), (2, "two")).toDF("k", 
> "v").write.mode("overwrite").parquet(absolutePath + "1")
>  Seq((11, "one"), (22, "two")).toDF("k", 
> "v").write.mode("overwrite").parquet(absolutePath + "2")
>  Seq((111, "one"), (222, "two")).toDF("k", 
> "v").write.mode("overwrite").parquet(absolutePath + "3")
>  Seq((, "one"), (, "two")).toDF("k", 
> "v").write.mode("overwrite").parquet(absolutePath + "4")
>  Seq((2, "one"), (2, "two")).toDF("k", 
> "v").write.mode("overwrite").parquet(absolutePath + "5")
>  
> import org.apache.hadoop.conf.Configuration
>  import org.apache.hadoop.fs.\{FileSystem, Path}
>  import java.net.URI
>  def readDir(path: String): DataFrame =
> { val fs = FileSystem.get(new URI(path), new Configuration()) val subDir = 
> fs.listStatus(new Path(path)).map(i => i.getPath.toString) var df = 
> spark.read.parquet(subDir.head) val dfSchema = df.schema 
> subDir.tail.par.foreach(p => df = 
> df.union(spark.read.schema(dfSchema).parquet(p)).select(df.columns.head, 
> df.columns.tail:_*)) df }
> val dfAll = readDir(absolutePath)
>  dfAll.count
>  The count of produced dfAll is 4, which in this example should be 10. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-22918) sbt test (spark - local) fail after upgrading to 2.2.1 with: java.security.AccessControlException: access denied org.apache.derby.security.SystemPermission( "engine",

2018-05-11 Thread Mihaly Toth (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472298#comment-16472298
 ] 

Mihaly Toth commented on SPARK-22918:
-

Yep, probably anybody who introduces a SecurityManager needs to grant the above 
permission. I am just wondering why such implementation details are propagating 
up in the architecture. Is there any added level of security in addition to the 
file system level security below derby? Should not there be at least a 
configuration option to disable such security checks?

> sbt test (spark - local) fail after upgrading to 2.2.1 with: 
> java.security.AccessControlException: access denied 
> org.apache.derby.security.SystemPermission( "engine", "usederbyinternals" )
> 
>
> Key: SPARK-22918
> URL: https://issues.apache.org/jira/browse/SPARK-22918
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Damian Momot
>Priority: Major
>
> After upgrading 2.2.0 -> 2.2.1 sbt test command in one of my projects started 
> to fail with following exception:
> {noformat}
> java.security.AccessControlException: access denied 
> org.apache.derby.security.SystemPermission( "engine", "usederbyinternals" )
>   at 
> java.security.AccessControlContext.checkPermission(AccessControlContext.java:472)
>   at 
> java.security.AccessController.checkPermission(AccessController.java:884)
>   at 
> org.apache.derby.iapi.security.SecurityUtil.checkDerbyInternalsPrivilege(Unknown
>  Source)
>   at org.apache.derby.iapi.services.monitor.Monitor.startMonitor(Unknown 
> Source)
>   at org.apache.derby.iapi.jdbc.JDBCBoot$1.run(Unknown Source)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at org.apache.derby.iapi.jdbc.JDBCBoot.boot(Unknown Source)
>   at org.apache.derby.iapi.jdbc.JDBCBoot.boot(Unknown Source)
>   at org.apache.derby.jdbc.EmbeddedDriver.boot(Unknown Source)
>   at org.apache.derby.jdbc.EmbeddedDriver.(Unknown Source)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at java.lang.Class.newInstance(Class.java:442)
>   at 
> org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:47)
>   at 
> org.datanucleus.store.rdbms.connectionpool.BoneCPConnectionPoolFactory.createConnectionPool(BoneCPConnectionPoolFactory.java:54)
>   at 
> org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:238)
>   at 
> org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:131)
>   at 
> org.datanucleus.store.rdbms.ConnectionFactoryImpl.(ConnectionFactoryImpl.java:85)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
>   at 
> org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:325)
>   at 
> org.datanucleus.store.AbstractStoreManager.registerConnectionFactory(AbstractStoreManager.java:282)
>   at 
> org.datanucleus.store.AbstractStoreManager.(AbstractStoreManager.java:240)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager.(RDBMSStoreManager.java:286)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
>   at 
> org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301)
>   at 
> org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187

[jira] [Commented] (SPARK-21569) Internal Spark class needs to be kryo-registered

2018-05-11 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472295#comment-16472295
 ] 

Ted Yu commented on SPARK-21569:


What would be workaround ?

Thanks

> Internal Spark class needs to be kryo-registered
> 
>
> Key: SPARK-21569
> URL: https://issues.apache.org/jira/browse/SPARK-21569
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Ryan Williams
>Priority: Major
>
> [Full repro here|https://github.com/ryan-williams/spark-bugs/tree/hf]
> As of 2.2.0, {{saveAsNewAPIHadoopFile}} jobs fail (when 
> {{spark.kryo.registrationRequired=true}}) with:
> {code}
> java.lang.IllegalArgumentException: Class is not registered: 
> org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage
> Note: To register this class use: 
> kryo.register(org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage.class);
>   at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:458)
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79)
>   at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:488)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:593)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:315)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:383)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> This internal Spark class should be kryo-registered by Spark by default.
> This was not a problem in 2.1.1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-24172) we should not apply operator pushdown to data source v2 many times

2018-05-11 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24172.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

> we should not apply operator pushdown to data source v2 many times
> --
>
> Key: SPARK-24172
> URL: https://issues.apache.org/jira/browse/SPARK-24172
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-20922) Unsafe deserialization in Spark LauncherConnection

2018-05-11 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472231#comment-16472231
 ] 

Marcelo Vanzin commented on SPARK-20922:


I think Spark 1.6 at this point is considered EOL by the community; there are 
no more planned releases for that line that I know of.

> Unsafe deserialization in Spark LauncherConnection
> --
>
> Key: SPARK-20922
> URL: https://issues.apache.org/jira/browse/SPARK-20922
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Aditya Sharad
>Assignee: Marcelo Vanzin
>Priority: Major
>  Labels: security
> Fix For: 2.0.3, 2.1.2, 2.2.0, 2.3.0
>
> Attachments: spark-deserialize-master.zip
>
>
> The {{run()}} method of the class 
> {{org.apache.spark.launcher.LauncherConnection}} performs unsafe 
> deserialization of data received by its socket. This makes Spark applications 
> launched programmatically using the {{SparkLauncher}} framework potentially 
> vulnerable to remote code execution by an attacker with access to any user 
> account on the local machine. Such an attacker could send a malicious 
> serialized Java object to multiple ports on the local machine, and if this 
> port matches the one (randomly) chosen by the Spark launcher, the malicious 
> object will be deserialized. By making use of gadget chains in code present 
> on the Spark application classpath, the deserialization process can lead to 
> RCE or privilege escalation.
> This vulnerability is identified by the “Unsafe deserialization” rule on 
> lgtm.com:
> https://lgtm.com/projects/g/apache/spark/snapshot/80fdc2c9d1693f5b3402a79ca4ec76f6e422ff13/files/launcher/src/main/java/org/apache/spark/launcher/LauncherConnection.java#V58
>  
> Attached is a proof-of-concept exploit involving a simple 
> {{SparkLauncher}}-based application and a known gadget chain in the Apache 
> Commons Beanutils library referenced by Spark.
> See the readme file for demonstration instructions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-24220) java.lang.NullPointerException at org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:83)

2018-05-11 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472195#comment-16472195
 ] 

Kazuaki Ishizaki commented on SPARK-24220:
--

Thank you for reporting an issue. Would it be possible to post standalone 
reproduable program? This program seems to connect to an external database or 
something thru {{DriverManager.getConnection(adminUrl)}}.

> java.lang.NullPointerException at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:83)
> 
>
> Key: SPARK-24220
> URL: https://issues.apache.org/jira/browse/SPARK-24220
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.2.0
>Reporter: joy-m
>Priority: Major
>
> def getInputStream(rows:Iterator[Row]): PipedInputStream ={
>  printMem("before gen string")
>  val pipedOutputStream = new PipedOutputStream()
>  (new Thread() {
>  override def run(){
>  if(rows == null){
>  logError("rows is null==>")
>  }else{
>  println(s"record-start-${rows.length}")
>  try {
>  while (rows.hasNext) {
>  val row = rows.next()
>  println(row)
>  val str = row.mkString("\001") + "\r\n"
>  println(str)
>  pipedOutputStream.write(str.getBytes(StandardCharsets.UTF_8))
>  }
>  println("record-end-")
>  pipedOutputStream.close()
>  } catch {
>  case ex:Exception =>
>  ex.printStackTrace()
>  }
>  }
>  }
>  }).start()
>  println("pipedInPutStream--")
>  val pipedInPutStream = new PipedInputStream()
>  pipedInPutStream.connect(pipedOutputStream)
>  println("pipedInPutStream--- conn---")
>  printMem("after gen string")
>  pipedInPutStream
> }
> resDf.coalesce(15).foreachPartition(rows=>{
>  if(rows == null){
>  logError("rows is null=>")
>  }else{
>  val copyCmd = s"COPY ${tableName} FROM STDIN with DELIMITER as '\001' NULL 
> as 'null string'"
>  var con: Connection = null
>  try {
>  con = DriverManager.getConnection(adminUrl)
>  val copyManager = new CopyManager(con.asInstanceOf[BaseConnection])
>  val start = System.currentTimeMillis()
>  var count: Long = 0
>  var copyCount: Long = 0
>  println("before copyManager=>")
>  copyCount += copyManager.copyIn(copyCmd, getInputStream(rows))
>  println("after copyManager=>")
>  val finish = System.currentTimeMillis()
>  println("copyCount:" + copyCount + " count:" + count + " time(s):" + (finish 
> - start) / 1000)
>  con.close()
>  } catch {
>  case ex:Exception =>
>  ex.printStackTrace()
>  println(s"copyIn error!${ex.toString}")
>  } finally {
>  try {
>  if (con != null) {
>  con.close()
>  }
>  } catch {
>  case ex:SQLException =>
>  ex.printStackTrace()
>  println(s"copyIn error!${ex.toString}")
>  }
>  }
>  }
>  
> 18/05/09 13:31:30 ERROR util.SparkUncaughtExceptionHandler: Uncaught 
> exception in thread Thread[Thread-4,5,main]
> java.lang.NullPointerException
>  at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:83)
>  at org.apache.spark.sql.execution.SortExec.createSorter(SortExec.scala:87)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.init(Unknown
>  Source)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8.apply(WholeStageCodegenExec.scala:392)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8.apply(WholeStageCodegenExec.scala:389)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
>  at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
>  at 
> org.apache.spark.storage.BlockManager$

[jira] [Assigned] (SPARK-24228) Fix the lint error

2018-05-11 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24228:


Assignee: Apache Spark

> Fix the lint error
> --
>
> Key: SPARK-24228
> URL: https://issues.apache.org/jira/browse/SPARK-24228
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Minor
>
> [ERROR] 
> src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[21,8]
>  (imports) UnusedImports: Unused import - java.io.ByteArrayInputStream.
> [ERROR] 
> src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java:[29,8]
>  (imports) UnusedImports: Unused import - org.apache.spark.unsafe.Platform.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-24228) Fix the lint error

2018-05-11 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24228:


Assignee: (was: Apache Spark)

> Fix the lint error
> --
>
> Key: SPARK-24228
> URL: https://issues.apache.org/jira/browse/SPARK-24228
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Priority: Minor
>
> [ERROR] 
> src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[21,8]
>  (imports) UnusedImports: Unused import - java.io.ByteArrayInputStream.
> [ERROR] 
> src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java:[29,8]
>  (imports) UnusedImports: Unused import - org.apache.spark.unsafe.Platform.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-24228) Fix the lint error

2018-05-11 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472187#comment-16472187
 ] 

Apache Spark commented on SPARK-24228:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21301

> Fix the lint error
> --
>
> Key: SPARK-24228
> URL: https://issues.apache.org/jira/browse/SPARK-24228
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Priority: Minor
>
> [ERROR] 
> src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[21,8]
>  (imports) UnusedImports: Unused import - java.io.ByteArrayInputStream.
> [ERROR] 
> src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java:[29,8]
>  (imports) UnusedImports: Unused import - org.apache.spark.unsafe.Platform.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-23931) High-order function: zip(array1, array2[, ...]) → array

2018-05-11 Thread Marek Novotny (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472169#comment-16472169
 ] 

Marek Novotny commented on SPARK-23931:
---

Ok. Good luck!

> High-order function: zip(array1, array2[, ...]) → array
> 
>
> Key: SPARK-23931
> URL: https://issues.apache.org/jira/browse/SPARK-23931
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Merges the given arrays, element-wise, into a single array of rows. The M-th 
> element of the N-th argument will be the N-th field of the M-th output 
> element. If the arguments have an uneven length, missing values are filled 
> with NULL.
> {noformat}
> SELECT zip(ARRAY[1, 2], ARRAY['1b', null, '3b']); -- [ROW(1, '1b'), ROW(2, 
> null), ROW(null, '3b')]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-23931) High-order function: zip(array1, array2[, ...]) → array

2018-05-11 Thread Dylan Guedes (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472161#comment-16472161
 ] 

Dylan Guedes commented on SPARK-23931:
--

Hi Marek! I finally get some progress, I think that more a few hours and I can 
complete this. Thank you!

> High-order function: zip(array1, array2[, ...]) → array
> 
>
> Key: SPARK-23931
> URL: https://issues.apache.org/jira/browse/SPARK-23931
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Merges the given arrays, element-wise, into a single array of rows. The M-th 
> element of the N-th argument will be the N-th field of the M-th output 
> element. If the arguments have an uneven length, missing values are filled 
> with NULL.
> {noformat}
> SELECT zip(ARRAY[1, 2], ARRAY['1b', null, '3b']); -- [ROW(1, '1b'), ROW(2, 
> null), ROW(null, '3b')]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-23931) High-order function: zip(array1, array2[, ...]) → array

2018-05-11 Thread Marek Novotny (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472157#comment-16472157
 ] 

Marek Novotny commented on SPARK-23931:
---

[~DylanGuedes] Any joy? I can take this one if you want.

> High-order function: zip(array1, array2[, ...]) → array
> 
>
> Key: SPARK-23931
> URL: https://issues.apache.org/jira/browse/SPARK-23931
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Merges the given arrays, element-wise, into a single array of rows. The M-th 
> element of the N-th argument will be the N-th field of the M-th output 
> element. If the arguments have an uneven length, missing values are filled 
> with NULL.
> {noformat}
> SELECT zip(ARRAY[1, 2], ARRAY['1b', null, '3b']); -- [ROW(1, '1b'), ROW(2, 
> null), ROW(null, '3b')]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-22900) remove unnecessary restrict for streaming dynamic allocation

2018-05-11 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22900.
---
Resolution: Not A Problem

> remove unnecessary restrict for streaming dynamic allocation
> 
>
> Key: SPARK-22900
> URL: https://issues.apache.org/jira/browse/SPARK-22900
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.3.0
>Reporter: sharkd tu
>Priority: Major
>
> When i set the conf `spark.streaming.dynamicAllocation.enabled=true`, the 
> conf `num-executors` can not be set. As a result, it will allocate default 2 
> executors and all receivers will be run on this 2 executors, there may not be 
> redundant cpu cores for tasks. it will stuck all the time.
> in my opinion, we should remove unnecessary restrict for streaming dynamic 
> allocation. we can set `num-executors` and 
> `spark.streaming.dynamicAllocation.enabled=true` together. when application 
> starts, each receiver will be run on an executor.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-22470) Doc that functions.hash is also used internally for shuffle and bucketing

2018-05-11 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22470.
---
Resolution: Won't Fix

> Doc that functions.hash is also used internally for shuffle and bucketing
> -
>
> Key: SPARK-22470
> URL: https://issues.apache.org/jira/browse/SPARK-22470
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 2.2.0
>Reporter: Andrew Ash
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-12480 added a hash function that 
> appears to be the same hash function as what Spark uses internally for 
> shuffle and bucketing.
> One of my users would like to bake this assumption into code, but is unsure 
> if it's a guarantee or a coincidence that they're the same function.  Would 
> it be considered an API break if at some point the two functions were 
> different, or if the implementation of both changed together?
> We should add a line to the scaladoc to clarify.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-13158) Show the information of broadcast blocks in WebUI

2018-05-11 Thread David Moravek (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472114#comment-16472114
 ] 

David Moravek commented on SPARK-13158:
---

Hello, is there any reason this didn't get merged (I'd like to reopen this)? It 
would be really helpful for debugging.

> Show the information of broadcast blocks in WebUI
> -
>
> Key: SPARK-13158
> URL: https://issues.apache.org/jira/browse/SPARK-13158
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> This ticket targets a function to show the information of broadcast blocks, # 
> of blocks total size in mem/disk in a cluster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-8605) Exclude files in StreamingContext. textFileStream(directory)

2018-05-11 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-8605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8605.
--
Resolution: Won't Fix

> Exclude files in StreamingContext. textFileStream(directory)
> 
>
> Key: SPARK-8605
> URL: https://issues.apache.org/jira/browse/SPARK-8605
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Reporter: Noel Vo
>Priority: Major
>  Labels: streaming, streaming_api
>
> Currenly, spark streaming can monitor a directory and it will process the 
> newly added files. This will cause a bug if the files copied to the directory 
> are big. For example, in hdfs, if a file is being copied, its name is 
> file_name._COPYING_. Spark will pick up the file and process. However, when 
> it's done copying the file, the file name becomes file_name. This would cause 
> FileDoesNotExist error. It would be great if we can exclude files using regex 
> in the directory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-24067) Backport SPARK-17147 to 2.3 (Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction))

2018-05-11 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24067:


Assignee: Apache Spark  (was: Cody Koeninger)

> Backport SPARK-17147 to 2.3 (Spark Streaming Kafka 0.10 Consumer Can't Handle 
> Non-consecutive Offsets (i.e. Log Compaction))
> 
>
> Key: SPARK-24067
> URL: https://issues.apache.org/jira/browse/SPARK-24067
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.3.0
>Reporter: Joachim Hereth
>Assignee: Apache Spark
>Priority: Major
>
> SPARK-17147 fixes a problem with non-consecutive Kafka Offsets. The  [PR 
> w|https://github.com/apache/spark/pull/20572]as merged to {{master}}. This 
> should be backported to 2.3.
>  
> Original Description from SPARK-17147 :
>  
> When Kafka does log compaction offsets often end up with gaps, meaning the 
> next requested offset will be frequently not be offset+1. The logic in 
> KafkaRDD & CachedKafkaConsumer has a baked in assumption that the next offset 
> will always be just an increment of 1 above the previous offset.
> I have worked around this problem by changing CachedKafkaConsumer to use the 
> returned record's offset, from:
>  {{nextOffset = offset + 1}}
>  to:
>  {{nextOffset = record.offset + 1}}
> and changed KafkaRDD from:
>  {{requestOffset += 1}}
>  to:
>  {{requestOffset = r.offset() + 1}}
> (I also had to change some assert logic in CachedKafkaConsumer).
> There's a strong possibility that I have misconstrued how to use the 
> streaming kafka consumer, and I'm happy to close this out if that's the case. 
> If, however, it is supposed to support non-consecutive offsets (e.g. due to 
> log compaction) I am also happy to contribute a PR.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-24067) Backport SPARK-17147 to 2.3 (Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction))

2018-05-11 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24067:


Assignee: Cody Koeninger  (was: Apache Spark)

> Backport SPARK-17147 to 2.3 (Spark Streaming Kafka 0.10 Consumer Can't Handle 
> Non-consecutive Offsets (i.e. Log Compaction))
> 
>
> Key: SPARK-24067
> URL: https://issues.apache.org/jira/browse/SPARK-24067
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.3.0
>Reporter: Joachim Hereth
>Assignee: Cody Koeninger
>Priority: Major
>
> SPARK-17147 fixes a problem with non-consecutive Kafka Offsets. The  [PR 
> w|https://github.com/apache/spark/pull/20572]as merged to {{master}}. This 
> should be backported to 2.3.
>  
> Original Description from SPARK-17147 :
>  
> When Kafka does log compaction offsets often end up with gaps, meaning the 
> next requested offset will be frequently not be offset+1. The logic in 
> KafkaRDD & CachedKafkaConsumer has a baked in assumption that the next offset 
> will always be just an increment of 1 above the previous offset.
> I have worked around this problem by changing CachedKafkaConsumer to use the 
> returned record's offset, from:
>  {{nextOffset = offset + 1}}
>  to:
>  {{nextOffset = record.offset + 1}}
> and changed KafkaRDD from:
>  {{requestOffset += 1}}
>  to:
>  {{requestOffset = r.offset() + 1}}
> (I also had to change some assert logic in CachedKafkaConsumer).
> There's a strong possibility that I have misconstrued how to use the 
> streaming kafka consumer, and I'm happy to close this out if that's the case. 
> If, however, it is supposed to support non-consecutive offsets (e.g. due to 
> log compaction) I am also happy to contribute a PR.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-24067) Backport SPARK-17147 to 2.3 (Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction))

2018-05-11 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472071#comment-16472071
 ] 

Apache Spark commented on SPARK-24067:
--

User 'koeninger' has created a pull request for this issue:
https://github.com/apache/spark/pull/21300

> Backport SPARK-17147 to 2.3 (Spark Streaming Kafka 0.10 Consumer Can't Handle 
> Non-consecutive Offsets (i.e. Log Compaction))
> 
>
> Key: SPARK-24067
> URL: https://issues.apache.org/jira/browse/SPARK-24067
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.3.0
>Reporter: Joachim Hereth
>Assignee: Cody Koeninger
>Priority: Major
>
> SPARK-17147 fixes a problem with non-consecutive Kafka Offsets. The  [PR 
> w|https://github.com/apache/spark/pull/20572]as merged to {{master}}. This 
> should be backported to 2.3.
>  
> Original Description from SPARK-17147 :
>  
> When Kafka does log compaction offsets often end up with gaps, meaning the 
> next requested offset will be frequently not be offset+1. The logic in 
> KafkaRDD & CachedKafkaConsumer has a baked in assumption that the next offset 
> will always be just an increment of 1 above the previous offset.
> I have worked around this problem by changing CachedKafkaConsumer to use the 
> returned record's offset, from:
>  {{nextOffset = offset + 1}}
>  to:
>  {{nextOffset = record.offset + 1}}
> and changed KafkaRDD from:
>  {{requestOffset += 1}}
>  to:
>  {{requestOffset = r.offset() + 1}}
> (I also had to change some assert logic in CachedKafkaConsumer).
> There's a strong possibility that I have misconstrued how to use the 
> streaming kafka consumer, and I'm happy to close this out if that's the case. 
> If, however, it is supposed to support non-consecutive offsets (e.g. due to 
> log compaction) I am also happy to contribute a PR.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-24179) History Server for Kubernetes

2018-05-11 Thread Abhishek Rao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472035#comment-16472035
 ] 

Abhishek Rao commented on SPARK-24179:
--

We have brought up Spark History Server on Kubernetes using following approach. 
1) Prepare History-server docker container.

    a. Inherit Dockerfile from Kubespark/spark-base.
    b. Set Environment variable SPARK_NO_DAEMONIZE so that history-server can 
start as a Daemon.
    c. As part of Docker entry point, we’re invoking start-history-server.sh

2) Bringup Kubernetes pod using the docker image built in step 1
3) Bringup Kubernetes service for the histoty-server pod with type ClusterIP or 
NodePort.
4) If we bringup service with NodePort, History server can be accessed using 
NodeIP:NodePort
5) If we bringup service with ClusterIP, then we need to create Kubernetes 
ingress which will forward the request to the service brought up in step 3. 
Once we have the ingress up, history server can be accessed using Edge Node IP 
with the “path” specified in ingress.
6) Only limitation with ingress is that redirection is happening only when we 
use “/” in path. If we use any other string, redirection is not happening

This is validated using spark binaries from apache-spark-on-k8s forked from 
spark 2.2 as well as apache spark 2.3. Attached are the screen shots for both 
versions of spark.

> History Server for Kubernetes
> -
>
> Key: SPARK-24179
> URL: https://issues.apache.org/jira/browse/SPARK-24179
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Eric Charles
>Priority: Major
> Attachments: Spark2_2_History_Server.PNG, Spark2_3_History_Server.PNG
>
>
> The History server is missing when running on Kubernetes, with the side 
> effect we can not debug post-mortem or analyze after-the-fact.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-24179) History Server for Kubernetes

2018-05-11 Thread Abhishek Rao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Rao updated SPARK-24179:
-
Attachment: Spark2_3_History_Server.PNG

> History Server for Kubernetes
> -
>
> Key: SPARK-24179
> URL: https://issues.apache.org/jira/browse/SPARK-24179
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Eric Charles
>Priority: Major
> Attachments: Spark2_2_History_Server.PNG, Spark2_3_History_Server.PNG
>
>
> The History server is missing when running on Kubernetes, with the side 
> effect we can not debug post-mortem or analyze after-the-fact.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-24179) History Server for Kubernetes

2018-05-11 Thread Abhishek Rao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Rao updated SPARK-24179:
-
Attachment: Spark2_2_History_Server.PNG

> History Server for Kubernetes
> -
>
> Key: SPARK-24179
> URL: https://issues.apache.org/jira/browse/SPARK-24179
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Eric Charles
>Priority: Major
> Attachments: Spark2_2_History_Server.PNG, Spark2_3_History_Server.PNG
>
>
> The History server is missing when running on Kubernetes, with the side 
> effect we can not debug post-mortem or analyze after-the-fact.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-24250) support accessing SQLConf inside tasks

2018-05-11 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24250:


Assignee: Apache Spark  (was: Wenchen Fan)

> support accessing SQLConf inside tasks
> --
>
> Key: SPARK-24250
> URL: https://issues.apache.org/jira/browse/SPARK-24250
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-24250) support accessing SQLConf inside tasks

2018-05-11 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16471891#comment-16471891
 ] 

Apache Spark commented on SPARK-24250:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/21299

> support accessing SQLConf inside tasks
> --
>
> Key: SPARK-24250
> URL: https://issues.apache.org/jira/browse/SPARK-24250
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-24250) support accessing SQLConf inside tasks

2018-05-11 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24250:


Assignee: Wenchen Fan  (was: Apache Spark)

> support accessing SQLConf inside tasks
> --
>
> Key: SPARK-24250
> URL: https://issues.apache.org/jira/browse/SPARK-24250
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-24250) support accessing SQLConf inside tasks

2018-05-11 Thread Wenchen Fan (JIRA)

Wenchen Fan created SPARK-24250:
---

 Summary: support accessing SQLConf inside tasks
 Key: SPARK-24250
 URL: https://issues.apache.org/jira/browse/SPARK-24250
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-23128) A new approach to do adaptive execution in Spark SQL

2018-05-11 Thread Li Yuanjian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16471812#comment-16471812
 ] 

Li Yuanjian commented on SPARK-23128:
-

I collected some user cases and performance improve effect during Baidu 
internal usage of this patch, summarize as following 3 scenario:
1. SortMergeJoin to BroadcastJoin
The SortMergeJoin transform to BroadcastJoin over deeply tree node can bring us 
{color:red}50% to 200%{color} boosting on query performance, and this strategy 
alway hit the BI scenario like join several tables with filter strategy in 
subquery
2. Long running application or use Spark as a service
In this case, long running application refers to the duration of application 
near 1 hour. Using Spark as a service refers to use spark-shell and keep submit 
sql or use the service of Spark like Zeppelin, Livy or our internal sql service 
Baidu BigSQL. In such scenario, all spark jobs share same partition number, so 
enable AE and add configs about expected task info including data size, row 
number, min\max partition number and etc, will bring us 
{color:red}50%-100%{color} boosting on performance improvement.
3. GraphFrame jobs
The last scenario is the application use GraphFrame, in this case, user has a 
2-dimension graph with 1 billion edges, use the connected componentsalgorithm 
in GraphFrame. With enabling AE, the duration of app reduce from 58min to 
32min, almost {color:red}100%{color} boosting on performance improvement.

The detailed screenshot and config in the attached pdf. 

> A new approach to do adaptive execution in Spark SQL
> 
>
> Key: SPARK-23128
> URL: https://issues.apache.org/jira/browse/SPARK-23128
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Carson Wang
>Priority: Major
> Attachments: AdaptiveExecutioninBaidu.pdf
>
>
> SPARK-9850 proposed the basic idea of adaptive execution in Spark. In 
> DAGScheduler, a new API is added to support submitting a single map stage.  
> The current implementation of adaptive execution in Spark SQL supports 
> changing the reducer number at runtime. An Exchange coordinator is used to 
> determine the number of post-shuffle partitions for a stage that needs to 
> fetch shuffle data from one or multiple stages. The current implementation 
> adds ExchangeCoordinator while we are adding Exchanges. However there are 
> some limitations. First, it may cause additional shuffles that may decrease 
> the performance. We can see this from EnsureRequirements rule when it adds 
> ExchangeCoordinator.  Secondly, it is not a good idea to add 
> ExchangeCoordinators while we are adding Exchanges because we don’t have a 
> global picture of all shuffle dependencies of a post-shuffle stage. I.e. for 
> 3 tables’ join in a single stage, the same ExchangeCoordinator should be used 
> in three Exchanges but currently two separated ExchangeCoordinator will be 
> added. Thirdly, with the current framework it is not easy to implement other 
> features in adaptive execution flexibly like changing the execution plan and 
> handling skewed join at runtime.
> We'd like to introduce a new way to do adaptive execution in Spark SQL and 
> address the limitations. The idea is described at 
> [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-23128) A new approach to do adaptive execution in Spark SQL

2018-05-11 Thread Li Yuanjian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Yuanjian updated SPARK-23128:

Attachment: AdaptiveExecutioninBaidu.pdf

> A new approach to do adaptive execution in Spark SQL
> 
>
> Key: SPARK-23128
> URL: https://issues.apache.org/jira/browse/SPARK-23128
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Carson Wang
>Priority: Major
> Attachments: AdaptiveExecutioninBaidu.pdf
>
>
> SPARK-9850 proposed the basic idea of adaptive execution in Spark. In 
> DAGScheduler, a new API is added to support submitting a single map stage.  
> The current implementation of adaptive execution in Spark SQL supports 
> changing the reducer number at runtime. An Exchange coordinator is used to 
> determine the number of post-shuffle partitions for a stage that needs to 
> fetch shuffle data from one or multiple stages. The current implementation 
> adds ExchangeCoordinator while we are adding Exchanges. However there are 
> some limitations. First, it may cause additional shuffles that may decrease 
> the performance. We can see this from EnsureRequirements rule when it adds 
> ExchangeCoordinator.  Secondly, it is not a good idea to add 
> ExchangeCoordinators while we are adding Exchanges because we don’t have a 
> global picture of all shuffle dependencies of a post-shuffle stage. I.e. for 
> 3 tables’ join in a single stage, the same ExchangeCoordinator should be used 
> in three Exchanges but currently two separated ExchangeCoordinator will be 
> added. Thirdly, with the current framework it is not easy to implement other 
> features in adaptive execution flexibly like changing the execution plan and 
> handling skewed join at runtime.
> We'd like to introduce a new way to do adaptive execution in Spark SQL and 
> address the limitations. The idea is described at 
> [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Issue Comment Deleted] (SPARK-23128) A new approach to do adaptive execution in Spark SQL

2018-05-11 Thread Li Yuanjian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Yuanjian updated SPARK-23128:

Comment: was deleted

(was: I collected some user cases and performance improve effect during Baidu 
internal usage of this patch, summarize as following 3 scenario:
1. SortMergeJoin to BroadcastJoin
The SortMergeJoin transform to BroadcastJoin over deeply tree node can bring us 
{color:red}50% to 200%{color} boosting on query performance, and this strategy 
alway hit the BI scenario like join several tables with filter strategy in 
subquery
2. Long running application or use Spark as a service
In this case, long running application refers to the duration of application 
near 1 hour. Using Spark as a service refers to use spark-shell and keep submit 
sql or use the service of Spark like Zeppelin, Livy or our internal sql service 
Baidu BigSQL. In such scenario, all spark jobs share same partition number, so 
enable AE and add configs about expected task info including data size, row 
number, min\max partition number and etc, will bring us 
{color:red}50%-100%{color} boosting on performance improvement.
3. GraphFrame jobs
The last scenario is the application use GraphFrame, in this case, user has a 
2-dimension graph with 1 billion edges, use the connected componentsalgorithm 
in GraphFrame. With enabling AE, the duration of app reduce from 58min to 
32min, almost {color:red}100%{color} boosting on performance improvement.

The detailed screenshot and config in the attached pdf. )

> A new approach to do adaptive execution in Spark SQL
> 
>
> Key: SPARK-23128
> URL: https://issues.apache.org/jira/browse/SPARK-23128
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Carson Wang
>Priority: Major
> Attachments: AdaptiveExecutioninBaidu.pdf
>
>
> SPARK-9850 proposed the basic idea of adaptive execution in Spark. In 
> DAGScheduler, a new API is added to support submitting a single map stage.  
> The current implementation of adaptive execution in Spark SQL supports 
> changing the reducer number at runtime. An Exchange coordinator is used to 
> determine the number of post-shuffle partitions for a stage that needs to 
> fetch shuffle data from one or multiple stages. The current implementation 
> adds ExchangeCoordinator while we are adding Exchanges. However there are 
> some limitations. First, it may cause additional shuffles that may decrease 
> the performance. We can see this from EnsureRequirements rule when it adds 
> ExchangeCoordinator.  Secondly, it is not a good idea to add 
> ExchangeCoordinators while we are adding Exchanges because we don’t have a 
> global picture of all shuffle dependencies of a post-shuffle stage. I.e. for 
> 3 tables’ join in a single stage, the same ExchangeCoordinator should be used 
> in three Exchanges but currently two separated ExchangeCoordinator will be 
> added. Thirdly, with the current framework it is not easy to implement other 
> features in adaptive execution flexibly like changing the execution plan and 
> handling skewed join at runtime.
> We'd like to introduce a new way to do adaptive execution in Spark SQL and 
> address the limitations. The idea is described at 
> [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-23128) A new approach to do adaptive execution in Spark SQL

2018-05-11 Thread Li Yuanjian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16471811#comment-16471811
 ] 

Li Yuanjian commented on SPARK-23128:
-

I collected some user cases and performance improve effect during Baidu 
internal usage of this patch, summarize as following 3 scenario:
1. SortMergeJoin to BroadcastJoin
The SortMergeJoin transform to BroadcastJoin over deeply tree node can bring us 
{color:red}50% to 200%{color} boosting on query performance, and this strategy 
alway hit the BI scenario like join several tables with filter strategy in 
subquery
2. Long running application or use Spark as a service
In this case, long running application refers to the duration of application 
near 1 hour. Using Spark as a service refers to use spark-shell and keep submit 
sql or use the service of Spark like Zeppelin, Livy or our internal sql service 
Baidu BigSQL. In such scenario, all spark jobs share same partition number, so 
enable AE and add configs about expected task info including data size, row 
number, min\max partition number and etc, will bring us 
{color:red}50%-100%{color} boosting on performance improvement.
3. GraphFrame jobs
The last scenario is the application use GraphFrame, in this case, user has a 
2-dimension graph with 1 billion edges, use the connected componentsalgorithm 
in GraphFrame. With enabling AE, the duration of app reduce from 58min to 
32min, almost {color:red}100%{color} boosting on performance improvement.

The detailed screenshot and config in the attached pdf. 

> A new approach to do adaptive execution in Spark SQL
> 
>
> Key: SPARK-23128
> URL: https://issues.apache.org/jira/browse/SPARK-23128
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Carson Wang
>Priority: Major
> Attachments: AdaptiveExecutioninBaidu.pdf
>
>
> SPARK-9850 proposed the basic idea of adaptive execution in Spark. In 
> DAGScheduler, a new API is added to support submitting a single map stage.  
> The current implementation of adaptive execution in Spark SQL supports 
> changing the reducer number at runtime. An Exchange coordinator is used to 
> determine the number of post-shuffle partitions for a stage that needs to 
> fetch shuffle data from one or multiple stages. The current implementation 
> adds ExchangeCoordinator while we are adding Exchanges. However there are 
> some limitations. First, it may cause additional shuffles that may decrease 
> the performance. We can see this from EnsureRequirements rule when it adds 
> ExchangeCoordinator.  Secondly, it is not a good idea to add 
> ExchangeCoordinators while we are adding Exchanges because we don’t have a 
> global picture of all shuffle dependencies of a post-shuffle stage. I.e. for 
> 3 tables’ join in a single stage, the same ExchangeCoordinator should be used 
> in three Exchanges but currently two separated ExchangeCoordinator will be 
> added. Thirdly, with the current framework it is not easy to implement other 
> features in adaptive execution flexibly like changing the execution plan and 
> handling skewed join at runtime.
> We'd like to introduce a new way to do adaptive execution in Spark SQL and 
> address the limitations. The idea is described at 
> [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-24182) Improve error message for client mode when AM fails

2018-05-11 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-24182:
---

Assignee: Marcelo Vanzin

> Improve error message for client mode when AM fails
> ---
>
> Key: SPARK-24182
> URL: https://issues.apache.org/jira/browse/SPARK-24182
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.4.0
>
>
> Today, when the client AM fails, there's not a lot of useful information 
> printed on the output. Depending on the type of failure, the information 
> provided by the YARN AM is also not very useful. For example, you'd see this 
> in the Spark shell:
> {noformat}
> 18/05/04 11:07:38 ERROR spark.SparkContext: Error initializing SparkContext.
> org.apache.spark.SparkException: Yarn application has already ended! It might 
> have been killed or unable to launch application master.
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:86)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:63)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164)
> at org.apache.spark.SparkContext.(SparkContext.scala:500)
>  [long stack trace]
> {noformat}
> Similarly, on the YARN RM, for certain failures you see a generic error like 
> this:
> {noformat}
> ExitCodeException exitCode=10: at 
> org.apache.hadoop.util.Shell.runCommand(Shell.java:543) at 
> org.apache.hadoop.util.Shell.run(Shell.java:460) at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:720) at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:366)
>  at 
> [blah blah blah]
> {noformat}
> It would be nice if we could provide a more accurate description of what went 
> wrong when possible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-24182) Improve error message for client mode when AM fails

2018-05-11 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-24182.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21243
[https://github.com/apache/spark/pull/21243]

> Improve error message for client mode when AM fails
> ---
>
> Key: SPARK-24182
> URL: https://issues.apache.org/jira/browse/SPARK-24182
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.4.0
>
>
> Today, when the client AM fails, there's not a lot of useful information 
> printed on the output. Depending on the type of failure, the information 
> provided by the YARN AM is also not very useful. For example, you'd see this 
> in the Spark shell:
> {noformat}
> 18/05/04 11:07:38 ERROR spark.SparkContext: Error initializing SparkContext.
> org.apache.spark.SparkException: Yarn application has already ended! It might 
> have been killed or unable to launch application master.
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:86)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:63)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164)
> at org.apache.spark.SparkContext.(SparkContext.scala:500)
>  [long stack trace]
> {noformat}
> Similarly, on the YARN RM, for certain failures you see a generic error like 
> this:
> {noformat}
> ExitCodeException exitCode=10: at 
> org.apache.hadoop.util.Shell.runCommand(Shell.java:543) at 
> org.apache.hadoop.util.Shell.run(Shell.java:460) at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:720) at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:366)
>  at 
> [blah blah blah]
> {noformat}
> It would be nice if we could provide a more accurate description of what went 
> wrong when possible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-20922) Unsafe deserialization in Spark LauncherConnection

2018-05-11 Thread Ruslan Fialkovsky (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16471600#comment-16471600
 ] 

Ruslan Fialkovsky commented on SPARK-20922:
---

I can't update Spark to 2.2.

Will you make fix for Spark 1.6 branch ?

> Unsafe deserialization in Spark LauncherConnection
> --
>
> Key: SPARK-20922
> URL: https://issues.apache.org/jira/browse/SPARK-20922
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Aditya Sharad
>Assignee: Marcelo Vanzin
>Priority: Major
>  Labels: security
> Fix For: 2.0.3, 2.1.2, 2.2.0, 2.3.0
>
> Attachments: spark-deserialize-master.zip
>
>
> The {{run()}} method of the class 
> {{org.apache.spark.launcher.LauncherConnection}} performs unsafe 
> deserialization of data received by its socket. This makes Spark applications 
> launched programmatically using the {{SparkLauncher}} framework potentially 
> vulnerable to remote code execution by an attacker with access to any user 
> account on the local machine. Such an attacker could send a malicious 
> serialized Java object to multiple ports on the local machine, and if this 
> port matches the one (randomly) chosen by the Spark launcher, the malicious 
> object will be deserialized. By making use of gadget chains in code present 
> on the Spark application classpath, the deserialization process can lead to 
> RCE or privilege escalation.
> This vulnerability is identified by the “Unsafe deserialization” rule on 
> lgtm.com:
> https://lgtm.com/projects/g/apache/spark/snapshot/80fdc2c9d1693f5b3402a79ca4ec76f6e422ff13/files/launcher/src/main/java/org/apache/spark/launcher/LauncherConnection.java#V58
>  
> Attached is a proof-of-concept exploit involving a simple 
> {{SparkLauncher}}-based application and a known gadget chain in the Apache 
> Commons Beanutils library referenced by Spark.
> See the readme file for demonstration instructions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-24249) Spark on kubernetes, pods crashes with spark sql job.

2018-05-11 Thread kaushik srinivas (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kaushik srinivas updated SPARK-24249:
-
Description: 
Below is the scenario being tested,

Job :
 Spark sql job is written in scala, and to run on 1TB TPCDS BENCHMARK DATA 
which is in parquet,snappy format and hive tables created on top of it.

Cluster manager :
 Kubernetes

Spark sql configuration :

Set 1 :
 spark.executor.heartbeatInterval 20s
 spark.executor.cores 4
 spark.driver.cores 4
 spark.driver.memory 15g
 spark.executor.memory 15g
 spark.cores.max 220
 spark.rpc.numRetries 5
 spark.rpc.retry.wait 5
 spark.network.timeout 1800
 spark.sql.broadcastTimeout 1200
 spark.sql.crossJoin.enabled true
 spark.sql.starJoinOptimization true
 spark.eventLog.enabled true
 spark.eventLog.dir hdfs://namenodeHA/tmp/spark-history
 spark.sql.codegen true
 spark.kubernetes.allocation.batch.size 30

Set 2 :
 spark.executor.heartbeatInterval 20s
 spark.executor.cores 4
 spark.driver.cores 4
 spark.driver.memory 11g
 spark.driver.memoryOverhead 4g
 spark.executor.memory 11g
 spark.executor.memoryOverhead 4g
 spark.cores.max 220
 spark.rpc.numRetries 5
 spark.rpc.retry.wait 5
 spark.network.timeout 1800
 spark.sql.broadcastTimeout 1200
 spark.sql.crossJoin.enabled true
 spark.sql.starJoinOptimization true
 spark.eventLog.enabled true
 spark.eventLog.dir hdfs://namenodeHA/tmp/spark-history
 spark.sql.codegen true
 spark.kubernetes.allocation.batch.size 30

Kryoserialiser is being used and with "spark.kryoserializer.buffer.mb" value of 
64mb.
 50 executors are being spawned using spark.executor.instances=50 submit 
argument.

Issues Observed:

Spark sql job is terminating abruptly and the drivers,executors are being 
killed randomly.
 driver and executors pods gets killed suddenly the job fails.

Few different stack traces are found across different runs,

Stack Trace 1:
 "2018-05-10 06:31:28 ERROR ContextCleaner:91 - Error cleaning broadcast 136
 org.apache.spark.SparkException: Exception thrown in awaitResult:
 at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)"
 File attached : [^StackTrace1.txt]

Stack Trace 2: 
 "org.apache.spark.shuffle.FetchFailedException: Failed to connect to 
/192.178.1.105:38039^M
 at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)^M
 at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)"
 File attached : [^StackTrace2.txt]

Stack Trace 3:
 "18/05/10 11:21:17 WARN KubernetesTaskSetManager: Lost task 3.0 in stage 48.0 
(TID 16486, 192.178.1.35, executor 41): FetchFailed(null, shuffleId=29, 
mapId=-1, reduceId=3, message=^M
 org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
location for shuffle 29^M
 at 
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:697)^M
 at 
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:693)"
 File attached : [^StackTrace3.txt]

Stack Trace 4:
 "ERROR KubernetesTaskSchedulerImpl: Lost executor 11 on 192.178.1.123: 
Executor lost for unknown reasons."
 This is repeating constantly until the executors are dead completely without 
any stack traces.

File attached : [^StackTrace4.txt]

Also, we see 18/05/11 07:23:23 INFO DAGScheduler: failed: Set()
 what does this mean ? anything is wrong or it says failed set is empty that 
means no failure ?

Observations or changes tried out :
 > Monitored memory and CPU utilisation across executors and none of them are 
 > hitting the limits.
 > As per few readings and suggestions 
 spark.network.timeout was increased to 1800 from 600, but did not help.
 > Also, driver and executor memory overhead was kept default in set 1 of the 
 > config and it was 0.1*15g=1.5gb.
 Increased this value also, explicitly to 4gb and reduced driver and executor 
memory values to 11gb from 15gb as per set 2.
 this did not yield any valuable results, same failures are being observed.

SparkSql is being used to run the queries,
 sample code lines :
 val qresult = spark.sql(q)
 qresult.show()
 No manual repartitioning is being done in the code.

 

 

  was:
Below is the scenario being tested,

Job :
 Spark sql job is written in scala, and to run on 1TB TPCDS BENCHMARK DATA 
which is in parquet,snappy format and hive tables created on top of it.

Cluster manager :
 Kubernetes

Spark sql configuration :

Set 1 :
 spark.executor.heartbeatInterval 20s
 spark.executor.cores 4
 spark.driver.cores 4
 spark.driver.memory 15g
 spark.executor.memory 15g
 spark.cores.max 220
 spark.rpc.numRetries 5
 spark.rpc.retry.wait 5
 spark.network.timeout 1800
 spark.sql.broadcastTimeout 1200
 spark.sql.crossJoin.enabled true
 spark.sql.starJoinOptimization true
 spark.eventLog.enabled true
 spark

[jira] [Updated] (SPARK-24249) Spark on kubernetes, pods crashes with spark sql job.

2018-05-11 Thread kaushik srinivas (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kaushik srinivas updated SPARK-24249:
-
Description: 
Below is the scenario being tested,

Job :
 Spark sql job is written in scala, and to run on 1TB TPCDS BENCHMARK DATA 
which is in parquet,snappy format and hive tables created on top of it.

Cluster manager :
 Kubernetes

Spark sql configuration :

Set 1 :
 spark.executor.heartbeatInterval 20s
 spark.executor.cores 4
 spark.driver.cores 4
 spark.driver.memory 15g
 spark.executor.memory 15g
 spark.cores.max 220
 spark.rpc.numRetries 5
 spark.rpc.retry.wait 5
 spark.network.timeout 1800
 spark.sql.broadcastTimeout 1200
 spark.sql.crossJoin.enabled true
 spark.sql.starJoinOptimization true
 spark.eventLog.enabled true
 spark.eventLog.dir hdfs://namenodeHA/tmp/spark-history
 spark.sql.codegen true
 spark.kubernetes.allocation.batch.size 30

Set 2 :
 spark.executor.heartbeatInterval 20s
 spark.executor.cores 4
 spark.driver.cores 4
 spark.driver.memory 11g
 spark.driver.memoryOverhead 4g
 spark.executor.memory 11g
 spark.executor.memoryOverhead 4g
 spark.cores.max 220
 spark.rpc.numRetries 5
 spark.rpc.retry.wait 5
 spark.network.timeout 1800
 spark.sql.broadcastTimeout 1200
 spark.sql.crossJoin.enabled true
 spark.sql.starJoinOptimization true
 spark.eventLog.enabled true
 spark.eventLog.dir hdfs://namenodeHA/tmp/spark-history
 spark.sql.codegen true
 spark.kubernetes.allocation.batch.size 30

Kryoserialiser is being used and with "spark.kryoserializer.buffer.mb" value of 
64mb.
 50 executors are being spawned using spark.executor.instances=50 submit 
argument.

Issues Observed:

Spark sql job is terminating abruptly and the drivers,executors are being 
killed randomly.
 driver and executors pods gets killed suddenly the job fails.

Few different stack traces are found across different runs,

Stack Trace 1:
 "2018-05-10 06:31:28 ERROR ContextCleaner:91 - Error cleaning broadcast 136
 org.apache.spark.SparkException: Exception thrown in awaitResult:
 at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)"
 File attached : [^StackTrace1.txt]

Stack Trace 2: 
 "org.apache.spark.shuffle.FetchFailedException: Failed to connect to 
/192.178.1.105:38039^M
 at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)^M
 at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)"
 File attached : [^StackTrace2.txt]

Stack Trace 3:
 "18/05/10 11:21:17 WARN KubernetesTaskSetManager: Lost task 3.0 in stage 48.0 
(TID 16486, 192.178.1.35, executor 41): FetchFailed(null, shuffleId=29, 
mapId=-1, reduceId=3, message=^M
 org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
location for shuffle 29^M
 at 
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:697)^M
 at 
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:693)"
 File attached : [^StackTrace3.txt]

Stack Trace 4:
 "ERROR KubernetesTaskSchedulerImpl: Lost executor 11 on 192.178.1.123: 
Executor lost for unknown reasons."
 This is repeating constantly until the executors are dead completely without 
any stack traces.

[^StackTrace4.txt]

Also, we see 18/05/11 07:23:23 INFO DAGScheduler: failed: Set()
 what does this mean ? anything is wrong or it says failed set is empty that 
means no failure ?

Observations or changes tried out :
 > Monitored memory and CPU utilisation across executors and none of them are 
 > hitting the limits.
 > As per few readings and suggestions 
 spark.network.timeout was increased to 1800 from 600, but did not help.
 > Also, driver and executor memory overhead was kept default in set 1 of the 
 > config and it was 0.1*15g=1.5gb.
 Increased this value also, explicitly to 4gb and reduced driver and executor 
memory values to 11gb from 15gb as per set 2.
 this did not yield any valuable results, same failures are being observed.

SparkSql is being used to run the queries,
 sample code lines :
 val qresult = spark.sql(q)
 qresult.show()
 No manual repartitioning is being done in the code.

 

 

  was:
Below is the scenario being tested,

Job :
Spark sql job is written in scala, and to run on 1TB TPCDS BENCHMARK DATA which 
is in parquet,snappy format and hive tables created on top of it.

Cluster manager :
Kubernetes

Spark sql configuration :

Set 1 :
spark.executor.heartbeatInterval 20s
spark.executor.cores 4
spark.driver.cores 4
spark.driver.memory 15g
spark.executor.memory 15g
spark.cores.max 220
spark.rpc.numRetries 5
spark.rpc.retry.wait 5
spark.network.timeout 1800
spark.sql.broadcastTimeout 1200
spark.sql.crossJoin.enabled true
spark.sql.starJoinOptimization true
spark.eventLog.enabled true
spark.eventLog.dir hdfs://namenodeHA/

[jira] [Updated] (SPARK-24249) Spark on kubernetes, pods crashes with spark sql job.

2018-05-11 Thread kaushik srinivas (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kaushik srinivas updated SPARK-24249:
-
Attachment: StackTrace4.txt
StackTrace3.txt
StackTrace2.txt
StackTrace1.txt

> Spark on kubernetes, pods crashes with spark sql job.
> -
>
> Key: SPARK-24249
> URL: https://issues.apache.org/jira/browse/SPARK-24249
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.2.0
> Environment: Spark version : spark-2.2.0-k8s-0.5.0-bin-2.7.3
> Kubernetes version : Kubernetes 1.9.7
> Spark sql configuration :
> Set 1 :
> spark.executor.heartbeatInterval 20s
> spark.executor.cores 4
> spark.driver.cores 4
> spark.driver.memory 15g
> spark.executor.memory 15g
> spark.cores.max 220
> spark.rpc.numRetries 5
> spark.rpc.retry.wait 5
> spark.network.timeout 1800
> spark.sql.broadcastTimeout 1200
> spark.sql.crossJoin.enabled true
> spark.sql.starJoinOptimization true
> spark.eventLog.enabled true
> spark.eventLog.dir hdfs://namenodeHA/tmp/spark-history
> spark.sql.codegen true
> spark.kubernetes.allocation.batch.size 30
> Set 2 :
> spark.executor.heartbeatInterval 20s
> spark.executor.cores 4
> spark.driver.cores 4
> spark.driver.memory 11g
> spark.driver.memoryOverhead 4g
> spark.executor.memory 11g
> spark.executor.memoryOverhead 4g
> spark.cores.max 220
> spark.rpc.numRetries 5
> spark.rpc.retry.wait 5
> spark.network.timeout 1800
> spark.sql.broadcastTimeout 1200
> spark.sql.crossJoin.enabled true
> spark.sql.starJoinOptimization true
> spark.eventLog.enabled true
> spark.eventLog.dir hdfs://namenodeHA/tmp/spark-history
> spark.sql.codegen true
> spark.kubernetes.allocation.batch.size 30
> Kryoserialiser is being used and with "spark.kryoserializer.buffer.mb" value 
> of 64mb.
> 50 executors are being spawned using spark.executor.instances=50 submit 
> argument.
>Reporter: kaushik srinivas
>Priority: Major
> Attachments: StackTrace1.txt, StackTrace2.txt, StackTrace3.txt, 
> StackTrace4.txt
>
>
> Below is the scenario being tested,
> Job :
> Spark sql job is written in scala, and to run on 1TB TPCDS BENCHMARK DATA 
> which is in parquet,snappy format and hive tables created on top of it.
> Cluster manager :
> Kubernetes
> Spark sql configuration :
> Set 1 :
> spark.executor.heartbeatInterval 20s
> spark.executor.cores 4
> spark.driver.cores 4
> spark.driver.memory 15g
> spark.executor.memory 15g
> spark.cores.max 220
> spark.rpc.numRetries 5
> spark.rpc.retry.wait 5
> spark.network.timeout 1800
> spark.sql.broadcastTimeout 1200
> spark.sql.crossJoin.enabled true
> spark.sql.starJoinOptimization true
> spark.eventLog.enabled true
> spark.eventLog.dir hdfs://namenodeHA/tmp/spark-history
> spark.sql.codegen true
> spark.kubernetes.allocation.batch.size 30
> Set 2 :
> spark.executor.heartbeatInterval 20s
> spark.executor.cores 4
> spark.driver.cores 4
> spark.driver.memory 11g
> spark.driver.memoryOverhead 4g
> spark.executor.memory 11g
> spark.executor.memoryOverhead 4g
> spark.cores.max 220
> spark.rpc.numRetries 5
> spark.rpc.retry.wait 5
> spark.network.timeout 1800
> spark.sql.broadcastTimeout 1200
> spark.sql.crossJoin.enabled true
> spark.sql.starJoinOptimization true
> spark.eventLog.enabled true
> spark.eventLog.dir hdfs://namenodeHA/tmp/spark-history
> spark.sql.codegen true
> spark.kubernetes.allocation.batch.size 30
> Kryoserialiser is being used and with "spark.kryoserializer.buffer.mb" value 
> of 64mb.
> 50 executors are being spawned using spark.executor.instances=50 submit 
> argument.
> Issues Observed:
> Spark sql job is terminating abruptly and the drivers,executors are being 
> killed randomly.
> driver and executors pods gets killed suddenly the job fails.
> Few different stack traces are found across different runs,
> Stack Trace 1:
> "2018-05-10 06:31:28 ERROR ContextCleaner:91 - Error cleaning broadcast 136
> org.apache.spark.SparkException: Exception thrown in awaitResult:
>  at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)"
> File attached : StackTrace1.txt
> Stack Trace 2: 
> "org.apache.spark.shuffle.FetchFailedException: Failed to connect to 
> /192.178.1.105:38039^M
>  at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)^M
>  at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)"
> File attached : StackTrace2.txt
> Stack Trace 3:
> "18/05/10 11:21:17 WARN KubernetesTaskSetManager: Lost task 3.0 in stage 48.0 
> (TID 16486, 192.178.1.35, executor 41): FetchFailed(null, shuffleId=29, 
> mapId=-1, reduceId=3, message=^M
> org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
> location for shuffle 29

[jira] [Created] (SPARK-24249) Spark on kubernetes, pods crashes with spark sql job.

2018-05-11 Thread kaushik srinivas (JIRA)

kaushik srinivas created SPARK-24249:


 Summary: Spark on kubernetes, pods crashes with spark sql job.
 Key: SPARK-24249
 URL: https://issues.apache.org/jira/browse/SPARK-24249
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 2.2.0
 Environment: Spark version : spark-2.2.0-k8s-0.5.0-bin-2.7.3

Kubernetes version : Kubernetes 1.9.7

Spark sql configuration :

Set 1 :
spark.executor.heartbeatInterval 20s
spark.executor.cores 4
spark.driver.cores 4
spark.driver.memory 15g
spark.executor.memory 15g
spark.cores.max 220
spark.rpc.numRetries 5
spark.rpc.retry.wait 5
spark.network.timeout 1800
spark.sql.broadcastTimeout 1200
spark.sql.crossJoin.enabled true
spark.sql.starJoinOptimization true
spark.eventLog.enabled true
spark.eventLog.dir hdfs://namenodeHA/tmp/spark-history
spark.sql.codegen true
spark.kubernetes.allocation.batch.size 30

Set 2 :
spark.executor.heartbeatInterval 20s
spark.executor.cores 4
spark.driver.cores 4
spark.driver.memory 11g
spark.driver.memoryOverhead 4g
spark.executor.memory 11g
spark.executor.memoryOverhead 4g
spark.cores.max 220
spark.rpc.numRetries 5
spark.rpc.retry.wait 5
spark.network.timeout 1800
spark.sql.broadcastTimeout 1200
spark.sql.crossJoin.enabled true
spark.sql.starJoinOptimization true
spark.eventLog.enabled true
spark.eventLog.dir hdfs://namenodeHA/tmp/spark-history
spark.sql.codegen true
spark.kubernetes.allocation.batch.size 30

Kryoserialiser is being used and with "spark.kryoserializer.buffer.mb" value of 
64mb.
50 executors are being spawned using spark.executor.instances=50 submit 
argument.
Reporter: kaushik srinivas


Below is the scenario being tested,

Job :
Spark sql job is written in scala, and to run on 1TB TPCDS BENCHMARK DATA which 
is in parquet,snappy format and hive tables created on top of it.

Cluster manager :
Kubernetes

Spark sql configuration :

Set 1 :
spark.executor.heartbeatInterval 20s
spark.executor.cores 4
spark.driver.cores 4
spark.driver.memory 15g
spark.executor.memory 15g
spark.cores.max 220
spark.rpc.numRetries 5
spark.rpc.retry.wait 5
spark.network.timeout 1800
spark.sql.broadcastTimeout 1200
spark.sql.crossJoin.enabled true
spark.sql.starJoinOptimization true
spark.eventLog.enabled true
spark.eventLog.dir hdfs://namenodeHA/tmp/spark-history
spark.sql.codegen true
spark.kubernetes.allocation.batch.size 30

Set 2 :
spark.executor.heartbeatInterval 20s
spark.executor.cores 4
spark.driver.cores 4
spark.driver.memory 11g
spark.driver.memoryOverhead 4g
spark.executor.memory 11g
spark.executor.memoryOverhead 4g
spark.cores.max 220
spark.rpc.numRetries 5
spark.rpc.retry.wait 5
spark.network.timeout 1800
spark.sql.broadcastTimeout 1200
spark.sql.crossJoin.enabled true
spark.sql.starJoinOptimization true
spark.eventLog.enabled true
spark.eventLog.dir hdfs://namenodeHA/tmp/spark-history
spark.sql.codegen true
spark.kubernetes.allocation.batch.size 30

Kryoserialiser is being used and with "spark.kryoserializer.buffer.mb" value of 
64mb.
50 executors are being spawned using spark.executor.instances=50 submit 
argument.

Issues Observed:

Spark sql job is terminating abruptly and the drivers,executors are being 
killed randomly.
driver and executors pods gets killed suddenly the job fails.

Few different stack traces are found across different runs,

Stack Trace 1:
"2018-05-10 06:31:28 ERROR ContextCleaner:91 - Error cleaning broadcast 136
org.apache.spark.SparkException: Exception thrown in awaitResult:
 at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)"
File attached : StackTrace1.txt

Stack Trace 2: 
"org.apache.spark.shuffle.FetchFailedException: Failed to connect to 
/192.178.1.105:38039^M
 at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)^M
 at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)"
File attached : StackTrace2.txt

Stack Trace 3:
"18/05/10 11:21:17 WARN KubernetesTaskSetManager: Lost task 3.0 in stage 48.0 
(TID 16486, 192.178.1.35, executor 41): FetchFailed(null, shuffleId=29, 
mapId=-1, reduceId=3, message=^M
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
location for shuffle 29^M
 at 
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:697)^M
 at 
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:693)"
File attached : StackTrace3.txt

Stack Trace 4:
"ERROR KubernetesTaskSchedulerImpl: Lost executor 11 on 192.178.1.123: Executor 
lost for unknown reasons."
This is repeating constantly until the executors are dead completely without 
any stack traces.

Also, we see 18/05/11 07:23:23 INFO DAGScheduler: failed: Set()
what does this mean ? anything is

97 matches

Mail list logo