[jira] [Assigned] (SPARK-24256) ExpressionEncoder should support user-defined types as fields of Scala case class and tuple
[ https://issues.apache.org/jira/browse/SPARK-24256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24256: Assignee: Apache Spark > ExpressionEncoder should support user-defined types as fields of Scala case > class and tuple > --- > > Key: SPARK-24256 > URL: https://issues.apache.org/jira/browse/SPARK-24256 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Fangshi Li >Assignee: Apache Spark >Priority: Major > > Right now, ExpressionEncoder supports ser/de of primitive types, as well as > scala case class, tuple and java bean class. Spark's Dataset natively > supports these mentioned types, but we find Dataset is not flexible for other > user-defined types and encoders. > For example, spark-avro has an AvroEncoder for ser/de Avro types in Dataset. > Although we can use AvroEncoder to define Dataset with types being the Avro > Generic or Specific Record, using such Avro typed Dataset has many > limitations: > 1. We can not use joinWith on this Dataset since the result is a tuple, but > Avro types cannot be the field of this tuple. > 2. We can not use some type-safe aggregation methods on this Dataset, such > as KeyValueGroupedDataset's reduceGroups, since the result is also a tuple. > 3. We cannot augment an Avro SpecificRecord with additional primitive fields > together in a case class, which we find is a very common use case. > The limitation that Spark does not support define a Scala case class/tuple > with subfields being any other user-defined type, is because > ExpressionEncoder does not discover the implicit Encoder for the user-defined > field types, thus can not use any Encoder to serde the user-defined fields in > case class/tuple. > To address this issue, we propose a trait as a contract(between > ExpressionEncoder and any other user-defined Encoder) to enable case > class/tuple/java bean's ExpressionEncoder to discover the > serializer/deserializer/schema from the Encoder of the user-defined type. > With this proposed patch and our minor modification in AvroEncoder, we remove > these limitations with cluster-default conf > spark.expressionencoder.org.apache.avro.specific.SpecificRecord = > com.databricks.spark.avro.AvroEncoder$ > This is a patch we have implemented internally and has been used for a few > quarters. We want to propose to upstream as we think this is a useful feature > to make Dataset more flexible to user types. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24256) ExpressionEncoder should support user-defined types as fields of Scala case class and tuple
[ https://issues.apache.org/jira/browse/SPARK-24256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472933#comment-16472933 ] Apache Spark commented on SPARK-24256: -- User 'fangshil' has created a pull request for this issue: https://github.com/apache/spark/pull/21310 > ExpressionEncoder should support user-defined types as fields of Scala case > class and tuple > --- > > Key: SPARK-24256 > URL: https://issues.apache.org/jira/browse/SPARK-24256 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Fangshi Li >Priority: Major > > Right now, ExpressionEncoder supports ser/de of primitive types, as well as > scala case class, tuple and java bean class. Spark's Dataset natively > supports these mentioned types, but we find Dataset is not flexible for other > user-defined types and encoders. > For example, spark-avro has an AvroEncoder for ser/de Avro types in Dataset. > Although we can use AvroEncoder to define Dataset with types being the Avro > Generic or Specific Record, using such Avro typed Dataset has many > limitations: > 1. We can not use joinWith on this Dataset since the result is a tuple, but > Avro types cannot be the field of this tuple. > 2. We can not use some type-safe aggregation methods on this Dataset, such > as KeyValueGroupedDataset's reduceGroups, since the result is also a tuple. > 3. We cannot augment an Avro SpecificRecord with additional primitive fields > together in a case class, which we find is a very common use case. > The limitation that Spark does not support define a Scala case class/tuple > with subfields being any other user-defined type, is because > ExpressionEncoder does not discover the implicit Encoder for the user-defined > field types, thus can not use any Encoder to serde the user-defined fields in > case class/tuple. > To address this issue, we propose a trait as a contract(between > ExpressionEncoder and any other user-defined Encoder) to enable case > class/tuple/java bean's ExpressionEncoder to discover the > serializer/deserializer/schema from the Encoder of the user-defined type. > With this proposed patch and our minor modification in AvroEncoder, we remove > these limitations with cluster-default conf > spark.expressionencoder.org.apache.avro.specific.SpecificRecord = > com.databricks.spark.avro.AvroEncoder$ > This is a patch we have implemented internally and has been used for a few > quarters. We want to propose to upstream as we think this is a useful feature > to make Dataset more flexible to user types. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24256) ExpressionEncoder should support user-defined types as fields of Scala case class and tuple
[ https://issues.apache.org/jira/browse/SPARK-24256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24256: Assignee: (was: Apache Spark) > ExpressionEncoder should support user-defined types as fields of Scala case > class and tuple > --- > > Key: SPARK-24256 > URL: https://issues.apache.org/jira/browse/SPARK-24256 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Fangshi Li >Priority: Major > > Right now, ExpressionEncoder supports ser/de of primitive types, as well as > scala case class, tuple and java bean class. Spark's Dataset natively > supports these mentioned types, but we find Dataset is not flexible for other > user-defined types and encoders. > For example, spark-avro has an AvroEncoder for ser/de Avro types in Dataset. > Although we can use AvroEncoder to define Dataset with types being the Avro > Generic or Specific Record, using such Avro typed Dataset has many > limitations: > 1. We can not use joinWith on this Dataset since the result is a tuple, but > Avro types cannot be the field of this tuple. > 2. We can not use some type-safe aggregation methods on this Dataset, such > as KeyValueGroupedDataset's reduceGroups, since the result is also a tuple. > 3. We cannot augment an Avro SpecificRecord with additional primitive fields > together in a case class, which we find is a very common use case. > The limitation that Spark does not support define a Scala case class/tuple > with subfields being any other user-defined type, is because > ExpressionEncoder does not discover the implicit Encoder for the user-defined > field types, thus can not use any Encoder to serde the user-defined fields in > case class/tuple. > To address this issue, we propose a trait as a contract(between > ExpressionEncoder and any other user-defined Encoder) to enable case > class/tuple/java bean's ExpressionEncoder to discover the > serializer/deserializer/schema from the Encoder of the user-defined type. > With this proposed patch and our minor modification in AvroEncoder, we remove > these limitations with cluster-default conf > spark.expressionencoder.org.apache.avro.specific.SpecificRecord = > com.databricks.spark.avro.AvroEncoder$ > This is a patch we have implemented internally and has been used for a few > quarters. We want to propose to upstream as we think this is a useful feature > to make Dataset more flexible to user types. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24256) ExpressionEncoder should support user-defined types as fields of Scala case class and tuple
[ https://issues.apache.org/jira/browse/SPARK-24256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fangshi Li updated SPARK-24256: --- Description: Right now, ExpressionEncoder supports ser/de of primitive types, as well as scala case class, tuple and java bean class. Spark's Dataset natively supports these mentioned types, but we find Dataset is not flexible for other user-defined types and encoders. For example, spark-avro has an AvroEncoder for ser/de Avro types in Dataset. Although we can use AvroEncoder to define Dataset with types being the Avro Generic or Specific Record, using such Avro typed Dataset has many limitations: 1. We can not use joinWith on this Dataset since the result is a tuple, but Avro types cannot be the field of this tuple. 2. We can not use some type-safe aggregation methods on this Dataset, such as KeyValueGroupedDataset's reduceGroups, since the result is also a tuple. 3. We cannot augment an Avro SpecificRecord with additional primitive fields together in a case class, which we find is a very common use case. The limitation that Spark does not support define a Scala case class/tuple with subfields being any other user-defined type, is because ExpressionEncoder does not discover the implicit Encoder for the user-defined field types, thus can not use any Encoder to serde the user-defined fields in case class/tuple. To address this issue, we propose a trait as a contract(between ExpressionEncoder and any other user-defined Encoder) to enable case class/tuple/java bean's ExpressionEncoder to discover the serializer/deserializer/schema from the Encoder of the user-defined type. With this proposed patch and our minor modification in AvroEncoder, we remove these limitations with cluster-default conf spark.expressionencoder.org.apache.avro.specific.SpecificRecord = com.databricks.spark.avro.AvroEncoder$ This is a patch we have implemented internally and has been used for a few quarters. We want to propose to upstream as we think this is a useful feature to make Dataset more flexible to user types. was: Right now, ExpressionEncoder supports ser/de of primitive types, as well as scala case class, tuple and java bean class. Spark's Dataset natively supports these mentioned types, but we find it is not flexible for other user-defined types and encoders. For example, spark-avro has an AvroEncoder for ser/de Avro types in Dataset. Although we can use AvroEncoder to define Dataset with types being the Avro Generic or Specific Record, using such Avro typed Dataset has many limitations: 1. We can not use joinWith on this Dataset since the result is a tuple, but Avro types cannot be the field of this tuple. 2. We can not use some type-safe aggregation methods on this Dataset, such as KeyValueGroupedDataset's reduceGroups, since the result is also a tuple. 3. We cannot augment an Avro SpecificRecord with additional primitive fields together in a case class, which we find is a very common use case. The limitation that Spark does not support define a Scala case class/tuple with subfields being any other user-defined type, is because ExpressionEncoder does not discover the implicit Encoder for the user-defined field types, thus can not use any Encoder to serde the user-defined fields in case class/tuple. To address this issue, we propose a trait as a contract(between ExpressionEncoder and any other user-defined Encoder) to enable case class/tuple/java bean's ExpressionEncoder to discover the serializer/deserializer/schema from the Encoder of the user-defined type. With this proposed patch and our minor modification in AvroEncoder, we remove these limitations with cluster-default conf spark.expressionencoder.org.apache.avro.specific.SpecificRecord = com.databricks.spark.avro.AvroEncoder$ This is a patch we have implemented internally and has been used for a few quarters. We want to propose to upstream as we think this is a useful feature to make Dataset more flexible to user types. > ExpressionEncoder should support user-defined types as fields of Scala case > class and tuple > --- > > Key: SPARK-24256 > URL: https://issues.apache.org/jira/browse/SPARK-24256 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Fangshi Li >Priority: Major > > Right now, ExpressionEncoder supports ser/de of primitive types, as well as > scala case class, tuple and java bean class. Spark's Dataset natively > supports these mentioned types, but we find Dataset is not flexible for other > user-defined types and encoders. > For example, spark-avro has an AvroEncoder for ser/de Avro types in Dataset. > Although we can use AvroEncoder to define Dataset with types being the
[jira] [Updated] (SPARK-24256) ExpressionEncoder should support user-defined types as fields of Scala case class and tuple
[ https://issues.apache.org/jira/browse/SPARK-24256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fangshi Li updated SPARK-24256: --- Description: Right now, ExpressionEncoder supports ser/de of primitive types, as well as scala case class, tuple and java bean class. Spark's Dataset natively supports these mentioned types, but we find it is not flexible for other user-defined types and encoders. For example, spark-avro has an AvroEncoder for ser/de Avro types in Dataset. Although we can use AvroEncoder to define Dataset with types being the Avro Generic or Specific Record, using such Avro typed Dataset has many limitations: 1. We can not use joinWith on this Dataset since the result is a tuple, but Avro types cannot be the field of this tuple. 2. We can not use some type-safe aggregation methods on this Dataset, such as KeyValueGroupedDataset's reduceGroups, since the result is also a tuple. 3. We cannot augment an Avro SpecificRecord with additional primitive fields together in a case class, which we find is a very common use case. The limitation that Spark does not support define a Scala case class/tuple with subfields being any other user-defined type, is because ExpressionEncoder does not discover the implicit Encoder for the user-defined field types, thus can not use any Encoder to serde the user-defined fields in case class/tuple. To address this issue, we propose a trait as a contract(between ExpressionEncoder and any other user-defined Encoder) to enable case class/tuple/java bean's ExpressionEncoder to discover the serializer/deserializer/schema from the Encoder of the user-defined type. With this proposed patch and our minor modification in AvroEncoder, we remove these limitations with cluster-default conf spark.expressionencoder.org.apache.avro.specific.SpecificRecord = com.databricks.spark.avro.AvroEncoder$ This is a patch we have implemented internally and has been used for a few quarters. We want to propose to upstream as we think this is a useful feature to make Dataset more flexible to user types. was: Right now, ExpressionEncoder supports ser/de of primitive types, as well as scala case class, tuple and java bean class. Spark's Dataset natively supports these mentioned types, but we find it is not flexible for other user-defined types and encoders. For example, spark-avro has an AvroEncoder for ser/de Avro types in Dataset. Although we can use AvroEncoder to define Dataset with types being the Avro Generic or Specific Record, using such Avro typed Dataset has many limitations: 1. We can not use joinWith on this Dataset since the result is a tuple, but Avro types cannot be the field of this tuple. 2. We can not use some type-safe aggregation methods on this Dataset, such as KeyValueGroupedDataset's reduceGroups, since the result is also a tuple. 3. We cannot augment an Avro SpecificRecord with additional primitive fields together in a case class, which we find is a very common use case. The root cause for these limitations is that Spark does not support simply define a Scala case class/tuple with subfields being any other user-defined type, since ExpressionEncoder does not discover the implicit Encoder for the user-defined fields, thus can not use any Encoder to serde the user-defined fields in case class/tuple. To address this issue, we propose a trait as a contract(between ExpressionEncoder and any other user-defined Encoder) to enable case class/tuple/java bean's ExpressionEncoder to discover the serializer/deserializer/schema from the Encoder of the user-defined type. With this proposed patch and our minor modification in AvroEncoder, we make it work with conf spark.expressionencoder.org.apache.avro.specific.SpecificRecord = com.databricks.spark.avro.AvroEncoder$ This is a patch we have implemented internally and has been used for a few quarters. We want to propose to upstream as we think this is a useful feature to make Dataset more flexible to user types. > ExpressionEncoder should support user-defined types as fields of Scala case > class and tuple > --- > > Key: SPARK-24256 > URL: https://issues.apache.org/jira/browse/SPARK-24256 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Fangshi Li >Priority: Major > > Right now, ExpressionEncoder supports ser/de of primitive types, as well as > scala case class, tuple and java bean class. Spark's Dataset natively > supports these mentioned types, but we find it is not flexible for other > user-defined types and encoders. > For example, spark-avro has an AvroEncoder for ser/de Avro types in Dataset. > Although we can use AvroEncoder to define Dataset with types being the Avro > Generic or
[jira] [Created] (SPARK-24256) ExpressionEncoder should support user-defined types as fields of Scala case class and tuple
Fangshi Li created SPARK-24256: -- Summary: ExpressionEncoder should support user-defined types as fields of Scala case class and tuple Key: SPARK-24256 URL: https://issues.apache.org/jira/browse/SPARK-24256 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Fangshi Li Right now, ExpressionEncoder supports ser/de of primitive types, as well as scala case class, tuple and java bean class. Spark's Dataset natively supports these mentioned types, but we find it is not flexible for other user-defined types and encoders. For example, spark-avro has an AvroEncoder for ser/de Avro types in Dataset. Although we can use AvroEncoder to define Dataset with types being the Avro Generic or Specific Record, using such Avro typed Dataset has many limitations: 1. We can not use joinWith on this Dataset since the result is a tuple, but Avro types cannot be the field of this tuple. 2. We can not use some type-safe aggregation methods on this Dataset, such as KeyValueGroupedDataset's reduceGroups, since the result is also a tuple. 3. We cannot augment an Avro SpecificRecord with additional primitive fields together in a case class, which we find is a very common use case. The root cause for these limitations is that Spark does not support simply define a Scala case class/tuple with subfields being any other user-defined type, since ExpressionEncoder does not discover the implicit Encoder for the user-defined fields, thus can not use any Encoder to serde the user-defined fields in case class/tuple. To address this issue, we propose a trait as a contract(between ExpressionEncoder and any other user-defined Encoder) to enable case class/tuple/java bean's ExpressionEncoder to discover the serializer/deserializer/schema from the Encoder of the user-defined type. With this proposed patch and our minor modification in AvroEncoder, we make it work with conf spark.expressionencoder.org.apache.avro.specific.SpecificRecord = com.databricks.spark.avro.AvroEncoder$ This is a patch we have implemented internally and has been used for a few quarters. We want to propose to upstream as we think this is a useful feature to make Dataset more flexible to user types. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24174) Expose Hadoop config as part of /environment API
[ https://issues.apache.org/jira/browse/SPARK-24174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikolay Sokolov updated SPARK-24174: Description: Currently, UI or /environment API call of HistoryServer or WebUI exposes only system properties and SparkConf. However, in some cases when Spark is used in conjunction with Hadoop, it is useful to know Hadoop configuration properties. For example, HDFS or GS buffer sizes, hive metastore settings, and so on. So it would be good to have hadoop properties being exposed in /environment API, for example: {code:none} GET .../application_1525395994996_5/environment { "runtime": {"javaVersion": "1.8.0_131 (Oracle Corporation)", ...} "sparkProperties": ["java.io.tmpdir","/tmp", ...], "systemProperties": [["spark.yarn.jars", "local:/usr/lib/spark/jars/*"], ...], "classpathEntries": [["/usr/lib/hadoop/hadoop-annotations.jar","System Classpath"], ...], "hadoopProperties": [["dfs.stream-buffer-size", 4096], ...], } {code} was: Currently, /environment API call exposes only system properties and SparkConf. However, in some cases when Spark is used in conjunction with Hadoop, it is useful to know Hadoop configuration properties. For example, HDFS or GS buffer sizes, hive metastore settings, and so on. So it would be good to have hadoop properties being exposed in /environment API, for example: {code:none} GET .../application_1525395994996_5/environment { "runtime": {"javaVersion": "1.8.0_131 (Oracle Corporation)", ...} "sparkProperties": ["java.io.tmpdir","/tmp", ...], "systemProperties": [["spark.yarn.jars", "local:/usr/lib/spark/jars/*"], ...], "classpathEntries": [["/usr/lib/hadoop/hadoop-annotations.jar","System Classpath"], ...], "hadoopProperties": [["dfs.stream-buffer-size", 4096], ...], } {code} > Expose Hadoop config as part of /environment API > > > Key: SPARK-24174 > URL: https://issues.apache.org/jira/browse/SPARK-24174 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Nikolay Sokolov >Priority: Minor > Labels: features, usability > > Currently, UI or /environment API call of HistoryServer or WebUI exposes only > system properties and SparkConf. However, in some cases when Spark is used in > conjunction with Hadoop, it is useful to know Hadoop configuration > properties. For example, HDFS or GS buffer sizes, hive metastore settings, > and so on. > So it would be good to have hadoop properties being exposed in /environment > API, for example: > {code:none} > GET .../application_1525395994996_5/environment > { >"runtime": {"javaVersion": "1.8.0_131 (Oracle Corporation)", ...} >"sparkProperties": ["java.io.tmpdir","/tmp", ...], >"systemProperties": [["spark.yarn.jars", "local:/usr/lib/spark/jars/*"], > ...], >"classpathEntries": [["/usr/lib/hadoop/hadoop-annotations.jar","System > Classpath"], ...], >"hadoopProperties": [["dfs.stream-buffer-size", 4096], ...], > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24174) Expose Hadoop config as part of /environment API
[ https://issues.apache.org/jira/browse/SPARK-24174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikolay Sokolov updated SPARK-24174: Description: Currently, /environment API call exposes only system properties and SparkConf. However, in some cases when Spark is used in conjunction with Hadoop, it is useful to know Hadoop configuration properties. For example, HDFS or GS buffer sizes, hive metastore settings, and so on. So it would be good to have hadoop properties being exposed in /environment API, for example: {code:none} GET .../application_1525395994996_5/environment { "runtime": {"javaVersion": "1.8.0_131 (Oracle Corporation)", ...} "sparkProperties": ["java.io.tmpdir","/tmp", ...], "systemProperties": [["spark.yarn.jars", "local:/usr/lib/spark/jars/*"], ...], "classpathEntries": [["/usr/lib/hadoop/hadoop-annotations.jar","System Classpath"], ...], "hadoopProperties": [["dfs.stream-buffer-size", 4096], ...], } {code} was: Currently, /environment API call exposes only system properties and SparkConf. However, in some cases when Spark is used in conjunction with Hadoop, it is useful to know Hadoop configuration properties. For example, HDFS or GS buffer sizes, hive metastore settings, and so on. So it would be good to have hadoop properties being exposed in /environment API, for example: {code:none} GET .../application_1525395994996_5/environment { "runtime": {"javaVersion": "1.8.0_131 (Oracle Corporation)", ...} "sparkProperties": ["java.io.tmpdir","/tmp", ...], "systemProperties": [["spark.yarn.jars", "local:/usr/lib/spark/jars/*"], ...], "classpathEntries": [["/usr/lib/hadoop/hadoop-annotations.jar","System Classpath"], ...], "hadoopProperties": [["dfs.stream-buffer-size": 4096], ...], } {code} > Expose Hadoop config as part of /environment API > > > Key: SPARK-24174 > URL: https://issues.apache.org/jira/browse/SPARK-24174 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Nikolay Sokolov >Priority: Minor > Labels: features, usability > > Currently, /environment API call exposes only system properties and > SparkConf. However, in some cases when Spark is used in conjunction with > Hadoop, it is useful to know Hadoop configuration properties. For example, > HDFS or GS buffer sizes, hive metastore settings, and so on. > So it would be good to have hadoop properties being exposed in /environment > API, for example: > {code:none} > GET .../application_1525395994996_5/environment > { >"runtime": {"javaVersion": "1.8.0_131 (Oracle Corporation)", ...} >"sparkProperties": ["java.io.tmpdir","/tmp", ...], >"systemProperties": [["spark.yarn.jars", "local:/usr/lib/spark/jars/*"], > ...], >"classpathEntries": [["/usr/lib/hadoop/hadoop-annotations.jar","System > Classpath"], ...], >"hadoopProperties": [["dfs.stream-buffer-size", 4096], ...], > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24174) Expose Hadoop config as part of /environment API
[ https://issues.apache.org/jira/browse/SPARK-24174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472884#comment-16472884 ] Nikolay Sokolov commented on SPARK-24174: - [~jerryshao] as far a I understand, YARN exposes setting which are effective for RM server, but not effective settings for specific job. For example it would not include hive settings (as yarn does not load hive-site.xml), also it would be current values, not point-in-time for the job (which may not be convenient, if cluster is being tuned over time). Also, client may have different contents of HADOOP_CONF_DIR compared to YARN RM (consider desktop clients, multi-cluster edge nodes, programmatic clients working outside of cluster). > Expose Hadoop config as part of /environment API > > > Key: SPARK-24174 > URL: https://issues.apache.org/jira/browse/SPARK-24174 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Nikolay Sokolov >Priority: Minor > Labels: features, usability > > Currently, /environment API call exposes only system properties and > SparkConf. However, in some cases when Spark is used in conjunction with > Hadoop, it is useful to know Hadoop configuration properties. For example, > HDFS or GS buffer sizes, hive metastore settings, and so on. > So it would be good to have hadoop properties being exposed in /environment > API, for example: > {code:none} > GET .../application_1525395994996_5/environment > { >"runtime": {"javaVersion": "1.8.0_131 (Oracle Corporation)", ...} >"sparkProperties": ["java.io.tmpdir","/tmp", ...], >"systemProperties": [["spark.yarn.jars", "local:/usr/lib/spark/jars/*"], > ...], >"classpathEntries": [["/usr/lib/hadoop/hadoop-annotations.jar","System > Classpath"], ...], >"hadoopProperties": [["dfs.stream-buffer-size": 4096], ...], > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24255) Require Java 8 in SparkR description
[ https://issues.apache.org/jira/browse/SPARK-24255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472817#comment-16472817 ] Shivaram Venkataraman commented on SPARK-24255: --- Resolved by https://github.com/apache/spark/pull/21278 > Require Java 8 in SparkR description > > > Key: SPARK-24255 > URL: https://issues.apache.org/jira/browse/SPARK-24255 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Shivaram Venkataraman >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > CRAN checks require that the Java version be set both in package description > and checked during runtime. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24255) Require Java 8 in SparkR description
[ https://issues.apache.org/jira/browse/SPARK-24255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-24255. --- Resolution: Fixed Assignee: Shivaram Venkataraman Fix Version/s: 2.4.0 2.3.1 > Require Java 8 in SparkR description > > > Key: SPARK-24255 > URL: https://issues.apache.org/jira/browse/SPARK-24255 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Shivaram Venkataraman >Assignee: Shivaram Venkataraman >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > CRAN checks require that the Java version be set both in package description > and checked during runtime. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24255) Require Java 8 in SparkR description
Shivaram Venkataraman created SPARK-24255: - Summary: Require Java 8 in SparkR description Key: SPARK-24255 URL: https://issues.apache.org/jira/browse/SPARK-24255 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.3.0 Reporter: Shivaram Venkataraman CRAN checks require that the Java version be set both in package description and checked during runtime. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23907) Support regr_* functions
[ https://issues.apache.org/jira/browse/SPARK-23907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472797#comment-16472797 ] Apache Spark commented on SPARK-23907: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/21309 > Support regr_* functions > > > Key: SPARK-23907 > URL: https://issues.apache.org/jira/browse/SPARK-23907 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Marco Gaido >Priority: Major > Fix For: 2.4.0 > > > https://issues.apache.org/jira/browse/HIVE-15978 > {noformat} > Support the standard regr_* functions, regr_slope, regr_intercept, regr_r2, > regr_sxx, regr_syy, regr_sxy, regr_avgx, regr_avgy, regr_count. SQL reference > section 10.9 > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24254) Eagerly evaluate some subqueries over LocalRelation
Henry Robinson created SPARK-24254: -- Summary: Eagerly evaluate some subqueries over LocalRelation Key: SPARK-24254 URL: https://issues.apache.org/jira/browse/SPARK-24254 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Henry Robinson Some queries would benefit from evaluating subqueries over {{LocalRelations}} eagerly. For example: {code} SELECT t1.part_col FROM t1 JOIN (SELECT max(part_col) m FROM t2) foo WHERE t1.part_col = foo.m {code} If {{max(part_col)}} could be evaluated during planning, there's an opportunity to prune all but at most one partitions from the scan of {{t1}}. Similarly, a near-identical query with a non-scalar subquery in the {{WHERE}} clause: {code} SELECT * FROM t1 WHERE part_col IN (SELECT part_col FROM t2) {code} could be partially evaluated to eliminate some partitions, and remove the join from the plan. Obviously all subqueries over local relations can't be evaluated during planning, but certain whitelisted aggregates could be if the input cardinality isn't too high. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22594) Handling spark-submit and master version mismatch
[ https://issues.apache.org/jira/browse/SPARK-22594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-22594: -- Assignee: (was: Marcelo Vanzin) > Handling spark-submit and master version mismatch > - > > Key: SPARK-22594 > URL: https://issues.apache.org/jira/browse/SPARK-22594 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Shell, Spark Submit >Affects Versions: 2.1.0, 2.2.0 >Reporter: Jiri Kremser >Priority: Minor > > When using spark-submit in different version than the remote Spark master, > the execution fails on during the message deserialization with this log entry > / exception: > {code} > Error while invoking RpcHandler#receive() for one-way message. > java.io.InvalidClassException: org.apache.spark.rpc.RpcEndpointRef; local > class incompatible: stream classdesc serialVersionUID = 1835832137613908542, > local class serialVersionUID = -1329125091869941550 > at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616) > at > java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1843) > at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1713) > at > java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1843) > at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1713) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2000) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75) > at > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:108) > at > org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1$$anonfun$apply$1.apply(NettyRpcEnv.scala:271) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:320) > at > org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1.apply(NettyRpcEnv.scala:270) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:269) > at > org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:604) > at > org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:655) > at > org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:647) > at > org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:209) > at > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:114) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118) > ... > {code} > This is quite ok and it can be read as version mismatch between the client > and server, however there is no such a message on the client (spark-submit) > side, so if the submitter doesn't have an access to the spark master or spark > UI, there is no way to figure out what is wrong. > I propose sending a {{RpcFailure}} message back from server to client with > some more informative error. I'd use the {{OneWayMessage}} instead of > {{RpcFailure}}, because there was no counterpart {{RpcRequest}}, but I had no > luck sending it using the {{reverseClient.send()}}. I think some internal > protocol is assumed when sending messages server2client. > I have a patch prepared. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22594) Handling spark-submit and master version mismatch
[ https://issues.apache.org/jira/browse/SPARK-22594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-22594: -- Assignee: Marcelo Vanzin > Handling spark-submit and master version mismatch > - > > Key: SPARK-22594 > URL: https://issues.apache.org/jira/browse/SPARK-22594 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Shell, Spark Submit >Affects Versions: 2.1.0, 2.2.0 >Reporter: Jiri Kremser >Assignee: Marcelo Vanzin >Priority: Minor > > When using spark-submit in different version than the remote Spark master, > the execution fails on during the message deserialization with this log entry > / exception: > {code} > Error while invoking RpcHandler#receive() for one-way message. > java.io.InvalidClassException: org.apache.spark.rpc.RpcEndpointRef; local > class incompatible: stream classdesc serialVersionUID = 1835832137613908542, > local class serialVersionUID = -1329125091869941550 > at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616) > at > java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1843) > at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1713) > at > java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1843) > at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1713) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2000) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75) > at > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:108) > at > org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1$$anonfun$apply$1.apply(NettyRpcEnv.scala:271) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:320) > at > org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1.apply(NettyRpcEnv.scala:270) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:269) > at > org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:604) > at > org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:655) > at > org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:647) > at > org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:209) > at > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:114) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118) > ... > {code} > This is quite ok and it can be read as version mismatch between the client > and server, however there is no such a message on the client (spark-submit) > side, so if the submitter doesn't have an access to the spark master or spark > UI, there is no way to figure out what is wrong. > I propose sending a {{RpcFailure}} message back from server to client with > some more informative error. I'd use the {{OneWayMessage}} instead of > {{RpcFailure}}, because there was no counterpart {{RpcRequest}}, but I had no > luck sending it using the {{reverseClient.send()}}. I think some internal > protocol is assumed when sending messages server2client. > I have a patch prepared. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24253) DataSourceV2: Add DeleteSupport for delete and overwrite operations
[ https://issues.apache.org/jira/browse/SPARK-24253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24253: Assignee: Apache Spark > DataSourceV2: Add DeleteSupport for delete and overwrite operations > --- > > Key: SPARK-24253 > URL: https://issues.apache.org/jira/browse/SPARK-24253 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Assignee: Apache Spark >Priority: Major > > Implementing delete and overwrite logical plans requires an API to delete > data from a data source. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24253) DataSourceV2: Add DeleteSupport for delete and overwrite operations
[ https://issues.apache.org/jira/browse/SPARK-24253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472741#comment-16472741 ] Apache Spark commented on SPARK-24253: -- User 'rdblue' has created a pull request for this issue: https://github.com/apache/spark/pull/21308 > DataSourceV2: Add DeleteSupport for delete and overwrite operations > --- > > Key: SPARK-24253 > URL: https://issues.apache.org/jira/browse/SPARK-24253 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > > Implementing delete and overwrite logical plans requires an API to delete > data from a data source. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24253) DataSourceV2: Add DeleteSupport for delete and overwrite operations
[ https://issues.apache.org/jira/browse/SPARK-24253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24253: Assignee: (was: Apache Spark) > DataSourceV2: Add DeleteSupport for delete and overwrite operations > --- > > Key: SPARK-24253 > URL: https://issues.apache.org/jira/browse/SPARK-24253 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > > Implementing delete and overwrite logical plans requires an API to delete > data from a data source. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24253) DataSourceV2: Add DeleteSupport for delete and overwrite operations
Ryan Blue created SPARK-24253: - Summary: DataSourceV2: Add DeleteSupport for delete and overwrite operations Key: SPARK-24253 URL: https://issues.apache.org/jira/browse/SPARK-24253 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.0 Reporter: Ryan Blue Implementing delete and overwrite logical plans requires an API to delete data from a data source. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24186) add array reverse and concat
[ https://issues.apache.org/jira/browse/SPARK-24186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472718#comment-16472718 ] Apache Spark commented on SPARK-24186: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/21307 > add array reverse and concat > - > > Key: SPARK-24186 > URL: https://issues.apache.org/jira/browse/SPARK-24186 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Huaxin Gao >Priority: Major > > Add R versions of https://issues.apache.org/jira/browse/SPARK-23736 and > https://issues.apache.org/jira/browse/SPARK-23926 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22232) Row objects in pyspark created using the `Row(**kwars)` syntax do not get serialized/deserialized properly
[ https://issues.apache.org/jira/browse/SPARK-22232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472714#comment-16472714 ] Bryan Cutler commented on SPARK-22232: -- I'm closing the PR for now, will reopen for Spark 3.0.0. The fix makes the behavior consistent but might cause a breaking change for some users. It's not necessary to put the fix in and safeguard with a config because there are workarounds. For example, this is a workaround for the code in the description: {code} from pyspark.sql.types import * from pyspark.sql import * UnsortedRow = Row("a", "c", "b") def toRow(i): return UnsortedRow("1", 3.0, 2) schema = StructType([ StructField("a", StringType(), False), StructField("c", FloatType(), False), StructField("b", IntegerType(), False), ]) rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i)) print rdd.repartition(3).toDF(schema).take(2) # [Row(a=u'1', c=3.0, b=2), Row(a=u'1', c=3.0, b=2)] {code} > Row objects in pyspark created using the `Row(**kwars)` syntax do not get > serialized/deserialized properly > -- > > Key: SPARK-22232 > URL: https://issues.apache.org/jira/browse/SPARK-22232 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Bago Amirbekian >Priority: Major > > The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should > be accessed by field name, not by position because {{Row.__new__}} sorts the > fields alphabetically by name. It seems like this promise is not being > honored when these Row objects are shuffled. I've included an example to help > reproduce the issue. > {code:none} > from pyspark.sql.types import * > from pyspark.sql import * > def toRow(i): > return Row(a="a", c=3.0, b=2) > schema = StructType([ > # Putting fields in alphabetical order masks the issue > StructField("a", StringType(), False), > StructField("c", FloatType(), False), > StructField("b", IntegerType(), False), > ]) > rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i)) > # As long as we don't shuffle things work fine. > print rdd.toDF(schema).take(2) > # If we introduce a shuffle we have issues > print rdd.repartition(3).toDF(schema).take(2) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24252) DataSourceV2: Add catalog support
[ https://issues.apache.org/jira/browse/SPARK-24252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472686#comment-16472686 ] Apache Spark commented on SPARK-24252: -- User 'rdblue' has created a pull request for this issue: https://github.com/apache/spark/pull/21306 > DataSourceV2: Add catalog support > - > > Key: SPARK-24252 > URL: https://issues.apache.org/jira/browse/SPARK-24252 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > Fix For: 2.4.0 > > > DataSourceV2 needs to support create and drop catalog operations in order to > support logical plans like CTAS. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24252) DataSourceV2: Add catalog support
[ https://issues.apache.org/jira/browse/SPARK-24252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24252: Assignee: Apache Spark > DataSourceV2: Add catalog support > - > > Key: SPARK-24252 > URL: https://issues.apache.org/jira/browse/SPARK-24252 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Assignee: Apache Spark >Priority: Major > Fix For: 2.4.0 > > > DataSourceV2 needs to support create and drop catalog operations in order to > support logical plans like CTAS. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24252) DataSourceV2: Add catalog support
[ https://issues.apache.org/jira/browse/SPARK-24252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24252: Assignee: (was: Apache Spark) > DataSourceV2: Add catalog support > - > > Key: SPARK-24252 > URL: https://issues.apache.org/jira/browse/SPARK-24252 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > Fix For: 2.4.0 > > > DataSourceV2 needs to support create and drop catalog operations in order to > support logical plans like CTAS. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24252) DataSourceV2: Add catalog support
Ryan Blue created SPARK-24252: - Summary: DataSourceV2: Add catalog support Key: SPARK-24252 URL: https://issues.apache.org/jira/browse/SPARK-24252 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.0 Reporter: Ryan Blue Fix For: 2.4.0 DataSourceV2 needs to support create and drop catalog operations in order to support logical plans like CTAS. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23321) DataSourceV2 should apply some validation when writing.
[ https://issues.apache.org/jira/browse/SPARK-23321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472675#comment-16472675 ] Ryan Blue commented on SPARK-23321: --- I've closed the PR associated with this because validating writes is going to get rolled into the new logical plans. I'm linking the new logical plan issues as dependencies of this one. > DataSourceV2 should apply some validation when writing. > --- > > Key: SPARK-23321 > URL: https://issues.apache.org/jira/browse/SPARK-23321 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > > DataSourceV2 writes are not validated. These writes should be validated using > the standard preprocess rules that are used for Hive and DataSource tables. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24251) DataSourceV2: Add AppendData logical operation
[ https://issues.apache.org/jira/browse/SPARK-24251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472655#comment-16472655 ] Apache Spark commented on SPARK-24251: -- User 'rdblue' has created a pull request for this issue: https://github.com/apache/spark/pull/21305 > DataSourceV2: Add AppendData logical operation > -- > > Key: SPARK-24251 > URL: https://issues.apache.org/jira/browse/SPARK-24251 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > Fix For: 2.4.0 > > > The SPIP to standardize SQL logical plans (SPARK-23521) proposes AppendData > for inserting data in append mode. This is the simplest plan to implement > first. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24251) DataSourceV2: Add AppendData logical operation
[ https://issues.apache.org/jira/browse/SPARK-24251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24251: Assignee: Apache Spark > DataSourceV2: Add AppendData logical operation > -- > > Key: SPARK-24251 > URL: https://issues.apache.org/jira/browse/SPARK-24251 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Assignee: Apache Spark >Priority: Major > Fix For: 2.4.0 > > > The SPIP to standardize SQL logical plans (SPARK-23521) proposes AppendData > for inserting data in append mode. This is the simplest plan to implement > first. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24251) DataSourceV2: Add AppendData logical operation
[ https://issues.apache.org/jira/browse/SPARK-24251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24251: Assignee: (was: Apache Spark) > DataSourceV2: Add AppendData logical operation > -- > > Key: SPARK-24251 > URL: https://issues.apache.org/jira/browse/SPARK-24251 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > Fix For: 2.4.0 > > > The SPIP to standardize SQL logical plans (SPARK-23521) proposes AppendData > for inserting data in append mode. This is the simplest plan to implement > first. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24251) DataSourceV2: Add AppendData logical operation
Ryan Blue created SPARK-24251: - Summary: DataSourceV2: Add AppendData logical operation Key: SPARK-24251 URL: https://issues.apache.org/jira/browse/SPARK-24251 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.0 Reporter: Ryan Blue Fix For: 2.4.0 The SPIP to standardize SQL logical plans (SPARK-23521) proposes AppendData for inserting data in append mode. This is the simplest plan to implement first. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24232) Allow referring to kubernetes secrets as env variable
[ https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472581#comment-16472581 ] Stavros Kontopoulos commented on SPARK-24232: - Cool makes sense. > Allow referring to kubernetes secrets as env variable > - > > Key: SPARK-24232 > URL: https://issues.apache.org/jira/browse/SPARK-24232 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Dharmesh Kakadia >Priority: Major > > Allow referring to kubernetes secrets in the driver process via environment > variables. This will allow developers to use secretes without leaking them in > the code and at the same time secrets can be decoupled and managed > separately. This can be used to refer to passwords, certificates etc while > talking to other service (jdbc passwords, storage keys etc). > So, at the deployment time, something like > ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified > which will make [EnvName].[key] available as an environment variable and in > the code its always referred as env variable [key]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24232) Allow referring to kubernetes secrets as env variable
[ https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472572#comment-16472572 ] Stavros Kontopoulos edited comment on SPARK-24232 at 5/11/18 8:12 PM: -- Ok I understand that need for users not to be surprised or we shouldnt break UX. A name though like spark.kubernetes.driver.secretKeyRef.SomeName is not self-explanatory as it does not add much from a semantics perspective compared to spark.kubernetes.driver.secrets.SomeName. I know secretKeyRef is used also in the Pods spec for env secrets, but personally I dont see it really that readable, but ok for people used to k8s might ring a bell fast ;) . Btw yeah executor secrets need fix as well. Will proceed with a new property, thanx. was (Author: skonto): Ok I understand that need for users not to be surprised or we shouldnt break UX. A name though like spark.kubernetes.driver.secretKeyRef.SomeName is not self-explanatory as it does not add much from a semantics perspective compared to spark.kubernetes.driver.secrets.SomeName. I know secretKeyRef is used also in the Pods spec for env secrets, but personally I dont see it really that readable, but ok for people used to k8s might ring a bell fast ;) . Btw yeah executor secrets need fix as well. Will proceed with a new property. > Allow referring to kubernetes secrets as env variable > - > > Key: SPARK-24232 > URL: https://issues.apache.org/jira/browse/SPARK-24232 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Dharmesh Kakadia >Priority: Major > > Allow referring to kubernetes secrets in the driver process via environment > variables. This will allow developers to use secretes without leaking them in > the code and at the same time secrets can be decoupled and managed > separately. This can be used to refer to passwords, certificates etc while > talking to other service (jdbc passwords, storage keys etc). > So, at the deployment time, something like > ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified > which will make [EnvName].[key] available as an environment variable and in > the code its always referred as env variable [key]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24232) Allow referring to kubernetes secrets as env variable
[ https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472574#comment-16472574 ] Yinan Li commented on SPARK-24232: -- As long as we document it clearly what is for, I think it's OK, particularly given that `secretKeyRef` is a well-known field name used by k8s. > Allow referring to kubernetes secrets as env variable > - > > Key: SPARK-24232 > URL: https://issues.apache.org/jira/browse/SPARK-24232 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Dharmesh Kakadia >Priority: Major > > Allow referring to kubernetes secrets in the driver process via environment > variables. This will allow developers to use secretes without leaking them in > the code and at the same time secrets can be decoupled and managed > separately. This can be used to refer to passwords, certificates etc while > talking to other service (jdbc passwords, storage keys etc). > So, at the deployment time, something like > ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified > which will make [EnvName].[key] available as an environment variable and in > the code its always referred as env variable [key]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24232) Allow referring to kubernetes secrets as env variable
[ https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472572#comment-16472572 ] Stavros Kontopoulos edited comment on SPARK-24232 at 5/11/18 8:08 PM: -- Ok I understand that need for users not to be surprised or we shouldnt break UX. A name though like spark.kubernetes.driver.secretKeyRef.SomeName is not self-explanatory as it does not add much from a semantics perspective compared to spark.kubernetes.driver.secrets.SomeName. I know secretKeyRef is used also in the Pods spec for env secrets, but personally I dont see it really that readable, but ok for people used to k8s might ring a bell fast ;) . Btw yeah executor secrets need fix as well. Will proceed with a new property. was (Author: skonto): Ok I understand that need for users not to be surprised or we shouldnt break UX. A name though like spark.kubernetes.driver.secretKeyRef.SomeName is not self-explanatory as it does not add much from a semantics perspective compared to spark.kubernetes.driver.secrets.SomeName. I know secretKeyRef is used also in the Pods spec for env secrets, but personally I dont see it really that readable, but ok for people used to k8s might ring a bell fast ;) . Btw yeah executor secrets need fix as well. > Allow referring to kubernetes secrets as env variable > - > > Key: SPARK-24232 > URL: https://issues.apache.org/jira/browse/SPARK-24232 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Dharmesh Kakadia >Priority: Major > > Allow referring to kubernetes secrets in the driver process via environment > variables. This will allow developers to use secretes without leaking them in > the code and at the same time secrets can be decoupled and managed > separately. This can be used to refer to passwords, certificates etc while > talking to other service (jdbc passwords, storage keys etc). > So, at the deployment time, something like > ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified > which will make [EnvName].[key] available as an environment variable and in > the code its always referred as env variable [key]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24232) Allow referring to kubernetes secrets as env variable
[ https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472572#comment-16472572 ] Stavros Kontopoulos commented on SPARK-24232: - Ok I understand that need for users not to be surprised or break something. A name though like spark.kubernetes.driver.secretKeyRef.SomeName is not self-explanatory as it does not add much from a semantics perspective to spark.kubernetes.driver.secrets.SomeName. I know secretKeyRef is used also in the Pods spec for env secrets, but personally I dont see it really that readable, but ok for people used to k8s might ring a bell fast ;) . Btw yeah executor secrets need fix as well. > Allow referring to kubernetes secrets as env variable > - > > Key: SPARK-24232 > URL: https://issues.apache.org/jira/browse/SPARK-24232 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Dharmesh Kakadia >Priority: Major > > Allow referring to kubernetes secrets in the driver process via environment > variables. This will allow developers to use secretes without leaking them in > the code and at the same time secrets can be decoupled and managed > separately. This can be used to refer to passwords, certificates etc while > talking to other service (jdbc passwords, storage keys etc). > So, at the deployment time, something like > ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified > which will make [EnvName].[key] available as an environment variable and in > the code its always referred as env variable [key]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24232) Allow referring to kubernetes secrets as env variable
[ https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472572#comment-16472572 ] Stavros Kontopoulos edited comment on SPARK-24232 at 5/11/18 8:07 PM: -- Ok I understand that need for users not to be surprised or we shouldnt break UX. A name though like spark.kubernetes.driver.secretKeyRef.SomeName is not self-explanatory as it does not add much from a semantics perspective compared to spark.kubernetes.driver.secrets.SomeName. I know secretKeyRef is used also in the Pods spec for env secrets, but personally I dont see it really that readable, but ok for people used to k8s might ring a bell fast ;) . Btw yeah executor secrets need fix as well. was (Author: skonto): Ok I understand that need for users not to be surprised or break something. A name though like spark.kubernetes.driver.secretKeyRef.SomeName is not self-explanatory as it does not add much from a semantics perspective to spark.kubernetes.driver.secrets.SomeName. I know secretKeyRef is used also in the Pods spec for env secrets, but personally I dont see it really that readable, but ok for people used to k8s might ring a bell fast ;) . Btw yeah executor secrets need fix as well. > Allow referring to kubernetes secrets as env variable > - > > Key: SPARK-24232 > URL: https://issues.apache.org/jira/browse/SPARK-24232 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Dharmesh Kakadia >Priority: Major > > Allow referring to kubernetes secrets in the driver process via environment > variables. This will allow developers to use secretes without leaking them in > the code and at the same time secrets can be decoupled and managed > separately. This can be used to refer to passwords, certificates etc while > talking to other service (jdbc passwords, storage keys etc). > So, at the deployment time, something like > ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified > which will make [EnvName].[key] available as an environment variable and in > the code its always referred as env variable [key]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24232) Allow referring to kubernetes secrets as env variable
[ https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472551#comment-16472551 ] Stavros Kontopoulos edited comment on SPARK-24232 at 5/11/18 7:57 PM: -- [~dharmesh.kakadia] I am working on adding something like the following: spark.kubernetes.\{driver,executor}.secrets.envKey.[EnvSecretName]= spark.kubernetes.\{driver, executor}.secrets.mountKey.[mountSecretName]= By default everything starting with prefix `spark.kubernetes.\{driver,executor}.secrets.` is now considered a mount secret. [~liyinan926] sounds ok? was (Author: skonto): [~dharmesh.kakadia] I am working on adding something like the following: spark.kubernetes.\{driver,executor}.secrets.envKey.[EnvSecretName]= spark.kubernetes.\{driver, executor}.secrets.mountKey.[mountSecretName]= By default everything starting with prefix `spark.kubernetes.driver.secrets.` is now considered a mount secret. [~liyinan926] sounds ok? > Allow referring to kubernetes secrets as env variable > - > > Key: SPARK-24232 > URL: https://issues.apache.org/jira/browse/SPARK-24232 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Dharmesh Kakadia >Priority: Major > > Allow referring to kubernetes secrets in the driver process via environment > variables. This will allow developers to use secretes without leaking them in > the code and at the same time secrets can be decoupled and managed > separately. This can be used to refer to passwords, certificates etc while > talking to other service (jdbc passwords, storage keys etc). > So, at the deployment time, something like > ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified > which will make [EnvName].[key] available as an environment variable and in > the code its always referred as env variable [key]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24232) Allow referring to kubernetes secrets as env variable
[ https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472561#comment-16472561 ] Yinan Li edited comment on SPARK-24232 at 5/11/18 7:55 PM: --- We should keep the current semantics of `spark.kubernetes.driver.secrets.=`. The proposal you have above is likely confusing to existing users who already use `spark.kubernetes.driver.secrets.=`. It also makes the code unnecessarily complicated. Like what I said on Slack, it's better to do this through a new property prefix, e.g., `spark.kubernetes.driver.secretKeyRef.`. We also need the same for executors. See [http://spark.apache.org/docs/latest/running-on-kubernetes.html#secret-management]. was (Author: liyinan926): We should keep the current semantics of `spark.kubernetes.driver.secrets.=`. The proposal you have above is a breaking change for existing users who already use `spark.kubernetes.driver.secrets.=`. Like what I said on Slack, it's better to do this through a new property prefix, e.g., `spark.kubernetes.driver.secretKeyRef.`. We also need the same for executors. See http://spark.apache.org/docs/latest/running-on-kubernetes.html#secret-management. > Allow referring to kubernetes secrets as env variable > - > > Key: SPARK-24232 > URL: https://issues.apache.org/jira/browse/SPARK-24232 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Dharmesh Kakadia >Priority: Major > > Allow referring to kubernetes secrets in the driver process via environment > variables. This will allow developers to use secretes without leaking them in > the code and at the same time secrets can be decoupled and managed > separately. This can be used to refer to passwords, certificates etc while > talking to other service (jdbc passwords, storage keys etc). > So, at the deployment time, something like > ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified > which will make [EnvName].[key] available as an environment variable and in > the code its always referred as env variable [key]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24232) Allow referring to kubernetes secrets as env variable
[ https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472561#comment-16472561 ] Yinan Li commented on SPARK-24232: -- We should keep the current semantics of `spark.kubernetes.driver.secrets.=`. The proposal you have above is a breaking change for existing users who already use `spark.kubernetes.driver.secrets.=`. Like what I said on Slack, it's better to do this through a new property prefix, e.g., `spark.kubernetes.driver.secretKeyRef.`. We also need the same for executors. See http://spark.apache.org/docs/latest/running-on-kubernetes.html#secret-management. > Allow referring to kubernetes secrets as env variable > - > > Key: SPARK-24232 > URL: https://issues.apache.org/jira/browse/SPARK-24232 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Dharmesh Kakadia >Priority: Major > > Allow referring to kubernetes secrets in the driver process via environment > variables. This will allow developers to use secretes without leaking them in > the code and at the same time secrets can be decoupled and managed > separately. This can be used to refer to passwords, certificates etc while > talking to other service (jdbc passwords, storage keys etc). > So, at the deployment time, something like > ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified > which will make [EnvName].[key] available as an environment variable and in > the code its always referred as env variable [key]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24232) Allow referring to kubernetes secrets as env variable
[ https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472551#comment-16472551 ] Stavros Kontopoulos edited comment on SPARK-24232 at 5/11/18 7:52 PM: -- [~dharmesh.kakadia] I am working on adding something like the following: spark.kubernetes.\{driver,executor}.secrets.envKey.[EnvSecretName]= spark.kubernetes.\{driver, executor}.secrets.mountKey.[mountSecretName]= By default everything starting with prefix `spark.kubernetes.driver.secrets.` is now considered a mount secret. [~liyinan926] sounds ok? was (Author: skonto): [~dharmesh.kakadia] I am working on adding something like the following: spark.kubernetes.driver.secrets.envKey.[EnvSecretName]= spark.kubernetes.driver.secrets.mountKey.[mountSecretName]= By default everything starting with prefix `spark.kubernetes.driver.secrets.` is now considered a mount secret. [~liyinan926] sounds ok? > Allow referring to kubernetes secrets as env variable > - > > Key: SPARK-24232 > URL: https://issues.apache.org/jira/browse/SPARK-24232 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Dharmesh Kakadia >Priority: Major > > Allow referring to kubernetes secrets in the driver process via environment > variables. This will allow developers to use secretes without leaking them in > the code and at the same time secrets can be decoupled and managed > separately. This can be used to refer to passwords, certificates etc while > talking to other service (jdbc passwords, storage keys etc). > So, at the deployment time, something like > ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified > which will make [EnvName].[key] available as an environment variable and in > the code its always referred as env variable [key]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24232) Allow referring to kubernetes secrets as env variable
[ https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472551#comment-16472551 ] Stavros Kontopoulos edited comment on SPARK-24232 at 5/11/18 7:50 PM: -- [~dharmesh.kakadia] I am working on adding something like the following: spark.kubernetes.driver.secrets.envKey.[EnvSecretName]= spark.kubernetes.driver.secrets.mountKey.[mountSecretName]= By default everything starting with prefix `spark.kubernetes.driver.secrets.` is now considered a mount secret. [~liyinan926] sounds ok? was (Author: skonto): [~dharmesh.kakadia] I am working on adding something like the following: spark.kubernetes.driver.secrets.envKey.[EnvSecretName]= spark.kubernetes.driver.secrets.mountKey.[mountSecretName]= By default everything starting with prefix `spark.kubernetes.driver.secrets.` is now considered a mount secret. [~liyinan926] From a first glance I don't see any executor secrets, no need for them? > Allow referring to kubernetes secrets as env variable > - > > Key: SPARK-24232 > URL: https://issues.apache.org/jira/browse/SPARK-24232 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Dharmesh Kakadia >Priority: Major > > Allow referring to kubernetes secrets in the driver process via environment > variables. This will allow developers to use secretes without leaking them in > the code and at the same time secrets can be decoupled and managed > separately. This can be used to refer to passwords, certificates etc while > talking to other service (jdbc passwords, storage keys etc). > So, at the deployment time, something like > ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified > which will make [EnvName].[key] available as an environment variable and in > the code its always referred as env variable [key]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24232) Allow referring to kubernetes secrets as env variable
[ https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472551#comment-16472551 ] Stavros Kontopoulos edited comment on SPARK-24232 at 5/11/18 7:48 PM: -- [~dharmesh.kakadia] I am working on adding something like the following: spark.kubernetes.driver.secrets.envKey.[EnvSecretName]= spark.kubernetes.driver.secrets.mountKey.[mountSecretName]= By default everything starting with prefix `spark.kubernetes.driver.secrets.` is now considered a mount secret. [~liyinan926] From a first glance I don't see any executor secrets, no need for them? was (Author: skonto): [~dharmesh.kakadia] I am working on adding something like the following: spark.kubernetes.driver.secrets.envKey.[EnvSecretName]= spark.kubernetes.driver.secrets.mountKey.[mountSecretName]= By default everything under prefix spark.kubernetes.driver.secrets. is now mount secrets. [~liyinan926] From a first glance I don't see any executor secrets, no need for them? > Allow referring to kubernetes secrets as env variable > - > > Key: SPARK-24232 > URL: https://issues.apache.org/jira/browse/SPARK-24232 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Dharmesh Kakadia >Priority: Major > > Allow referring to kubernetes secrets in the driver process via environment > variables. This will allow developers to use secretes without leaking them in > the code and at the same time secrets can be decoupled and managed > separately. This can be used to refer to passwords, certificates etc while > talking to other service (jdbc passwords, storage keys etc). > So, at the deployment time, something like > ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified > which will make [EnvName].[key] available as an environment variable and in > the code its always referred as env variable [key]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24232) Allow referring to kubernetes secrets as env variable
[ https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472551#comment-16472551 ] Stavros Kontopoulos edited comment on SPARK-24232 at 5/11/18 7:47 PM: -- [~dharmesh.kakadia] I am working on adding something like the following: spark.kubernetes.driver.secrets.envKey.[EnvSecretName]= spark.kubernetes.driver.secrets.mountKey.[mountSecretName]= By default everything under prefix spark.kubernetes.driver.secrets. is now mount secrets. [~liyinan926] From a first glance I don't see any executor secrets, no need for them? was (Author: skonto): [~dharmesh.kakadia] I am working on adding something like the following: spark.kubernetes.driver.secrets.envKey.[EnvSecretName]= spark.kubernetes.driver.secrets.mountKey.[mountSecretName]= By default everything under prefix spark.kubernetes.driver.secrets. is now mount secrets. > Allow referring to kubernetes secrets as env variable > - > > Key: SPARK-24232 > URL: https://issues.apache.org/jira/browse/SPARK-24232 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Dharmesh Kakadia >Priority: Major > > Allow referring to kubernetes secrets in the driver process via environment > variables. This will allow developers to use secretes without leaking them in > the code and at the same time secrets can be decoupled and managed > separately. This can be used to refer to passwords, certificates etc while > talking to other service (jdbc passwords, storage keys etc). > So, at the deployment time, something like > ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified > which will make [EnvName].[key] available as an environment variable and in > the code its always referred as env variable [key]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24232) Allow referring to kubernetes secrets as env variable
[ https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472551#comment-16472551 ] Stavros Kontopoulos edited comment on SPARK-24232 at 5/11/18 7:44 PM: -- [~dharmesh.kakadia] I am working on adding something like the following: spark.kubernetes.driver.secrets.envKey.[EnvSecretName]= spark.kubernetes.driver.secrets.mountKey.[mountSecretName]= By default everything under prefix spark.kubernetes.driver.secrets. is now mount secrets. was (Author: skonto): [~dharmesh.kakadia] I am working on adding something like the following: spark.kubernetes.driver.secrets.envKey.[EnvSecretName]= spark.kubernetes.driver.secrets.mountKey.[mountSecretName]= > Allow referring to kubernetes secrets as env variable > - > > Key: SPARK-24232 > URL: https://issues.apache.org/jira/browse/SPARK-24232 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Dharmesh Kakadia >Priority: Major > > Allow referring to kubernetes secrets in the driver process via environment > variables. This will allow developers to use secretes without leaking them in > the code and at the same time secrets can be decoupled and managed > separately. This can be used to refer to passwords, certificates etc while > talking to other service (jdbc passwords, storage keys etc). > So, at the deployment time, something like > ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified > which will make [EnvName].[key] available as an environment variable and in > the code its always referred as env variable [key]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24232) Allow referring to kubernetes secrets as env variable
[ https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472551#comment-16472551 ] Stavros Kontopoulos commented on SPARK-24232: - [~dharmesh.kakadia] I am working on adding something like the following: spark.kubernetes.driver.secrets.envKey.[EnvSecretName]= spark.kubernetes.driver.secrets.mountKey.[mountSecretName]= > Allow referring to kubernetes secrets as env variable > - > > Key: SPARK-24232 > URL: https://issues.apache.org/jira/browse/SPARK-24232 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Dharmesh Kakadia >Priority: Major > > Allow referring to kubernetes secrets in the driver process via environment > variables. This will allow developers to use secretes without leaking them in > the code and at the same time secrets can be decoupled and managed > separately. This can be used to refer to passwords, certificates etc while > talking to other service (jdbc passwords, storage keys etc). > So, at the deployment time, something like > ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified > which will make [EnvName].[key] available as an environment variable and in > the code its always referred as env variable [key]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10145) Executor exit without useful messages when spark runs in spark-streaming
[ https://issues.apache.org/jira/browse/SPARK-10145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-10145. Resolution: Unresolved I'm closing this since a lot in this area has changed since this bug was reported. If issues like this still happen they're bound to look a lot different, so we'd need new (and more complete) logs to debug them. > Executor exit without useful messages when spark runs in spark-streaming > > > Key: SPARK-10145 > URL: https://issues.apache.org/jira/browse/SPARK-10145 > Project: Spark > Issue Type: Bug > Components: DStreams, YARN > Environment: spark 1.3.1, hadoop 2.6.0, 6 nodes, each node has 32 > cores and 32g memory >Reporter: Baogang Wang >Priority: Critical > Original Estimate: 168h > Remaining Estimate: 168h > > Each node is allocated 30g memory by Yarn. > My application receives messages from Kafka by directstream. Each application > consists of 4 dstream window > Spark application is submitted by this command: > spark-submit --class spark_security.safe.SafeSockPuppet --driver-memory 3g > --executor-memory 3g --num-executors 3 --executor-cores 4 --name > safeSparkDealerUser --master yarn --deploy-mode cluster > spark_Security-1.0-SNAPSHOT.jar.nocalse > hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/spark_properties/safedealer.properties > After about 1 hours, some executor exits. There is no more yarn logs after > the executor exits and there is no stack when the executor exits. > When I see the yarn node manager log, it shows as follows : > 2015-08-17 17:25:41,550 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Start request for container_1439803298368_0005_01_01 by user root > 2015-08-17 17:25:41,551 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Creating a new application reference for app application_1439803298368_0005 > 2015-08-17 17:25:41,551 INFO > org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root > IP=172.19.160.102 OPERATION=Start Container Request > TARGET=ContainerManageImpl RESULT=SUCCESS > APPID=application_1439803298368_0005 > CONTAINERID=container_1439803298368_0005_01_01 > 2015-08-17 17:25:41,551 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: > Application application_1439803298368_0005 transitioned from NEW to INITING > 2015-08-17 17:25:41,552 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: > Adding container_1439803298368_0005_01_01 to application > application_1439803298368_0005 > 2015-08-17 17:25:41,557 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: > rollingMonitorInterval is set as -1. The log rolling mornitoring interval is > disabled. The logs will be aggregated after this application is finished. > 2015-08-17 17:25:41,663 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: > Application application_1439803298368_0005 transitioned from INITING to > RUNNING > 2015-08-17 17:25:41,664 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: > Container container_1439803298368_0005_01_01 transitioned from NEW to > LOCALIZING > 2015-08-17 17:25:41,664 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got > event CONTAINER_INIT for appId application_1439803298368_0005 > 2015-08-17 17:25:41,664 INFO > org.apache.spark.network.yarn.YarnShuffleService: Initializing container > container_1439803298368_0005_01_01 > 2015-08-17 17:25:41,665 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: > Resource > hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/user/root/.sparkStaging/application_1439803298368_0005/spark-assembly-1.3.1-hadoop2.6.0.jar > transitioned from INIT to DOWNLOADING > 2015-08-17 17:25:41,665 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: > Resource > hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/user/root/.sparkStaging/application_1439803298368_0005/spark_Security-1.0-SNAPSHOT.jar > transitioned from INIT to DOWNLOADING > 2015-08-17 17:25:41,665 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: > Created localizer for container_1439803298368_0005_01_01 > 2015-08-17 17:25:41,668 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: > Writing credentials to the nmPrivate file > /export/servers/hadoop2.6.0/tmp/nm-local-dir/nmPrivate/container_1439803298368_0005_01_01.t
[jira] [Commented] (SPARK-23931) High-order function: zip(array1, array2[, ...]) → array
[ https://issues.apache.org/jira/browse/SPARK-23931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472514#comment-16472514 ] Dylan Guedes commented on SPARK-23931: -- [~mn-mikke] I updated with a working version! Would you mind in giving a feedback/suggestion? I've decided to use an array of structs since Java doesn't handle well Scala Tuple2's, but to be fair I'm not sure if it is the best choice. > High-order function: zip(array1, array2[, ...]) → array > > > Key: SPARK-23931 > URL: https://issues.apache.org/jira/browse/SPARK-23931 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Merges the given arrays, element-wise, into a single array of rows. The M-th > element of the N-th argument will be the N-th field of the M-th output > element. If the arguments have an uneven length, missing values are filled > with NULL. > {noformat} > SELECT zip(ARRAY[1, 2], ARRAY['1b', null, '3b']); -- [ROW(1, '1b'), ROW(2, > null), ROW(null, '3b')] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23852) Parquet MR bug can lead to incorrect SQL results
[ https://issues.apache.org/jira/browse/SPARK-23852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472494#comment-16472494 ] Apache Spark commented on SPARK-23852: -- User 'henryr' has created a pull request for this issue: https://github.com/apache/spark/pull/21302 > Parquet MR bug can lead to incorrect SQL results > > > Key: SPARK-23852 > URL: https://issues.apache.org/jira/browse/SPARK-23852 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Henry Robinson >Assignee: Ryan Blue >Priority: Blocker > Labels: correctness > Fix For: 2.4.0 > > > Parquet MR 1.9.0 and 1.8.2 both have a bug, PARQUET-1217, that means that > pushing certain predicates to Parquet scanners can return fewer results than > they should. > The bug triggers in Spark when: > * The Parquet file being scanner has stats for the null count, but not the > max or min on the column with the predicate (Apache Impala writes files like > this). > * The vectorized Parquet reader path is not taken, and the parquet-mr reader > is used. > * A suitable <, <=, > or >= predicate is pushed down to Parquet. > The bug is that the parquet-mr interprets the max and min of a row-group's > column as 0 in the absence of stats. So {{col > 0}} will filter all results, > even if some are > 0. > There is no upstream release of Parquet that contains the fix for > PARQUET-1217, although a 1.10 release is planned. > The least impactful workaround is to set the Parquet configuration > {{parquet.filter.stats.enabled}} to {{false}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13007) Document where configuration / properties are read and applied
[ https://issues.apache.org/jira/browse/SPARK-13007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-13007: --- Priority: Major (was: Critical) > Document where configuration / properties are read and applied > -- > > Key: SPARK-13007 > URL: https://issues.apache.org/jira/browse/SPARK-13007 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.6.0 >Reporter: Alan Braithwaite >Priority: Major > > While spark is well documented for the most part, often times I have trouble > determining where a configuration applies. > For example, when setting spark.dynamicAllocation.enabled , does it > always apply to the entire cluster manager, or is it possible to configure it > on a per-job level? > Different levels I can think of: > Application > Driver > Executor > Worker > Cluster > And I'm sure there are more. This could be just another column in the > configuration page. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9139) Add backwards-compatibility tests for DataType.fromJson()
[ https://issues.apache.org/jira/browse/SPARK-9139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-9139: -- Priority: Major (was: Critical) > Add backwards-compatibility tests for DataType.fromJson() > - > > Key: SPARK-9139 > URL: https://issues.apache.org/jira/browse/SPARK-9139 > Project: Spark > Issue Type: Test > Components: SQL >Reporter: Josh Rosen >Priority: Major > > SQL's DataType.fromJson is a public API and thus must be > backwards-compatible; there are also backwards-compatibility concerns related > to persistence of DataType JSON in metastores. > Unfortunately, we do not have any backwards-compatibility tests which attempt > to read old JSON values that were written by earlier versions of Spark. > DataTypeSuite has "roundtrip" tests that test fromJson(toJson(foo)), but this > doesn't ensure compatibility. > I think that we should address this by capuring the JSON strings produced in > Spark 1.3's DataFrameSuite and adding test cases that try to create DataTypes > from those strings. > This might be a good starter task for someone who wants to contribute to SQL > tests. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8487) Update reduceByKeyAndWindow docs to highlight that filtering Function must be used
[ https://issues.apache.org/jira/browse/SPARK-8487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-8487: -- Priority: Major (was: Critical) > Update reduceByKeyAndWindow docs to highlight that filtering Function must be > used > -- > > Key: SPARK-8487 > URL: https://issues.apache.org/jira/browse/SPARK-8487 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3528) Reading data from file:/// should be called NODE_LOCAL not PROCESS_LOCAL
[ https://issues.apache.org/jira/browse/SPARK-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-3528: -- Priority: Major (was: Critical) > Reading data from file:/// should be called NODE_LOCAL not PROCESS_LOCAL > > > Key: SPARK-3528 > URL: https://issues.apache.org/jira/browse/SPARK-3528 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Ash >Priority: Major > > Note that reading from {{file:///.../pom.xml}} is called a PROCESS_LOCAL task > {noformat} > spark> sc.textFile("pom.xml").count > ... > 14/09/15 00:59:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, > localhost, PROCESS_LOCAL, 1191 bytes) > 14/09/15 00:59:13 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, > localhost, PROCESS_LOCAL, 1191 bytes) > 14/09/15 00:59:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) > 14/09/15 00:59:13 INFO Executor: Running task 1.0 in stage 0.0 (TID 1) > 14/09/15 00:59:13 INFO HadoopRDD: Input split: > file:/Users/aash/git/spark/pom.xml:20862+20863 > 14/09/15 00:59:13 INFO HadoopRDD: Input split: > file:/Users/aash/git/spark/pom.xml:0+20862 > {noformat} > There is an outstanding TODO in {{HadoopRDD.scala}} that may be related: > {noformat} > override def getPreferredLocations(split: Partition): Seq[String] = { > // TODO: Filtering out "localhost" in case of file:// URLs > val hadoopSplit = split.asInstanceOf[HadoopPartition] > hadoopSplit.inputSplit.value.getLocations.filter(_ != "localhost") > } > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21758) `SHOW TBLPROPERTIES` can not get properties start with spark.sql.*
[ https://issues.apache.org/jira/browse/SPARK-21758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-21758: --- Priority: Major (was: Critical) > `SHOW TBLPROPERTIES` can not get properties start with spark.sql.* > -- > > Key: SPARK-21758 > URL: https://issues.apache.org/jira/browse/SPARK-21758 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.1.1, 2.2.0 >Reporter: StanZhai >Priority: Major > > SQL: SHOW TBLPROPERTIES test_tb("spark.sql.sources.schema.numParts") > Exception: Table test_db.test.tb does not have property: > spark.sql.sources.schema.numParts > The `spark.sql.sources.schema.numParts` property exactly exists in > HiveMetastore. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24067) Backport SPARK-17147 to 2.3 (Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction))
[ https://issues.apache.org/jira/browse/SPARK-24067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cody Koeninger resolved SPARK-24067. Resolution: Fixed Fix Version/s: 2.3.1 Issue resolved by pull request 21300 [https://github.com/apache/spark/pull/21300] > Backport SPARK-17147 to 2.3 (Spark Streaming Kafka 0.10 Consumer Can't Handle > Non-consecutive Offsets (i.e. Log Compaction)) > > > Key: SPARK-24067 > URL: https://issues.apache.org/jira/browse/SPARK-24067 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.3.0 >Reporter: Joachim Hereth >Assignee: Cody Koeninger >Priority: Major > Fix For: 2.3.1 > > > SPARK-17147 fixes a problem with non-consecutive Kafka Offsets. The [PR > w|https://github.com/apache/spark/pull/20572]as merged to {{master}}. This > should be backported to 2.3. > > Original Description from SPARK-17147 : > > When Kafka does log compaction offsets often end up with gaps, meaning the > next requested offset will be frequently not be offset+1. The logic in > KafkaRDD & CachedKafkaConsumer has a baked in assumption that the next offset > will always be just an increment of 1 above the previous offset. > I have worked around this problem by changing CachedKafkaConsumer to use the > returned record's offset, from: > {{nextOffset = offset + 1}} > to: > {{nextOffset = record.offset + 1}} > and changed KafkaRDD from: > {{requestOffset += 1}} > to: > {{requestOffset = r.offset() + 1}} > (I also had to change some assert logic in CachedKafkaConsumer). > There's a strong possibility that I have misconstrued how to use the > streaming kafka consumer, and I'm happy to close this out if that's the case. > If, however, it is supposed to support non-consecutive offsets (e.g. due to > log compaction) I am also happy to contribute a PR. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23771) Uneven Rowgroup size after repartition
[ https://issues.apache.org/jira/browse/SPARK-23771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-23771: --- Priority: Major (was: Critical) > Uneven Rowgroup size after repartition > -- > > Key: SPARK-23771 > URL: https://issues.apache.org/jira/browse/SPARK-23771 > Project: Spark > Issue Type: Bug > Components: Input/Output, Shuffle, SQL >Affects Versions: 1.6.0 > Environment: Cloudera CDH 5.13.1 >Reporter: Johannes Mayer >Priority: Major > Original Estimate: 168h > Remaining Estimate: 168h > > I have a Hive table on AVRO files, that i want to read and store as a > partitioned parquet files (one file per partition). > What i do is: > {code:java} > // read the AVRO table and distribute by the partition column > val data = sql("select * from avro_table distribute by part_col") > > // write data as partitioned parquet files > data.write.partitionBy(part_col).parquet("output/path/") > {code} > > I get one file per partition as expected. But often I run into OutOfMemory > Errors. Investigating the issue I found out, that some row groups are very > big and since all data of a row group is held in memory before it is flushed > to disk, i think this causes the OutOfMemory. Other row groups are very > small, containing almost no data. See the output from parquet-tools meta: > > {code:java} > row group 1: RC:5740100 TS:566954562 OFFSET:4 > row group 2: RC:33769 TS:2904145 OFFSET:117971092 > row group 3: RC:31822 TS:2772650 OFFSET:118905225 > row group 4: RC:29854 TS:2704127 OFFSET:119793188 > row group 5: RC:28050 TS:2356729 OFFSET:120660675 > row group 6: RC:26507 TS:2111983 OFFSET:121406541 > row group 7: RC:25143 TS:1967731 OFFSET:122069351 > row group 8: RC:23876 TS:1991238 OFFSET:122682160 > row group 9: RC:22584 TS:2069463 OFFSET:123303246 > row group 10: RC:21225 TS:1955748 OFFSET:123960700 > row group 11: RC:19960 TS:1931889 OFFSET:124575333 > row group 12: RC:18806 TS:1725871 OFFSET:125132862 > row group 13: RC:17719 TS:1653309 OFFSET:125668057 > row group 14: RC:1617743 TS:157973949 OFFSET:134217728{code} > > One thing to notice is, that this file was written in a Spark application > running on 13 executors. Is it possible, that local data is in the big row > group and the remote reads go into seperate (small) row groups? The shuffle > is involved, because data is read with distribute by clause. > > Is this a known bug? Is there a workaround to get even row group sizes? I > want to decrease the row group size using > sc.hadoopConfiguration.setInt("parquet.block.size", 64 * 1024 * 1024) > > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23606) Flakey FileBasedDataSourceSuite
[ https://issues.apache.org/jira/browse/SPARK-23606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-23606: --- Priority: Major (was: Critical) > Flakey FileBasedDataSourceSuite > --- > > Key: SPARK-23606 > URL: https://issues.apache.org/jira/browse/SPARK-23606 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Henry Robinson >Priority: Major > > I've seen the following exception twice today in PR builds (one example: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87978/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/). > It's not deterministic, as I've had one PR build pass in the same span. > {code:java} > sbt.ForkMain$ForkError: > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 15 times over 10.016101897 > seconds. Last failure message: There are 1 possibly leaked file streams.. > at > org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421) > at > org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439) > at > org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:30) > at > org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:308) > at > org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:30) > at > org.apache.spark.sql.test.SharedSparkSession$class.afterEach(SharedSparkSession.scala:114) > at > org.apache.spark.sql.FileBasedDataSourceSuite.afterEach(FileBasedDataSourceSuite.scala:30) > at > org.scalatest.BeforeAndAfterEach$$anonfun$1.apply$mcV$sp(BeforeAndAfterEach.scala:234) > at > org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:379) > at > org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:375) > at org.scalatest.SucceededStatus$.whenCompleted(Status.scala:454) > at org.scalatest.Status$class.withAfterEffect(Status.scala:375) > at org.scalatest.SucceededStatus$.withAfterEffect(Status.scala:426) > at > org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:232) > at > org.apache.spark.sql.FileBasedDataSourceSuite.runTest(FileBasedDataSourceSuite.scala:30) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384) > at scala.collection.immutable.List.foreach(List.scala:381) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229) > at org.scalatest.FunSuite.runTests(FunSuite.scala:1560) > at org.scalatest.Suite$class.run(Suite.scala:1147) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) > at org.scalatest.SuperEngine.runImpl(Engine.scala:521) > at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233) > at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:52) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52) > at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480) > at sbt.ForkMain$Run$2.call(ForkMain.java:296) > at sbt.ForkMain$Run$2.call(ForkMain.java:286) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: sbt.ForkMain$ForkError: java.lang.IllegalStateException: There are > 1 possibly leaked file streams. > at > org.apache.spark.DebugFilesy
[jira] [Resolved] (SPARK-24229) Upgrade to the latest Apache Thrift 0.10.0 release
[ https://issues.apache.org/jira/browse/SPARK-24229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-24229. Resolution: Not A Problem That affects the "Apache Thrift Go client library", which is not used by Spark. > Upgrade to the latest Apache Thrift 0.10.0 release > -- > > Key: SPARK-24229 > URL: https://issues.apache.org/jira/browse/SPARK-24229 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.3.0 >Reporter: Ray Donnelly >Priority: Critical > > According to [https://www.cvedetails.com/cve/CVE-2016-5397/] > > .. there are critical vulnerabilities in libthrift 0.9.3 currently vendored > in Apache Spark (and then, for us, into PySpark). > > Can anyone help to assess the seriousness of this and what should be done > about it? > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics
[ https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472405#comment-16472405 ] Edwina Lu commented on SPARK-23206: --- The design discussion for SPARK-23206 is scheduled for Monday, May 15 at 11am PDT (6pm UTC). https://linkedin.bluejeans.com/1886759322/ > Additional Memory Tuning Metrics > > > Key: SPARK-23206 > URL: https://issues.apache.org/jira/browse/SPARK-23206 > Project: Spark > Issue Type: Umbrella > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Edwina Lu >Priority: Major > Attachments: ExecutorsTab.png, ExecutorsTab2.png, > MemoryTuningMetricsDesignDoc.pdf, SPARK-23206 Design Doc.pdf, StageTab.png > > > At LinkedIn, we have multiple clusters, running thousands of Spark > applications, and these numbers are growing rapidly. We need to ensure that > these Spark applications are well tuned – cluster resources, including > memory, should be used efficiently so that the cluster can support running > more applications concurrently, and applications should run quickly and > reliably. > Currently there is limited visibility into how much memory executors are > using, and users are guessing numbers for executor and driver memory sizing. > These estimates are often much larger than needed, leading to memory wastage. > Examining the metrics for one cluster for a month, the average percentage of > used executor memory (max JVM used memory across executors / > spark.executor.memory) is 35%, leading to an average of 591GB unused memory > per application (number of executors * (spark.executor.memory - max JVM used > memory)). Spark has multiple memory regions (user memory, execution memory, > storage memory, and overhead memory), and to understand how memory is being > used and fine-tune allocation between regions, it would be useful to have > information about how much memory is being used for the different regions. > To improve visibility into memory usage for the driver and executors and > different memory regions, the following additional memory metrics can be be > tracked for each executor and driver: > * JVM used memory: the JVM heap size for the executor/driver. > * Execution memory: memory used for computation in shuffles, joins, sorts > and aggregations. > * Storage memory: memory used caching and propagating internal data across > the cluster. > * Unified memory: sum of execution and storage memory. > The peak values for each memory metric can be tracked for each executor, and > also per stage. This information can be shown in the Spark UI and the REST > APIs. Information for peak JVM used memory can help with determining > appropriate values for spark.executor.memory and spark.driver.memory, and > information about the unified memory region can help with determining > appropriate values for spark.memory.fraction and > spark.memory.storageFraction. Stage memory information can help identify > which stages are most memory intensive, and users can look into the relevant > code to determine if it can be optimized. > The memory metrics can be gathered by adding the current JVM used memory, > execution memory and storage memory to the heartbeat. SparkListeners are > modified to collect the new metrics for the executors, stages and Spark > history log. Only interesting values (peak values per stage per executor) are > recorded in the Spark history log, to minimize the amount of additional > logging. > We have attached our design documentation with this ticket and would like to > receive feedback from the community for this proposal. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20922) Unsafe deserialization in Spark LauncherConnection
[ https://issues.apache.org/jira/browse/SPARK-20922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472393#comment-16472393 ] Marcelo Vanzin commented on SPARK-20922: You should also be able to use just the spark-launcher library from a 2.x version to launch Spark 1.6 jobs, without having to update the rest of the Spark dependencies. > Unsafe deserialization in Spark LauncherConnection > -- > > Key: SPARK-20922 > URL: https://issues.apache.org/jira/browse/SPARK-20922 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.1.1 >Reporter: Aditya Sharad >Assignee: Marcelo Vanzin >Priority: Major > Labels: security > Fix For: 2.0.3, 2.1.2, 2.2.0, 2.3.0 > > Attachments: spark-deserialize-master.zip > > > The {{run()}} method of the class > {{org.apache.spark.launcher.LauncherConnection}} performs unsafe > deserialization of data received by its socket. This makes Spark applications > launched programmatically using the {{SparkLauncher}} framework potentially > vulnerable to remote code execution by an attacker with access to any user > account on the local machine. Such an attacker could send a malicious > serialized Java object to multiple ports on the local machine, and if this > port matches the one (randomly) chosen by the Spark launcher, the malicious > object will be deserialized. By making use of gadget chains in code present > on the Spark application classpath, the deserialization process can lead to > RCE or privilege escalation. > This vulnerability is identified by the “Unsafe deserialization” rule on > lgtm.com: > https://lgtm.com/projects/g/apache/spark/snapshot/80fdc2c9d1693f5b3402a79ca4ec76f6e422ff13/files/launcher/src/main/java/org/apache/spark/launcher/LauncherConnection.java#V58 > > Attached is a proof-of-concept exploit involving a simple > {{SparkLauncher}}-based application and a known gadget chain in the Apache > Commons Beanutils library referenced by Spark. > See the readme file for demonstration instructions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24233) union operation on read of dataframe does nor produce correct result
[ https://issues.apache.org/jira/browse/SPARK-24233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472326#comment-16472326 ] smohr003 commented on SPARK-24233: -- added > union operation on read of dataframe does nor produce correct result > - > > Key: SPARK-24233 > URL: https://issues.apache.org/jira/browse/SPARK-24233 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: smohr003 >Priority: Major > > I know that I can use wild card * to read all subfolders. But, I am trying to > use .par and .schema to speed up the read process. > val absolutePath = "adl://datalakename.azuredatalakestore.net/testU/" > Seq((1, "one"), (2, "two")).toDF("k", > "v").write.mode("overwrite").parquet(absolutePath + "1") > Seq((11, "one"), (22, "two")).toDF("k", > "v").write.mode("overwrite").parquet(absolutePath + "2") > Seq((111, "one"), (222, "two")).toDF("k", > "v").write.mode("overwrite").parquet(absolutePath + "3") > Seq((, "one"), (, "two")).toDF("k", > "v").write.mode("overwrite").parquet(absolutePath + "4") > Seq((2, "one"), (2, "two")).toDF("k", > "v").write.mode("overwrite").parquet(absolutePath + "5") > > import org.apache.hadoop.conf.Configuration > import org.apache.hadoop.fs.\{FileSystem, Path} > import java.net.URI > def readDir(path: String): DataFrame = > { val fs = FileSystem.get(new URI(path), new Configuration()) val subDir = > fs.listStatus(new Path(path)).map(i => i.getPath.toString) var df = > spark.read.parquet(subDir.head) val dfSchema = df.schema > subDir.tail.par.foreach(p => df = > df.union(spark.read.schema(dfSchema).parquet(p)).select(df.columns.head, > df.columns.tail:_*)) df } > val dfAll = readDir(absolutePath) > dfAll.count > The count of produced dfAll is 4, which in this example should be 10. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22918) sbt test (spark - local) fail after upgrading to 2.2.1 with: java.security.AccessControlException: access denied org.apache.derby.security.SystemPermission( "engine",
[ https://issues.apache.org/jira/browse/SPARK-22918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472298#comment-16472298 ] Mihaly Toth commented on SPARK-22918: - Yep, probably anybody who introduces a SecurityManager needs to grant the above permission. I am just wondering why such implementation details are propagating up in the architecture. Is there any added level of security in addition to the file system level security below derby? Should not there be at least a configuration option to disable such security checks? > sbt test (spark - local) fail after upgrading to 2.2.1 with: > java.security.AccessControlException: access denied > org.apache.derby.security.SystemPermission( "engine", "usederbyinternals" ) > > > Key: SPARK-22918 > URL: https://issues.apache.org/jira/browse/SPARK-22918 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Damian Momot >Priority: Major > > After upgrading 2.2.0 -> 2.2.1 sbt test command in one of my projects started > to fail with following exception: > {noformat} > java.security.AccessControlException: access denied > org.apache.derby.security.SystemPermission( "engine", "usederbyinternals" ) > at > java.security.AccessControlContext.checkPermission(AccessControlContext.java:472) > at > java.security.AccessController.checkPermission(AccessController.java:884) > at > org.apache.derby.iapi.security.SecurityUtil.checkDerbyInternalsPrivilege(Unknown > Source) > at org.apache.derby.iapi.services.monitor.Monitor.startMonitor(Unknown > Source) > at org.apache.derby.iapi.jdbc.JDBCBoot$1.run(Unknown Source) > at java.security.AccessController.doPrivileged(Native Method) > at org.apache.derby.iapi.jdbc.JDBCBoot.boot(Unknown Source) > at org.apache.derby.iapi.jdbc.JDBCBoot.boot(Unknown Source) > at org.apache.derby.jdbc.EmbeddedDriver.boot(Unknown Source) > at org.apache.derby.jdbc.EmbeddedDriver.(Unknown Source) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at java.lang.Class.newInstance(Class.java:442) > at > org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:47) > at > org.datanucleus.store.rdbms.connectionpool.BoneCPConnectionPoolFactory.createConnectionPool(BoneCPConnectionPoolFactory.java:54) > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:238) > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:131) > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.(ConnectionFactoryImpl.java:85) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631) > at > org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:325) > at > org.datanucleus.store.AbstractStoreManager.registerConnectionFactory(AbstractStoreManager.java:282) > at > org.datanucleus.store.AbstractStoreManager.(AbstractStoreManager.java:240) > at > org.datanucleus.store.rdbms.RDBMSStoreManager.(RDBMSStoreManager.java:286) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631) > at > org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301) > at > org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187
[jira] [Commented] (SPARK-21569) Internal Spark class needs to be kryo-registered
[ https://issues.apache.org/jira/browse/SPARK-21569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472295#comment-16472295 ] Ted Yu commented on SPARK-21569: What would be workaround ? Thanks > Internal Spark class needs to be kryo-registered > > > Key: SPARK-21569 > URL: https://issues.apache.org/jira/browse/SPARK-21569 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Ryan Williams >Priority: Major > > [Full repro here|https://github.com/ryan-williams/spark-bugs/tree/hf] > As of 2.2.0, {{saveAsNewAPIHadoopFile}} jobs fail (when > {{spark.kryo.registrationRequired=true}}) with: > {code} > java.lang.IllegalArgumentException: Class is not registered: > org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage > Note: To register this class use: > kryo.register(org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage.class); > at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:458) > at > com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79) > at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:488) > at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:593) > at > org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:315) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:383) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > This internal Spark class should be kryo-registered by Spark by default. > This was not a problem in 2.1.1. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24172) we should not apply operator pushdown to data source v2 many times
[ https://issues.apache.org/jira/browse/SPARK-24172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24172. - Resolution: Fixed Fix Version/s: 2.4.0 > we should not apply operator pushdown to data source v2 many times > -- > > Key: SPARK-24172 > URL: https://issues.apache.org/jira/browse/SPARK-24172 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20922) Unsafe deserialization in Spark LauncherConnection
[ https://issues.apache.org/jira/browse/SPARK-20922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472231#comment-16472231 ] Marcelo Vanzin commented on SPARK-20922: I think Spark 1.6 at this point is considered EOL by the community; there are no more planned releases for that line that I know of. > Unsafe deserialization in Spark LauncherConnection > -- > > Key: SPARK-20922 > URL: https://issues.apache.org/jira/browse/SPARK-20922 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.1.1 >Reporter: Aditya Sharad >Assignee: Marcelo Vanzin >Priority: Major > Labels: security > Fix For: 2.0.3, 2.1.2, 2.2.0, 2.3.0 > > Attachments: spark-deserialize-master.zip > > > The {{run()}} method of the class > {{org.apache.spark.launcher.LauncherConnection}} performs unsafe > deserialization of data received by its socket. This makes Spark applications > launched programmatically using the {{SparkLauncher}} framework potentially > vulnerable to remote code execution by an attacker with access to any user > account on the local machine. Such an attacker could send a malicious > serialized Java object to multiple ports on the local machine, and if this > port matches the one (randomly) chosen by the Spark launcher, the malicious > object will be deserialized. By making use of gadget chains in code present > on the Spark application classpath, the deserialization process can lead to > RCE or privilege escalation. > This vulnerability is identified by the “Unsafe deserialization” rule on > lgtm.com: > https://lgtm.com/projects/g/apache/spark/snapshot/80fdc2c9d1693f5b3402a79ca4ec76f6e422ff13/files/launcher/src/main/java/org/apache/spark/launcher/LauncherConnection.java#V58 > > Attached is a proof-of-concept exploit involving a simple > {{SparkLauncher}}-based application and a known gadget chain in the Apache > Commons Beanutils library referenced by Spark. > See the readme file for demonstration instructions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24220) java.lang.NullPointerException at org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:83)
[ https://issues.apache.org/jira/browse/SPARK-24220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472195#comment-16472195 ] Kazuaki Ishizaki commented on SPARK-24220: -- Thank you for reporting an issue. Would it be possible to post standalone reproduable program? This program seems to connect to an external database or something thru {{DriverManager.getConnection(adminUrl)}}. > java.lang.NullPointerException at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:83) > > > Key: SPARK-24220 > URL: https://issues.apache.org/jira/browse/SPARK-24220 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.2.0 >Reporter: joy-m >Priority: Major > > def getInputStream(rows:Iterator[Row]): PipedInputStream ={ > printMem("before gen string") > val pipedOutputStream = new PipedOutputStream() > (new Thread() { > override def run(){ > if(rows == null){ > logError("rows is null==>") > }else{ > println(s"record-start-${rows.length}") > try { > while (rows.hasNext) { > val row = rows.next() > println(row) > val str = row.mkString("\001") + "\r\n" > println(str) > pipedOutputStream.write(str.getBytes(StandardCharsets.UTF_8)) > } > println("record-end-") > pipedOutputStream.close() > } catch { > case ex:Exception => > ex.printStackTrace() > } > } > } > }).start() > println("pipedInPutStream--") > val pipedInPutStream = new PipedInputStream() > pipedInPutStream.connect(pipedOutputStream) > println("pipedInPutStream--- conn---") > printMem("after gen string") > pipedInPutStream > } > resDf.coalesce(15).foreachPartition(rows=>{ > if(rows == null){ > logError("rows is null=>") > }else{ > val copyCmd = s"COPY ${tableName} FROM STDIN with DELIMITER as '\001' NULL > as 'null string'" > var con: Connection = null > try { > con = DriverManager.getConnection(adminUrl) > val copyManager = new CopyManager(con.asInstanceOf[BaseConnection]) > val start = System.currentTimeMillis() > var count: Long = 0 > var copyCount: Long = 0 > println("before copyManager=>") > copyCount += copyManager.copyIn(copyCmd, getInputStream(rows)) > println("after copyManager=>") > val finish = System.currentTimeMillis() > println("copyCount:" + copyCount + " count:" + count + " time(s):" + (finish > - start) / 1000) > con.close() > } catch { > case ex:Exception => > ex.printStackTrace() > println(s"copyIn error!${ex.toString}") > } finally { > try { > if (con != null) { > con.close() > } > } catch { > case ex:SQLException => > ex.printStackTrace() > println(s"copyIn error!${ex.toString}") > } > } > } > > 18/05/09 13:31:30 ERROR util.SparkUncaughtExceptionHandler: Uncaught > exception in thread Thread[Thread-4,5,main] > java.lang.NullPointerException > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:83) > at org.apache.spark.sql.execution.SortExec.createSorter(SortExec.scala:87) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.init(Unknown > Source) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8.apply(WholeStageCodegenExec.scala:392) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8.apply(WholeStageCodegenExec.scala:389) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336) > at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334) > at > org.apache.spark.storage.BlockManager$
[jira] [Assigned] (SPARK-24228) Fix the lint error
[ https://issues.apache.org/jira/browse/SPARK-24228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24228: Assignee: Apache Spark > Fix the lint error > -- > > Key: SPARK-24228 > URL: https://issues.apache.org/jira/browse/SPARK-24228 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Minor > > [ERROR] > src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[21,8] > (imports) UnusedImports: Unused import - java.io.ByteArrayInputStream. > [ERROR] > src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java:[29,8] > (imports) UnusedImports: Unused import - org.apache.spark.unsafe.Platform. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24228) Fix the lint error
[ https://issues.apache.org/jira/browse/SPARK-24228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24228: Assignee: (was: Apache Spark) > Fix the lint error > -- > > Key: SPARK-24228 > URL: https://issues.apache.org/jira/browse/SPARK-24228 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Xiao Li >Priority: Minor > > [ERROR] > src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[21,8] > (imports) UnusedImports: Unused import - java.io.ByteArrayInputStream. > [ERROR] > src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java:[29,8] > (imports) UnusedImports: Unused import - org.apache.spark.unsafe.Platform. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24228) Fix the lint error
[ https://issues.apache.org/jira/browse/SPARK-24228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472187#comment-16472187 ] Apache Spark commented on SPARK-24228: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/21301 > Fix the lint error > -- > > Key: SPARK-24228 > URL: https://issues.apache.org/jira/browse/SPARK-24228 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Xiao Li >Priority: Minor > > [ERROR] > src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[21,8] > (imports) UnusedImports: Unused import - java.io.ByteArrayInputStream. > [ERROR] > src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java:[29,8] > (imports) UnusedImports: Unused import - org.apache.spark.unsafe.Platform. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23931) High-order function: zip(array1, array2[, ...]) → array
[ https://issues.apache.org/jira/browse/SPARK-23931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472169#comment-16472169 ] Marek Novotny commented on SPARK-23931: --- Ok. Good luck! > High-order function: zip(array1, array2[, ...]) → array > > > Key: SPARK-23931 > URL: https://issues.apache.org/jira/browse/SPARK-23931 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Merges the given arrays, element-wise, into a single array of rows. The M-th > element of the N-th argument will be the N-th field of the M-th output > element. If the arguments have an uneven length, missing values are filled > with NULL. > {noformat} > SELECT zip(ARRAY[1, 2], ARRAY['1b', null, '3b']); -- [ROW(1, '1b'), ROW(2, > null), ROW(null, '3b')] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23931) High-order function: zip(array1, array2[, ...]) → array
[ https://issues.apache.org/jira/browse/SPARK-23931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472161#comment-16472161 ] Dylan Guedes commented on SPARK-23931: -- Hi Marek! I finally get some progress, I think that more a few hours and I can complete this. Thank you! > High-order function: zip(array1, array2[, ...]) → array > > > Key: SPARK-23931 > URL: https://issues.apache.org/jira/browse/SPARK-23931 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Merges the given arrays, element-wise, into a single array of rows. The M-th > element of the N-th argument will be the N-th field of the M-th output > element. If the arguments have an uneven length, missing values are filled > with NULL. > {noformat} > SELECT zip(ARRAY[1, 2], ARRAY['1b', null, '3b']); -- [ROW(1, '1b'), ROW(2, > null), ROW(null, '3b')] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23931) High-order function: zip(array1, array2[, ...]) → array
[ https://issues.apache.org/jira/browse/SPARK-23931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472157#comment-16472157 ] Marek Novotny commented on SPARK-23931: --- [~DylanGuedes] Any joy? I can take this one if you want. > High-order function: zip(array1, array2[, ...]) → array > > > Key: SPARK-23931 > URL: https://issues.apache.org/jira/browse/SPARK-23931 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Merges the given arrays, element-wise, into a single array of rows. The M-th > element of the N-th argument will be the N-th field of the M-th output > element. If the arguments have an uneven length, missing values are filled > with NULL. > {noformat} > SELECT zip(ARRAY[1, 2], ARRAY['1b', null, '3b']); -- [ROW(1, '1b'), ROW(2, > null), ROW(null, '3b')] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22900) remove unnecessary restrict for streaming dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-22900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22900. --- Resolution: Not A Problem > remove unnecessary restrict for streaming dynamic allocation > > > Key: SPARK-22900 > URL: https://issues.apache.org/jira/browse/SPARK-22900 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 2.3.0 >Reporter: sharkd tu >Priority: Major > > When i set the conf `spark.streaming.dynamicAllocation.enabled=true`, the > conf `num-executors` can not be set. As a result, it will allocate default 2 > executors and all receivers will be run on this 2 executors, there may not be > redundant cpu cores for tasks. it will stuck all the time. > in my opinion, we should remove unnecessary restrict for streaming dynamic > allocation. we can set `num-executors` and > `spark.streaming.dynamicAllocation.enabled=true` together. when application > starts, each receiver will be run on an executor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22470) Doc that functions.hash is also used internally for shuffle and bucketing
[ https://issues.apache.org/jira/browse/SPARK-22470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22470. --- Resolution: Won't Fix > Doc that functions.hash is also used internally for shuffle and bucketing > - > > Key: SPARK-22470 > URL: https://issues.apache.org/jira/browse/SPARK-22470 > Project: Spark > Issue Type: Documentation > Components: Documentation, SQL >Affects Versions: 2.2.0 >Reporter: Andrew Ash >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-12480 added a hash function that > appears to be the same hash function as what Spark uses internally for > shuffle and bucketing. > One of my users would like to bake this assumption into code, but is unsure > if it's a guarantee or a coincidence that they're the same function. Would > it be considered an API break if at some point the two functions were > different, or if the implementation of both changed together? > We should add a line to the scaladoc to clarify. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13158) Show the information of broadcast blocks in WebUI
[ https://issues.apache.org/jira/browse/SPARK-13158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472114#comment-16472114 ] David Moravek commented on SPARK-13158: --- Hello, is there any reason this didn't get merged (I'd like to reopen this)? It would be really helpful for debugging. > Show the information of broadcast blocks in WebUI > - > > Key: SPARK-13158 > URL: https://issues.apache.org/jira/browse/SPARK-13158 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.6.0 >Reporter: Takeshi Yamamuro >Priority: Major > > This ticket targets a function to show the information of broadcast blocks, # > of blocks total size in mem/disk in a cluster. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8605) Exclude files in StreamingContext. textFileStream(directory)
[ https://issues.apache.org/jira/browse/SPARK-8605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8605. -- Resolution: Won't Fix > Exclude files in StreamingContext. textFileStream(directory) > > > Key: SPARK-8605 > URL: https://issues.apache.org/jira/browse/SPARK-8605 > Project: Spark > Issue Type: Improvement > Components: DStreams >Reporter: Noel Vo >Priority: Major > Labels: streaming, streaming_api > > Currenly, spark streaming can monitor a directory and it will process the > newly added files. This will cause a bug if the files copied to the directory > are big. For example, in hdfs, if a file is being copied, its name is > file_name._COPYING_. Spark will pick up the file and process. However, when > it's done copying the file, the file name becomes file_name. This would cause > FileDoesNotExist error. It would be great if we can exclude files using regex > in the directory. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24067) Backport SPARK-17147 to 2.3 (Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction))
[ https://issues.apache.org/jira/browse/SPARK-24067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24067: Assignee: Apache Spark (was: Cody Koeninger) > Backport SPARK-17147 to 2.3 (Spark Streaming Kafka 0.10 Consumer Can't Handle > Non-consecutive Offsets (i.e. Log Compaction)) > > > Key: SPARK-24067 > URL: https://issues.apache.org/jira/browse/SPARK-24067 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.3.0 >Reporter: Joachim Hereth >Assignee: Apache Spark >Priority: Major > > SPARK-17147 fixes a problem with non-consecutive Kafka Offsets. The [PR > w|https://github.com/apache/spark/pull/20572]as merged to {{master}}. This > should be backported to 2.3. > > Original Description from SPARK-17147 : > > When Kafka does log compaction offsets often end up with gaps, meaning the > next requested offset will be frequently not be offset+1. The logic in > KafkaRDD & CachedKafkaConsumer has a baked in assumption that the next offset > will always be just an increment of 1 above the previous offset. > I have worked around this problem by changing CachedKafkaConsumer to use the > returned record's offset, from: > {{nextOffset = offset + 1}} > to: > {{nextOffset = record.offset + 1}} > and changed KafkaRDD from: > {{requestOffset += 1}} > to: > {{requestOffset = r.offset() + 1}} > (I also had to change some assert logic in CachedKafkaConsumer). > There's a strong possibility that I have misconstrued how to use the > streaming kafka consumer, and I'm happy to close this out if that's the case. > If, however, it is supposed to support non-consecutive offsets (e.g. due to > log compaction) I am also happy to contribute a PR. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24067) Backport SPARK-17147 to 2.3 (Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction))
[ https://issues.apache.org/jira/browse/SPARK-24067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24067: Assignee: Cody Koeninger (was: Apache Spark) > Backport SPARK-17147 to 2.3 (Spark Streaming Kafka 0.10 Consumer Can't Handle > Non-consecutive Offsets (i.e. Log Compaction)) > > > Key: SPARK-24067 > URL: https://issues.apache.org/jira/browse/SPARK-24067 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.3.0 >Reporter: Joachim Hereth >Assignee: Cody Koeninger >Priority: Major > > SPARK-17147 fixes a problem with non-consecutive Kafka Offsets. The [PR > w|https://github.com/apache/spark/pull/20572]as merged to {{master}}. This > should be backported to 2.3. > > Original Description from SPARK-17147 : > > When Kafka does log compaction offsets often end up with gaps, meaning the > next requested offset will be frequently not be offset+1. The logic in > KafkaRDD & CachedKafkaConsumer has a baked in assumption that the next offset > will always be just an increment of 1 above the previous offset. > I have worked around this problem by changing CachedKafkaConsumer to use the > returned record's offset, from: > {{nextOffset = offset + 1}} > to: > {{nextOffset = record.offset + 1}} > and changed KafkaRDD from: > {{requestOffset += 1}} > to: > {{requestOffset = r.offset() + 1}} > (I also had to change some assert logic in CachedKafkaConsumer). > There's a strong possibility that I have misconstrued how to use the > streaming kafka consumer, and I'm happy to close this out if that's the case. > If, however, it is supposed to support non-consecutive offsets (e.g. due to > log compaction) I am also happy to contribute a PR. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24067) Backport SPARK-17147 to 2.3 (Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction))
[ https://issues.apache.org/jira/browse/SPARK-24067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472071#comment-16472071 ] Apache Spark commented on SPARK-24067: -- User 'koeninger' has created a pull request for this issue: https://github.com/apache/spark/pull/21300 > Backport SPARK-17147 to 2.3 (Spark Streaming Kafka 0.10 Consumer Can't Handle > Non-consecutive Offsets (i.e. Log Compaction)) > > > Key: SPARK-24067 > URL: https://issues.apache.org/jira/browse/SPARK-24067 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.3.0 >Reporter: Joachim Hereth >Assignee: Cody Koeninger >Priority: Major > > SPARK-17147 fixes a problem with non-consecutive Kafka Offsets. The [PR > w|https://github.com/apache/spark/pull/20572]as merged to {{master}}. This > should be backported to 2.3. > > Original Description from SPARK-17147 : > > When Kafka does log compaction offsets often end up with gaps, meaning the > next requested offset will be frequently not be offset+1. The logic in > KafkaRDD & CachedKafkaConsumer has a baked in assumption that the next offset > will always be just an increment of 1 above the previous offset. > I have worked around this problem by changing CachedKafkaConsumer to use the > returned record's offset, from: > {{nextOffset = offset + 1}} > to: > {{nextOffset = record.offset + 1}} > and changed KafkaRDD from: > {{requestOffset += 1}} > to: > {{requestOffset = r.offset() + 1}} > (I also had to change some assert logic in CachedKafkaConsumer). > There's a strong possibility that I have misconstrued how to use the > streaming kafka consumer, and I'm happy to close this out if that's the case. > If, however, it is supposed to support non-consecutive offsets (e.g. due to > log compaction) I am also happy to contribute a PR. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24179) History Server for Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-24179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472035#comment-16472035 ] Abhishek Rao commented on SPARK-24179: -- We have brought up Spark History Server on Kubernetes using following approach. 1) Prepare History-server docker container. a. Inherit Dockerfile from Kubespark/spark-base. b. Set Environment variable SPARK_NO_DAEMONIZE so that history-server can start as a Daemon. c. As part of Docker entry point, we’re invoking start-history-server.sh 2) Bringup Kubernetes pod using the docker image built in step 1 3) Bringup Kubernetes service for the histoty-server pod with type ClusterIP or NodePort. 4) If we bringup service with NodePort, History server can be accessed using NodeIP:NodePort 5) If we bringup service with ClusterIP, then we need to create Kubernetes ingress which will forward the request to the service brought up in step 3. Once we have the ingress up, history server can be accessed using Edge Node IP with the “path” specified in ingress. 6) Only limitation with ingress is that redirection is happening only when we use “/” in path. If we use any other string, redirection is not happening This is validated using spark binaries from apache-spark-on-k8s forked from spark 2.2 as well as apache spark 2.3. Attached are the screen shots for both versions of spark. > History Server for Kubernetes > - > > Key: SPARK-24179 > URL: https://issues.apache.org/jira/browse/SPARK-24179 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Eric Charles >Priority: Major > Attachments: Spark2_2_History_Server.PNG, Spark2_3_History_Server.PNG > > > The History server is missing when running on Kubernetes, with the side > effect we can not debug post-mortem or analyze after-the-fact. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24179) History Server for Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-24179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhishek Rao updated SPARK-24179: - Attachment: Spark2_3_History_Server.PNG > History Server for Kubernetes > - > > Key: SPARK-24179 > URL: https://issues.apache.org/jira/browse/SPARK-24179 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Eric Charles >Priority: Major > Attachments: Spark2_2_History_Server.PNG, Spark2_3_History_Server.PNG > > > The History server is missing when running on Kubernetes, with the side > effect we can not debug post-mortem or analyze after-the-fact. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24179) History Server for Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-24179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhishek Rao updated SPARK-24179: - Attachment: Spark2_2_History_Server.PNG > History Server for Kubernetes > - > > Key: SPARK-24179 > URL: https://issues.apache.org/jira/browse/SPARK-24179 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Eric Charles >Priority: Major > Attachments: Spark2_2_History_Server.PNG, Spark2_3_History_Server.PNG > > > The History server is missing when running on Kubernetes, with the side > effect we can not debug post-mortem or analyze after-the-fact. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24250) support accessing SQLConf inside tasks
[ https://issues.apache.org/jira/browse/SPARK-24250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24250: Assignee: Apache Spark (was: Wenchen Fan) > support accessing SQLConf inside tasks > -- > > Key: SPARK-24250 > URL: https://issues.apache.org/jira/browse/SPARK-24250 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24250) support accessing SQLConf inside tasks
[ https://issues.apache.org/jira/browse/SPARK-24250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16471891#comment-16471891 ] Apache Spark commented on SPARK-24250: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/21299 > support accessing SQLConf inside tasks > -- > > Key: SPARK-24250 > URL: https://issues.apache.org/jira/browse/SPARK-24250 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24250) support accessing SQLConf inside tasks
[ https://issues.apache.org/jira/browse/SPARK-24250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24250: Assignee: Wenchen Fan (was: Apache Spark) > support accessing SQLConf inside tasks > -- > > Key: SPARK-24250 > URL: https://issues.apache.org/jira/browse/SPARK-24250 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24250) support accessing SQLConf inside tasks
Wenchen Fan created SPARK-24250: --- Summary: support accessing SQLConf inside tasks Key: SPARK-24250 URL: https://issues.apache.org/jira/browse/SPARK-24250 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23128) A new approach to do adaptive execution in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-23128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16471812#comment-16471812 ] Li Yuanjian commented on SPARK-23128: - I collected some user cases and performance improve effect during Baidu internal usage of this patch, summarize as following 3 scenario: 1. SortMergeJoin to BroadcastJoin The SortMergeJoin transform to BroadcastJoin over deeply tree node can bring us {color:red}50% to 200%{color} boosting on query performance, and this strategy alway hit the BI scenario like join several tables with filter strategy in subquery 2. Long running application or use Spark as a service In this case, long running application refers to the duration of application near 1 hour. Using Spark as a service refers to use spark-shell and keep submit sql or use the service of Spark like Zeppelin, Livy or our internal sql service Baidu BigSQL. In such scenario, all spark jobs share same partition number, so enable AE and add configs about expected task info including data size, row number, min\max partition number and etc, will bring us {color:red}50%-100%{color} boosting on performance improvement. 3. GraphFrame jobs The last scenario is the application use GraphFrame, in this case, user has a 2-dimension graph with 1 billion edges, use the connected componentsalgorithm in GraphFrame. With enabling AE, the duration of app reduce from 58min to 32min, almost {color:red}100%{color} boosting on performance improvement. The detailed screenshot and config in the attached pdf. > A new approach to do adaptive execution in Spark SQL > > > Key: SPARK-23128 > URL: https://issues.apache.org/jira/browse/SPARK-23128 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Carson Wang >Priority: Major > Attachments: AdaptiveExecutioninBaidu.pdf > > > SPARK-9850 proposed the basic idea of adaptive execution in Spark. In > DAGScheduler, a new API is added to support submitting a single map stage. > The current implementation of adaptive execution in Spark SQL supports > changing the reducer number at runtime. An Exchange coordinator is used to > determine the number of post-shuffle partitions for a stage that needs to > fetch shuffle data from one or multiple stages. The current implementation > adds ExchangeCoordinator while we are adding Exchanges. However there are > some limitations. First, it may cause additional shuffles that may decrease > the performance. We can see this from EnsureRequirements rule when it adds > ExchangeCoordinator. Secondly, it is not a good idea to add > ExchangeCoordinators while we are adding Exchanges because we don’t have a > global picture of all shuffle dependencies of a post-shuffle stage. I.e. for > 3 tables’ join in a single stage, the same ExchangeCoordinator should be used > in three Exchanges but currently two separated ExchangeCoordinator will be > added. Thirdly, with the current framework it is not easy to implement other > features in adaptive execution flexibly like changing the execution plan and > handling skewed join at runtime. > We'd like to introduce a new way to do adaptive execution in Spark SQL and > address the limitations. The idea is described at > [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23128) A new approach to do adaptive execution in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-23128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Yuanjian updated SPARK-23128: Attachment: AdaptiveExecutioninBaidu.pdf > A new approach to do adaptive execution in Spark SQL > > > Key: SPARK-23128 > URL: https://issues.apache.org/jira/browse/SPARK-23128 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Carson Wang >Priority: Major > Attachments: AdaptiveExecutioninBaidu.pdf > > > SPARK-9850 proposed the basic idea of adaptive execution in Spark. In > DAGScheduler, a new API is added to support submitting a single map stage. > The current implementation of adaptive execution in Spark SQL supports > changing the reducer number at runtime. An Exchange coordinator is used to > determine the number of post-shuffle partitions for a stage that needs to > fetch shuffle data from one or multiple stages. The current implementation > adds ExchangeCoordinator while we are adding Exchanges. However there are > some limitations. First, it may cause additional shuffles that may decrease > the performance. We can see this from EnsureRequirements rule when it adds > ExchangeCoordinator. Secondly, it is not a good idea to add > ExchangeCoordinators while we are adding Exchanges because we don’t have a > global picture of all shuffle dependencies of a post-shuffle stage. I.e. for > 3 tables’ join in a single stage, the same ExchangeCoordinator should be used > in three Exchanges but currently two separated ExchangeCoordinator will be > added. Thirdly, with the current framework it is not easy to implement other > features in adaptive execution flexibly like changing the execution plan and > handling skewed join at runtime. > We'd like to introduce a new way to do adaptive execution in Spark SQL and > address the limitations. The idea is described at > [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-23128) A new approach to do adaptive execution in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-23128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Yuanjian updated SPARK-23128: Comment: was deleted (was: I collected some user cases and performance improve effect during Baidu internal usage of this patch, summarize as following 3 scenario: 1. SortMergeJoin to BroadcastJoin The SortMergeJoin transform to BroadcastJoin over deeply tree node can bring us {color:red}50% to 200%{color} boosting on query performance, and this strategy alway hit the BI scenario like join several tables with filter strategy in subquery 2. Long running application or use Spark as a service In this case, long running application refers to the duration of application near 1 hour. Using Spark as a service refers to use spark-shell and keep submit sql or use the service of Spark like Zeppelin, Livy or our internal sql service Baidu BigSQL. In such scenario, all spark jobs share same partition number, so enable AE and add configs about expected task info including data size, row number, min\max partition number and etc, will bring us {color:red}50%-100%{color} boosting on performance improvement. 3. GraphFrame jobs The last scenario is the application use GraphFrame, in this case, user has a 2-dimension graph with 1 billion edges, use the connected componentsalgorithm in GraphFrame. With enabling AE, the duration of app reduce from 58min to 32min, almost {color:red}100%{color} boosting on performance improvement. The detailed screenshot and config in the attached pdf. ) > A new approach to do adaptive execution in Spark SQL > > > Key: SPARK-23128 > URL: https://issues.apache.org/jira/browse/SPARK-23128 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Carson Wang >Priority: Major > Attachments: AdaptiveExecutioninBaidu.pdf > > > SPARK-9850 proposed the basic idea of adaptive execution in Spark. In > DAGScheduler, a new API is added to support submitting a single map stage. > The current implementation of adaptive execution in Spark SQL supports > changing the reducer number at runtime. An Exchange coordinator is used to > determine the number of post-shuffle partitions for a stage that needs to > fetch shuffle data from one or multiple stages. The current implementation > adds ExchangeCoordinator while we are adding Exchanges. However there are > some limitations. First, it may cause additional shuffles that may decrease > the performance. We can see this from EnsureRequirements rule when it adds > ExchangeCoordinator. Secondly, it is not a good idea to add > ExchangeCoordinators while we are adding Exchanges because we don’t have a > global picture of all shuffle dependencies of a post-shuffle stage. I.e. for > 3 tables’ join in a single stage, the same ExchangeCoordinator should be used > in three Exchanges but currently two separated ExchangeCoordinator will be > added. Thirdly, with the current framework it is not easy to implement other > features in adaptive execution flexibly like changing the execution plan and > handling skewed join at runtime. > We'd like to introduce a new way to do adaptive execution in Spark SQL and > address the limitations. The idea is described at > [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23128) A new approach to do adaptive execution in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-23128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16471811#comment-16471811 ] Li Yuanjian commented on SPARK-23128: - I collected some user cases and performance improve effect during Baidu internal usage of this patch, summarize as following 3 scenario: 1. SortMergeJoin to BroadcastJoin The SortMergeJoin transform to BroadcastJoin over deeply tree node can bring us {color:red}50% to 200%{color} boosting on query performance, and this strategy alway hit the BI scenario like join several tables with filter strategy in subquery 2. Long running application or use Spark as a service In this case, long running application refers to the duration of application near 1 hour. Using Spark as a service refers to use spark-shell and keep submit sql or use the service of Spark like Zeppelin, Livy or our internal sql service Baidu BigSQL. In such scenario, all spark jobs share same partition number, so enable AE and add configs about expected task info including data size, row number, min\max partition number and etc, will bring us {color:red}50%-100%{color} boosting on performance improvement. 3. GraphFrame jobs The last scenario is the application use GraphFrame, in this case, user has a 2-dimension graph with 1 billion edges, use the connected componentsalgorithm in GraphFrame. With enabling AE, the duration of app reduce from 58min to 32min, almost {color:red}100%{color} boosting on performance improvement. The detailed screenshot and config in the attached pdf. > A new approach to do adaptive execution in Spark SQL > > > Key: SPARK-23128 > URL: https://issues.apache.org/jira/browse/SPARK-23128 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Carson Wang >Priority: Major > Attachments: AdaptiveExecutioninBaidu.pdf > > > SPARK-9850 proposed the basic idea of adaptive execution in Spark. In > DAGScheduler, a new API is added to support submitting a single map stage. > The current implementation of adaptive execution in Spark SQL supports > changing the reducer number at runtime. An Exchange coordinator is used to > determine the number of post-shuffle partitions for a stage that needs to > fetch shuffle data from one or multiple stages. The current implementation > adds ExchangeCoordinator while we are adding Exchanges. However there are > some limitations. First, it may cause additional shuffles that may decrease > the performance. We can see this from EnsureRequirements rule when it adds > ExchangeCoordinator. Secondly, it is not a good idea to add > ExchangeCoordinators while we are adding Exchanges because we don’t have a > global picture of all shuffle dependencies of a post-shuffle stage. I.e. for > 3 tables’ join in a single stage, the same ExchangeCoordinator should be used > in three Exchanges but currently two separated ExchangeCoordinator will be > added. Thirdly, with the current framework it is not easy to implement other > features in adaptive execution flexibly like changing the execution plan and > handling skewed join at runtime. > We'd like to introduce a new way to do adaptive execution in Spark SQL and > address the limitations. The idea is described at > [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24182) Improve error message for client mode when AM fails
[ https://issues.apache.org/jira/browse/SPARK-24182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao reassigned SPARK-24182: --- Assignee: Marcelo Vanzin > Improve error message for client mode when AM fails > --- > > Key: SPARK-24182 > URL: https://issues.apache.org/jira/browse/SPARK-24182 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.4.0 > > > Today, when the client AM fails, there's not a lot of useful information > printed on the output. Depending on the type of failure, the information > provided by the YARN AM is also not very useful. For example, you'd see this > in the Spark shell: > {noformat} > 18/05/04 11:07:38 ERROR spark.SparkContext: Error initializing SparkContext. > org.apache.spark.SparkException: Yarn application has already ended! It might > have been killed or unable to launch application master. > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:86) > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:63) > at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164) > at org.apache.spark.SparkContext.(SparkContext.scala:500) > [long stack trace] > {noformat} > Similarly, on the YARN RM, for certain failures you see a generic error like > this: > {noformat} > ExitCodeException exitCode=10: at > org.apache.hadoop.util.Shell.runCommand(Shell.java:543) at > org.apache.hadoop.util.Shell.run(Shell.java:460) at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:720) at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:366) > at > [blah blah blah] > {noformat} > It would be nice if we could provide a more accurate description of what went > wrong when possible. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24182) Improve error message for client mode when AM fails
[ https://issues.apache.org/jira/browse/SPARK-24182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao resolved SPARK-24182. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21243 [https://github.com/apache/spark/pull/21243] > Improve error message for client mode when AM fails > --- > > Key: SPARK-24182 > URL: https://issues.apache.org/jira/browse/SPARK-24182 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.4.0 > > > Today, when the client AM fails, there's not a lot of useful information > printed on the output. Depending on the type of failure, the information > provided by the YARN AM is also not very useful. For example, you'd see this > in the Spark shell: > {noformat} > 18/05/04 11:07:38 ERROR spark.SparkContext: Error initializing SparkContext. > org.apache.spark.SparkException: Yarn application has already ended! It might > have been killed or unable to launch application master. > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:86) > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:63) > at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164) > at org.apache.spark.SparkContext.(SparkContext.scala:500) > [long stack trace] > {noformat} > Similarly, on the YARN RM, for certain failures you see a generic error like > this: > {noformat} > ExitCodeException exitCode=10: at > org.apache.hadoop.util.Shell.runCommand(Shell.java:543) at > org.apache.hadoop.util.Shell.run(Shell.java:460) at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:720) at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:366) > at > [blah blah blah] > {noformat} > It would be nice if we could provide a more accurate description of what went > wrong when possible. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20922) Unsafe deserialization in Spark LauncherConnection
[ https://issues.apache.org/jira/browse/SPARK-20922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16471600#comment-16471600 ] Ruslan Fialkovsky commented on SPARK-20922: --- I can't update Spark to 2.2. Will you make fix for Spark 1.6 branch ? > Unsafe deserialization in Spark LauncherConnection > -- > > Key: SPARK-20922 > URL: https://issues.apache.org/jira/browse/SPARK-20922 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.1.1 >Reporter: Aditya Sharad >Assignee: Marcelo Vanzin >Priority: Major > Labels: security > Fix For: 2.0.3, 2.1.2, 2.2.0, 2.3.0 > > Attachments: spark-deserialize-master.zip > > > The {{run()}} method of the class > {{org.apache.spark.launcher.LauncherConnection}} performs unsafe > deserialization of data received by its socket. This makes Spark applications > launched programmatically using the {{SparkLauncher}} framework potentially > vulnerable to remote code execution by an attacker with access to any user > account on the local machine. Such an attacker could send a malicious > serialized Java object to multiple ports on the local machine, and if this > port matches the one (randomly) chosen by the Spark launcher, the malicious > object will be deserialized. By making use of gadget chains in code present > on the Spark application classpath, the deserialization process can lead to > RCE or privilege escalation. > This vulnerability is identified by the “Unsafe deserialization” rule on > lgtm.com: > https://lgtm.com/projects/g/apache/spark/snapshot/80fdc2c9d1693f5b3402a79ca4ec76f6e422ff13/files/launcher/src/main/java/org/apache/spark/launcher/LauncherConnection.java#V58 > > Attached is a proof-of-concept exploit involving a simple > {{SparkLauncher}}-based application and a known gadget chain in the Apache > Commons Beanutils library referenced by Spark. > See the readme file for demonstration instructions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24249) Spark on kubernetes, pods crashes with spark sql job.
[ https://issues.apache.org/jira/browse/SPARK-24249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kaushik srinivas updated SPARK-24249: - Description: Below is the scenario being tested, Job : Spark sql job is written in scala, and to run on 1TB TPCDS BENCHMARK DATA which is in parquet,snappy format and hive tables created on top of it. Cluster manager : Kubernetes Spark sql configuration : Set 1 : spark.executor.heartbeatInterval 20s spark.executor.cores 4 spark.driver.cores 4 spark.driver.memory 15g spark.executor.memory 15g spark.cores.max 220 spark.rpc.numRetries 5 spark.rpc.retry.wait 5 spark.network.timeout 1800 spark.sql.broadcastTimeout 1200 spark.sql.crossJoin.enabled true spark.sql.starJoinOptimization true spark.eventLog.enabled true spark.eventLog.dir hdfs://namenodeHA/tmp/spark-history spark.sql.codegen true spark.kubernetes.allocation.batch.size 30 Set 2 : spark.executor.heartbeatInterval 20s spark.executor.cores 4 spark.driver.cores 4 spark.driver.memory 11g spark.driver.memoryOverhead 4g spark.executor.memory 11g spark.executor.memoryOverhead 4g spark.cores.max 220 spark.rpc.numRetries 5 spark.rpc.retry.wait 5 spark.network.timeout 1800 spark.sql.broadcastTimeout 1200 spark.sql.crossJoin.enabled true spark.sql.starJoinOptimization true spark.eventLog.enabled true spark.eventLog.dir hdfs://namenodeHA/tmp/spark-history spark.sql.codegen true spark.kubernetes.allocation.batch.size 30 Kryoserialiser is being used and with "spark.kryoserializer.buffer.mb" value of 64mb. 50 executors are being spawned using spark.executor.instances=50 submit argument. Issues Observed: Spark sql job is terminating abruptly and the drivers,executors are being killed randomly. driver and executors pods gets killed suddenly the job fails. Few different stack traces are found across different runs, Stack Trace 1: "2018-05-10 06:31:28 ERROR ContextCleaner:91 - Error cleaning broadcast 136 org.apache.spark.SparkException: Exception thrown in awaitResult: at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)" File attached : [^StackTrace1.txt] Stack Trace 2: "org.apache.spark.shuffle.FetchFailedException: Failed to connect to /192.178.1.105:38039^M at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)^M at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)" File attached : [^StackTrace2.txt] Stack Trace 3: "18/05/10 11:21:17 WARN KubernetesTaskSetManager: Lost task 3.0 in stage 48.0 (TID 16486, 192.178.1.35, executor 41): FetchFailed(null, shuffleId=29, mapId=-1, reduceId=3, message=^M org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 29^M at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:697)^M at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:693)" File attached : [^StackTrace3.txt] Stack Trace 4: "ERROR KubernetesTaskSchedulerImpl: Lost executor 11 on 192.178.1.123: Executor lost for unknown reasons." This is repeating constantly until the executors are dead completely without any stack traces. File attached : [^StackTrace4.txt] Also, we see 18/05/11 07:23:23 INFO DAGScheduler: failed: Set() what does this mean ? anything is wrong or it says failed set is empty that means no failure ? Observations or changes tried out : > Monitored memory and CPU utilisation across executors and none of them are > hitting the limits. > As per few readings and suggestions spark.network.timeout was increased to 1800 from 600, but did not help. > Also, driver and executor memory overhead was kept default in set 1 of the > config and it was 0.1*15g=1.5gb. Increased this value also, explicitly to 4gb and reduced driver and executor memory values to 11gb from 15gb as per set 2. this did not yield any valuable results, same failures are being observed. SparkSql is being used to run the queries, sample code lines : val qresult = spark.sql(q) qresult.show() No manual repartitioning is being done in the code. was: Below is the scenario being tested, Job : Spark sql job is written in scala, and to run on 1TB TPCDS BENCHMARK DATA which is in parquet,snappy format and hive tables created on top of it. Cluster manager : Kubernetes Spark sql configuration : Set 1 : spark.executor.heartbeatInterval 20s spark.executor.cores 4 spark.driver.cores 4 spark.driver.memory 15g spark.executor.memory 15g spark.cores.max 220 spark.rpc.numRetries 5 spark.rpc.retry.wait 5 spark.network.timeout 1800 spark.sql.broadcastTimeout 1200 spark.sql.crossJoin.enabled true spark.sql.starJoinOptimization true spark.eventLog.enabled true spark
[jira] [Updated] (SPARK-24249) Spark on kubernetes, pods crashes with spark sql job.
[ https://issues.apache.org/jira/browse/SPARK-24249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kaushik srinivas updated SPARK-24249: - Description: Below is the scenario being tested, Job : Spark sql job is written in scala, and to run on 1TB TPCDS BENCHMARK DATA which is in parquet,snappy format and hive tables created on top of it. Cluster manager : Kubernetes Spark sql configuration : Set 1 : spark.executor.heartbeatInterval 20s spark.executor.cores 4 spark.driver.cores 4 spark.driver.memory 15g spark.executor.memory 15g spark.cores.max 220 spark.rpc.numRetries 5 spark.rpc.retry.wait 5 spark.network.timeout 1800 spark.sql.broadcastTimeout 1200 spark.sql.crossJoin.enabled true spark.sql.starJoinOptimization true spark.eventLog.enabled true spark.eventLog.dir hdfs://namenodeHA/tmp/spark-history spark.sql.codegen true spark.kubernetes.allocation.batch.size 30 Set 2 : spark.executor.heartbeatInterval 20s spark.executor.cores 4 spark.driver.cores 4 spark.driver.memory 11g spark.driver.memoryOverhead 4g spark.executor.memory 11g spark.executor.memoryOverhead 4g spark.cores.max 220 spark.rpc.numRetries 5 spark.rpc.retry.wait 5 spark.network.timeout 1800 spark.sql.broadcastTimeout 1200 spark.sql.crossJoin.enabled true spark.sql.starJoinOptimization true spark.eventLog.enabled true spark.eventLog.dir hdfs://namenodeHA/tmp/spark-history spark.sql.codegen true spark.kubernetes.allocation.batch.size 30 Kryoserialiser is being used and with "spark.kryoserializer.buffer.mb" value of 64mb. 50 executors are being spawned using spark.executor.instances=50 submit argument. Issues Observed: Spark sql job is terminating abruptly and the drivers,executors are being killed randomly. driver and executors pods gets killed suddenly the job fails. Few different stack traces are found across different runs, Stack Trace 1: "2018-05-10 06:31:28 ERROR ContextCleaner:91 - Error cleaning broadcast 136 org.apache.spark.SparkException: Exception thrown in awaitResult: at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)" File attached : [^StackTrace1.txt] Stack Trace 2: "org.apache.spark.shuffle.FetchFailedException: Failed to connect to /192.178.1.105:38039^M at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)^M at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)" File attached : [^StackTrace2.txt] Stack Trace 3: "18/05/10 11:21:17 WARN KubernetesTaskSetManager: Lost task 3.0 in stage 48.0 (TID 16486, 192.178.1.35, executor 41): FetchFailed(null, shuffleId=29, mapId=-1, reduceId=3, message=^M org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 29^M at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:697)^M at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:693)" File attached : [^StackTrace3.txt] Stack Trace 4: "ERROR KubernetesTaskSchedulerImpl: Lost executor 11 on 192.178.1.123: Executor lost for unknown reasons." This is repeating constantly until the executors are dead completely without any stack traces. [^StackTrace4.txt] Also, we see 18/05/11 07:23:23 INFO DAGScheduler: failed: Set() what does this mean ? anything is wrong or it says failed set is empty that means no failure ? Observations or changes tried out : > Monitored memory and CPU utilisation across executors and none of them are > hitting the limits. > As per few readings and suggestions spark.network.timeout was increased to 1800 from 600, but did not help. > Also, driver and executor memory overhead was kept default in set 1 of the > config and it was 0.1*15g=1.5gb. Increased this value also, explicitly to 4gb and reduced driver and executor memory values to 11gb from 15gb as per set 2. this did not yield any valuable results, same failures are being observed. SparkSql is being used to run the queries, sample code lines : val qresult = spark.sql(q) qresult.show() No manual repartitioning is being done in the code. was: Below is the scenario being tested, Job : Spark sql job is written in scala, and to run on 1TB TPCDS BENCHMARK DATA which is in parquet,snappy format and hive tables created on top of it. Cluster manager : Kubernetes Spark sql configuration : Set 1 : spark.executor.heartbeatInterval 20s spark.executor.cores 4 spark.driver.cores 4 spark.driver.memory 15g spark.executor.memory 15g spark.cores.max 220 spark.rpc.numRetries 5 spark.rpc.retry.wait 5 spark.network.timeout 1800 spark.sql.broadcastTimeout 1200 spark.sql.crossJoin.enabled true spark.sql.starJoinOptimization true spark.eventLog.enabled true spark.eventLog.dir hdfs://namenodeHA/
[jira] [Updated] (SPARK-24249) Spark on kubernetes, pods crashes with spark sql job.
[ https://issues.apache.org/jira/browse/SPARK-24249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kaushik srinivas updated SPARK-24249: - Attachment: StackTrace4.txt StackTrace3.txt StackTrace2.txt StackTrace1.txt > Spark on kubernetes, pods crashes with spark sql job. > - > > Key: SPARK-24249 > URL: https://issues.apache.org/jira/browse/SPARK-24249 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.2.0 > Environment: Spark version : spark-2.2.0-k8s-0.5.0-bin-2.7.3 > Kubernetes version : Kubernetes 1.9.7 > Spark sql configuration : > Set 1 : > spark.executor.heartbeatInterval 20s > spark.executor.cores 4 > spark.driver.cores 4 > spark.driver.memory 15g > spark.executor.memory 15g > spark.cores.max 220 > spark.rpc.numRetries 5 > spark.rpc.retry.wait 5 > spark.network.timeout 1800 > spark.sql.broadcastTimeout 1200 > spark.sql.crossJoin.enabled true > spark.sql.starJoinOptimization true > spark.eventLog.enabled true > spark.eventLog.dir hdfs://namenodeHA/tmp/spark-history > spark.sql.codegen true > spark.kubernetes.allocation.batch.size 30 > Set 2 : > spark.executor.heartbeatInterval 20s > spark.executor.cores 4 > spark.driver.cores 4 > spark.driver.memory 11g > spark.driver.memoryOverhead 4g > spark.executor.memory 11g > spark.executor.memoryOverhead 4g > spark.cores.max 220 > spark.rpc.numRetries 5 > spark.rpc.retry.wait 5 > spark.network.timeout 1800 > spark.sql.broadcastTimeout 1200 > spark.sql.crossJoin.enabled true > spark.sql.starJoinOptimization true > spark.eventLog.enabled true > spark.eventLog.dir hdfs://namenodeHA/tmp/spark-history > spark.sql.codegen true > spark.kubernetes.allocation.batch.size 30 > Kryoserialiser is being used and with "spark.kryoserializer.buffer.mb" value > of 64mb. > 50 executors are being spawned using spark.executor.instances=50 submit > argument. >Reporter: kaushik srinivas >Priority: Major > Attachments: StackTrace1.txt, StackTrace2.txt, StackTrace3.txt, > StackTrace4.txt > > > Below is the scenario being tested, > Job : > Spark sql job is written in scala, and to run on 1TB TPCDS BENCHMARK DATA > which is in parquet,snappy format and hive tables created on top of it. > Cluster manager : > Kubernetes > Spark sql configuration : > Set 1 : > spark.executor.heartbeatInterval 20s > spark.executor.cores 4 > spark.driver.cores 4 > spark.driver.memory 15g > spark.executor.memory 15g > spark.cores.max 220 > spark.rpc.numRetries 5 > spark.rpc.retry.wait 5 > spark.network.timeout 1800 > spark.sql.broadcastTimeout 1200 > spark.sql.crossJoin.enabled true > spark.sql.starJoinOptimization true > spark.eventLog.enabled true > spark.eventLog.dir hdfs://namenodeHA/tmp/spark-history > spark.sql.codegen true > spark.kubernetes.allocation.batch.size 30 > Set 2 : > spark.executor.heartbeatInterval 20s > spark.executor.cores 4 > spark.driver.cores 4 > spark.driver.memory 11g > spark.driver.memoryOverhead 4g > spark.executor.memory 11g > spark.executor.memoryOverhead 4g > spark.cores.max 220 > spark.rpc.numRetries 5 > spark.rpc.retry.wait 5 > spark.network.timeout 1800 > spark.sql.broadcastTimeout 1200 > spark.sql.crossJoin.enabled true > spark.sql.starJoinOptimization true > spark.eventLog.enabled true > spark.eventLog.dir hdfs://namenodeHA/tmp/spark-history > spark.sql.codegen true > spark.kubernetes.allocation.batch.size 30 > Kryoserialiser is being used and with "spark.kryoserializer.buffer.mb" value > of 64mb. > 50 executors are being spawned using spark.executor.instances=50 submit > argument. > Issues Observed: > Spark sql job is terminating abruptly and the drivers,executors are being > killed randomly. > driver and executors pods gets killed suddenly the job fails. > Few different stack traces are found across different runs, > Stack Trace 1: > "2018-05-10 06:31:28 ERROR ContextCleaner:91 - Error cleaning broadcast 136 > org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)" > File attached : StackTrace1.txt > Stack Trace 2: > "org.apache.spark.shuffle.FetchFailedException: Failed to connect to > /192.178.1.105:38039^M > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)^M > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)" > File attached : StackTrace2.txt > Stack Trace 3: > "18/05/10 11:21:17 WARN KubernetesTaskSetManager: Lost task 3.0 in stage 48.0 > (TID 16486, 192.178.1.35, executor 41): FetchFailed(null, shuffleId=29, > mapId=-1, reduceId=3, message=^M > org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output > location for shuffle 29
[jira] [Created] (SPARK-24249) Spark on kubernetes, pods crashes with spark sql job.
kaushik srinivas created SPARK-24249: Summary: Spark on kubernetes, pods crashes with spark sql job. Key: SPARK-24249 URL: https://issues.apache.org/jira/browse/SPARK-24249 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 2.2.0 Environment: Spark version : spark-2.2.0-k8s-0.5.0-bin-2.7.3 Kubernetes version : Kubernetes 1.9.7 Spark sql configuration : Set 1 : spark.executor.heartbeatInterval 20s spark.executor.cores 4 spark.driver.cores 4 spark.driver.memory 15g spark.executor.memory 15g spark.cores.max 220 spark.rpc.numRetries 5 spark.rpc.retry.wait 5 spark.network.timeout 1800 spark.sql.broadcastTimeout 1200 spark.sql.crossJoin.enabled true spark.sql.starJoinOptimization true spark.eventLog.enabled true spark.eventLog.dir hdfs://namenodeHA/tmp/spark-history spark.sql.codegen true spark.kubernetes.allocation.batch.size 30 Set 2 : spark.executor.heartbeatInterval 20s spark.executor.cores 4 spark.driver.cores 4 spark.driver.memory 11g spark.driver.memoryOverhead 4g spark.executor.memory 11g spark.executor.memoryOverhead 4g spark.cores.max 220 spark.rpc.numRetries 5 spark.rpc.retry.wait 5 spark.network.timeout 1800 spark.sql.broadcastTimeout 1200 spark.sql.crossJoin.enabled true spark.sql.starJoinOptimization true spark.eventLog.enabled true spark.eventLog.dir hdfs://namenodeHA/tmp/spark-history spark.sql.codegen true spark.kubernetes.allocation.batch.size 30 Kryoserialiser is being used and with "spark.kryoserializer.buffer.mb" value of 64mb. 50 executors are being spawned using spark.executor.instances=50 submit argument. Reporter: kaushik srinivas Below is the scenario being tested, Job : Spark sql job is written in scala, and to run on 1TB TPCDS BENCHMARK DATA which is in parquet,snappy format and hive tables created on top of it. Cluster manager : Kubernetes Spark sql configuration : Set 1 : spark.executor.heartbeatInterval 20s spark.executor.cores 4 spark.driver.cores 4 spark.driver.memory 15g spark.executor.memory 15g spark.cores.max 220 spark.rpc.numRetries 5 spark.rpc.retry.wait 5 spark.network.timeout 1800 spark.sql.broadcastTimeout 1200 spark.sql.crossJoin.enabled true spark.sql.starJoinOptimization true spark.eventLog.enabled true spark.eventLog.dir hdfs://namenodeHA/tmp/spark-history spark.sql.codegen true spark.kubernetes.allocation.batch.size 30 Set 2 : spark.executor.heartbeatInterval 20s spark.executor.cores 4 spark.driver.cores 4 spark.driver.memory 11g spark.driver.memoryOverhead 4g spark.executor.memory 11g spark.executor.memoryOverhead 4g spark.cores.max 220 spark.rpc.numRetries 5 spark.rpc.retry.wait 5 spark.network.timeout 1800 spark.sql.broadcastTimeout 1200 spark.sql.crossJoin.enabled true spark.sql.starJoinOptimization true spark.eventLog.enabled true spark.eventLog.dir hdfs://namenodeHA/tmp/spark-history spark.sql.codegen true spark.kubernetes.allocation.batch.size 30 Kryoserialiser is being used and with "spark.kryoserializer.buffer.mb" value of 64mb. 50 executors are being spawned using spark.executor.instances=50 submit argument. Issues Observed: Spark sql job is terminating abruptly and the drivers,executors are being killed randomly. driver and executors pods gets killed suddenly the job fails. Few different stack traces are found across different runs, Stack Trace 1: "2018-05-10 06:31:28 ERROR ContextCleaner:91 - Error cleaning broadcast 136 org.apache.spark.SparkException: Exception thrown in awaitResult: at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)" File attached : StackTrace1.txt Stack Trace 2: "org.apache.spark.shuffle.FetchFailedException: Failed to connect to /192.178.1.105:38039^M at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)^M at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)" File attached : StackTrace2.txt Stack Trace 3: "18/05/10 11:21:17 WARN KubernetesTaskSetManager: Lost task 3.0 in stage 48.0 (TID 16486, 192.178.1.35, executor 41): FetchFailed(null, shuffleId=29, mapId=-1, reduceId=3, message=^M org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 29^M at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:697)^M at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:693)" File attached : StackTrace3.txt Stack Trace 4: "ERROR KubernetesTaskSchedulerImpl: Lost executor 11 on 192.178.1.123: Executor lost for unknown reasons." This is repeating constantly until the executors are dead completely without any stack traces. Also, we see 18/05/11 07:23:23 INFO DAGScheduler: failed: Set() what does this mean ? anything is