[jira] [Updated] (SPARK-28090) Spark hangs when an execution plan has many projections on nested structs
[ https://issues.apache.org/jira/browse/SPARK-28090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruslan Yushchenko updated SPARK-28090:
--
Description:
This was already posted (#28016), but the provided example didn't always reproduce the error. This example consistently reproduces the issue.

Spark applications freeze at the execution plan optimization stage (Catalyst) when a logical execution plan contains many projections that operate on nested struct fields. The code listed below demonstrates the issue. To reproduce, the Spark app does the following:
* A small dataframe is created from a JSON example.
* Several nested transformations (negation of a number) are applied to struct fields, and each time a new struct field is created.
* Once more than 9 such transformations are applied, the Catalyst optimizer freezes while optimizing the execution plan.
* You can control the freezing by choosing a different upper bound for the Range. E.g. it will work fine if the upper bound is 5, but will hang if the bound is 10.
{code:java}
package com.example

import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{StructField, StructType}

import scala.collection.mutable.ListBuffer

object SparkApp1IssueSelfContained {

  // A sample data for a dataframe with nested structs
  val sample: List[String] =
    """ { "numerics": {"num1": 101, "num2": 102, "num3": 103, "num4": 104, "num5": 105, "num6": 106, "num7": 107, "num8": 108, "num9": 109, "num10": 110, "num11": 111, "num12": 112, "num13": 113, "num14": 114, "num15": 115} } """ ::
    """ { "numerics": {"num1": 201, "num2": 202, "num3": 203, "num4": 204, "num5": 205, "num6": 206, "num7": 207, "num8": 208, "num9": 209, "num10": 210, "num11": 211, "num12": 212, "num13": 213, "num14": 214, "num15": 215} } """ ::
    """ { "numerics": {"num1": 301, "num2": 302, "num3": 303, "num4": 304, "num5": 305, "num6": 306, "num7": 307, "num8": 308, "num9": 309, "num10": 310, "num11": 311, "num12": 312, "num13": 313, "num14": 314, "num15": 315} } """ ::
    Nil

  /**
    * Transforms a column inside a nested struct. The transformed value will be put into a new field of that nested struct.
    *
    * The output column name can omit the full path as the field will be created at the same level of nesting as the input column.
    *
    * @param inputColumnName  A column name for which to apply the transformation, e.g. `company.employee.firstName`.
    * @param outputColumnName The output column name. The path is optional, e.g. you can use `transformedName` instead of `company.employee.transformedName`.
    * @param expression       A function that applies a transformation to a column as a Spark expression.
    * @return A dataframe with a new field that contains transformed values.
    */
  def transformInsideNestedStruct(df: DataFrame,
                                  inputColumnName: String,
                                  outputColumnName: String,
                                  expression: Column => Column): DataFrame = {
    def mapStruct(schema: StructType, path: Seq[String], parentColumn: Option[Column] = None): Seq[Column] = {
      val mappedFields = new ListBuffer[Column]()

      def handleMatchedLeaf(field: StructField, curColumn: Column): Seq[Column] = {
        val newColumn = expression(curColumn).as(outputColumnName)
        mappedFields += newColumn
        Seq(curColumn)
      }

      def handleMatchedNonLeaf(field: StructField, curColumn: Column): Seq[Column] = {
        // Non-leaf columns need to be further processed recursively
        field.dataType match {
          case dt: StructType => Seq(struct(mapStruct(dt, path.tail, Some(curColumn)): _*).as(field.name))
          case _ => throw new IllegalArgumentException(s"Field '${field.name}' is not a struct type.")
        }
      }

      val fieldName = path.head
      val isLeaf = path.lengthCompare(2) < 0

      val newColumns = schema.fields.flatMap(field => {
        // This is the original column (struct field) we want to process
        val curColumn = parentColumn match {
          case None => new Column(field.name)
          case Some(col) => col.getField(field.name).as(field.name)
        }

        if (field.name.compareToIgnoreCase(fieldName) != 0) {
          // Copy unrelated fields as they were
          Seq(curColumn)
        } else {
          // We have found a match
          if (isLeaf) {
            handleMatchedLeaf(field, curColumn)
          } else {
            handleMatchedNonLeaf(field, curColumn)
          }
        }
      })

      newColumns ++ mappedFields
    }

    val schema = df.schema
    val path = inputColumnName.split('.')
    df.select(mapStruct(schema, path): _*)
  }

  /**
    * This Spark Job demonstrates an issue of execution plan freezing when there are a lot of projections
    * involving nested structs in an execution
[jira] [Created] (SPARK-28090) Spark hangs when an execution plan has many projections on nested structs
Ruslan Yushchenko created SPARK-28090:
--
Summary: Spark hangs when an execution plan has many projections on nested structs
Key: SPARK-28090
URL: https://issues.apache.org/jira/browse/SPARK-28090
Project: Spark
Issue Type: Bug
Components: Optimizer
Affects Versions: 2.4.3
Environment: Tried in
* Spark 2.2.1, Spark 2.4.3 in local mode on Linux, macOS and Windows
* Spark 2.4.3 / Yarn on a Linux cluster
Reporter: Ruslan Yushchenko

This was already posted (#28016), but the provided example didn't always reproduce the error.

Spark applications freeze at the execution plan optimization stage (Catalyst) when a logical execution plan contains many projections that operate on nested struct fields. The code in the issue description above demonstrates the issue. To reproduce, the Spark app does the following:
* A small dataframe is created from a JSON example.
* Several nested transformations (negation of a number) are applied to struct fields, and each time a new struct field is created.
* Once more than 9 such transformations are applied, the Catalyst optimizer freezes while optimizing the execution plan.
* You can control the freezing by choosing a different upper bound for the Range. E.g. it will work fine if the upper bound is 5, but will hang if the bound is 10.
[jira] [Updated] (SPARK-28088) String Functions: Enhance LPAD/RPAD function
[ https://issues.apache.org/jira/browse/SPARK-28088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28088:
Description: Enhance the LPAD/RPAD functions to make the {{pad}} parameter optional.

> String Functions: Enhance LPAD/RPAD function
> ---------------------------------------------
>
> Key: SPARK-28088
> URL: https://issues.apache.org/jira/browse/SPARK-28088
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Yuming Wang
> Priority: Major
>
> Enhance the LPAD/RPAD functions to make the {{pad}} parameter optional.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
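For context, the desired behavior mirrors PostgreSQL, where the pad string defaults to a single space and the input is truncated when it is already longer than the target length. A minimal sketch of those semantics in plain Scala (the helper name is ours, not Spark's implementation):

```scala
// PostgreSQL-style lpad with an optional pad parameter (illustrative sketch):
// - pads `s` on the left up to `len` characters, repeating `pad` as needed
// - truncates `s` to `len` when it is already longer
def lpad(s: String, len: Int, pad: String = " "): String = {
  require(pad.nonEmpty, "pad must not be empty")
  if (s.length >= len) s.take(len)
  else (pad * (len / pad.length + 1)).take(len - s.length) + s
}
```

For example, `lpad("hi", 5)` yields `"   hi"` and `lpad("hi", 5, "xy")` yields `"xyxhi"`, matching PostgreSQL's documented behavior.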
[jira] [Assigned] (SPARK-28089) File source v2: support reading output of file streaming Sink
[ https://issues.apache.org/jira/browse/SPARK-28089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28089:
Assignee: Apache Spark

> File source v2: support reading output of file streaming Sink
> --------------------------------------------------------------
>
> Key: SPARK-28089
> URL: https://issues.apache.org/jira/browse/SPARK-28089
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Gengliang Wang
> Assignee: Apache Spark
> Priority: Major
>
> File source V1 supports reading output of FileStreamSink as batch.
> https://github.com/apache/spark/pull/11897
> We should support this in file source V2 as well. When reading with paths, we first check if there is a metadata log of FileStreamSink. If yes, we use `MetadataLogFileIndex` for listing files; otherwise, we use `InMemoryFileIndex`.
[jira] [Assigned] (SPARK-28089) File source v2: support reading output of file streaming Sink
[ https://issues.apache.org/jira/browse/SPARK-28089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28089:
Assignee: (was: Apache Spark)

> File source v2: support reading output of file streaming Sink
> --------------------------------------------------------------
>
> Key: SPARK-28089
> URL: https://issues.apache.org/jira/browse/SPARK-28089
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Gengliang Wang
> Priority: Major
>
> File source V1 supports reading output of FileStreamSink as batch.
> https://github.com/apache/spark/pull/11897
> We should support this in file source V2 as well. When reading with paths, we first check if there is a metadata log of FileStreamSink. If yes, we use `MetadataLogFileIndex` for listing files; otherwise, we use `InMemoryFileIndex`.
[jira] [Created] (SPARK-28089) File source v2: support reading output of file streaming Sink
Gengliang Wang created SPARK-28089:
--
Summary: File source v2: support reading output of file streaming Sink
Key: SPARK-28089
URL: https://issues.apache.org/jira/browse/SPARK-28089
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 3.0.0
Reporter: Gengliang Wang

File source V1 supports reading output of FileStreamSink as batch.
https://github.com/apache/spark/pull/11897

We should support this in file source V2 as well. When reading with paths, we first check if there is a metadata log of FileStreamSink. If yes, we use `MetadataLogFileIndex` for listing files; otherwise, we use `InMemoryFileIndex`.
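The decision described above can be sketched in plain Scala. FileStreamSink keeps its metadata log in a `_spark_metadata` directory under the output path; the helper name and the string return values here are ours, for illustration only:

```scala
import java.nio.file.{Files, Path, Paths}

// Sketch of the listing-strategy choice described above: if the output path
// contains a FileStreamSink metadata log (the `_spark_metadata` directory),
// a reader should use MetadataLogFileIndex; otherwise it falls back to a
// plain directory listing (InMemoryFileIndex). Function name is illustrative.
def chooseFileIndex(outputPath: String): String = {
  val metadataDir: Path = Paths.get(outputPath, "_spark_metadata")
  if (Files.isDirectory(metadataDir)) "MetadataLogFileIndex"
  else "InMemoryFileIndex"
}
```

The point of the check is that a directory written by a streaming query may contain uncommitted or partial files; only the metadata log says which files belong to committed batches.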
[jira] [Assigned] (SPARK-28088) String Functions: Enhance LPAD/RPAD function
[ https://issues.apache.org/jira/browse/SPARK-28088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28088:
Assignee: Apache Spark

> String Functions: Enhance LPAD/RPAD function
> ---------------------------------------------
>
> Key: SPARK-28088
> URL: https://issues.apache.org/jira/browse/SPARK-28088
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Yuming Wang
> Assignee: Apache Spark
> Priority: Major
[jira] [Assigned] (SPARK-28088) String Functions: Enhance LPAD/RPAD function
[ https://issues.apache.org/jira/browse/SPARK-28088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28088:
Assignee: (was: Apache Spark)

> String Functions: Enhance LPAD/RPAD function
> ---------------------------------------------
>
> Key: SPARK-28088
> URL: https://issues.apache.org/jira/browse/SPARK-28088
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Yuming Wang
> Priority: Major
[jira] [Created] (SPARK-28088) String Functions: Enhance LPAD/RPAD function
Yuming Wang created SPARK-28088:
---
Summary: String Functions: Enhance LPAD/RPAD function
Key: SPARK-28088
URL: https://issues.apache.org/jira/browse/SPARK-28088
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang
[jira] [Resolved] (SPARK-28058) Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
[ https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28058.
--
Resolution: Fixed
Fix Version/s: 2.4.4
               3.0.0

Issue resolved by pull request 24894
[https://github.com/apache/spark/pull/24894]

> Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
> -----------------------------------------------------------------------
>
> Key: SPARK-28058
> URL: https://issues.apache.org/jira/browse/SPARK-28058
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.1, 2.4.3
> Reporter: Stuart White
> Assignee: Liang-Chi Hsieh
> Priority: Minor
> Labels: CSV, csv, csvparser
> Fix For: 3.0.0, 2.4.4
>
> The spark sql csv reader is not dropping malformed records as expected.
> Consider this file (fruit.csv). Notice it contains a header record, 3 valid records, and one malformed record.
> {noformat}
> fruit,color,price,quantity
> apple,red,1,3
> banana,yellow,2,4
> orange,orange,3,5
> xxx
> {noformat}
> If I read this file using the spark sql csv reader as follows, everything looks good. The malformed record is dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").show(truncate=false)
> +------+------+-----+--------+
> |fruit |color |price|quantity|
> +------+------+-----+--------+
> |apple |red   |1    |3       |
> |banana|yellow|2    |4       |
> |orange|orange|3    |5       |
> +------+------+-----+--------+
> {noformat}
> However, if I select a subset of the columns, the malformed record is not dropped. The malformed data is placed in the first column, and the remaining column(s) are filled with nulls.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
> +------+
> |fruit |
> +------+
> |apple |
> |banana|
> |orange|
> |xxx   |
> +------+
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
> +------+------+
> |fruit |color |
> +------+------+
> |apple |red   |
> |banana|yellow|
> |orange|orange|
> |xxx   |null  |
> +------+------+
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price).show(truncate=false)
> +------+------+-----+
> |fruit |color |price|
> +------+------+-----+
> |apple |red   |1    |
> |banana|yellow|2    |
> |orange|orange|3    |
> |xxx   |null  |null |
> +------+------+-----+
> {noformat}
> And finally, if I manually select all of the columns, the malformed record is once again dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price, 'quantity).show(truncate=false)
> +------+------+-----+--------+
> |fruit |color |price|quantity|
> +------+------+-----+--------+
> |apple |red   |1    |3       |
> |banana|yellow|2    |4       |
> |orange|orange|3    |5       |
> +------+------+-----+--------+
> {noformat}
> I would expect the malformed record(s) to be dropped regardless of which columns are being selected from the file.
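A toy model (not Spark's actual parser) of why column pruning and DROPMALFORMED interact this way: if a row is judged malformed only against the columns the query actually requests, a short row can look valid for a narrow projection. All names below are illustrative:

```scala
// Toy CSV "reader": a row counts as malformed when it has fewer fields than
// the number of columns the query needs. Selecting fewer columns makes the
// short row "xxx" pass the check, mirroring the behavior reported above.
def readDropMalformed(rows: Seq[String], requiredCols: Int): Seq[Seq[String]] =
  rows.map(_.split(",", -1).toSeq)
      .filter(_.length >= requiredCols)
      .map(_.take(requiredCols))
```

With all four columns required, the `xxx` row is dropped; pruned to one column, it survives, just as in the `show()` outputs above.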
[jira] [Assigned] (SPARK-28058) Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
[ https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-28058:
Assignee: Liang-Chi Hsieh

> Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
> -----------------------------------------------------------------------
>
> Key: SPARK-28058
> URL: https://issues.apache.org/jira/browse/SPARK-28058
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.1, 2.4.3
> Reporter: Stuart White
> Assignee: Liang-Chi Hsieh
> Priority: Minor
> Labels: CSV, csv, csvparser
[jira] [Updated] (SPARK-28072) Fix IncompatibleClassChangeError in `FromUnixTime` codegen on JDK9+
[ https://issues.apache.org/jira/browse/SPARK-28072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28072:
--
Summary: Fix IncompatibleClassChangeError in `FromUnixTime` codegen on JDK9+ (was: Use `Iso8601TimestampFormatter` in `FromUnixTime` codegen to fix `IncompatibleClassChangeError` in JDK9+)

> Fix IncompatibleClassChangeError in `FromUnixTime` codegen on JDK9+
> --------------------------------------------------------------------
>
> Key: SPARK-28072
> URL: https://issues.apache.org/jira/browse/SPARK-28072
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Dongjoon Hyun
> Priority: Major
>
> With JDK9+, the generated bytecode of `FromUnixTime` raises `java.lang.IncompatibleClassChangeError` due to [JDK-8145148|https://bugs.openjdk.java.net/browse/JDK-8145148].
> This is a blocker in our JDK11 Jenkins job.
[jira] [Resolved] (SPARK-28056) Document SCALAR_ITER Pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-28056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-28056.
---
Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24897
[https://github.com/apache/spark/pull/24897]

> Document SCALAR_ITER Pandas UDF
> --------------------------------
>
> Key: SPARK-28056
> URL: https://issues.apache.org/jira/browse/SPARK-28056
> Project: Spark
> Issue Type: Documentation
> Components: Documentation, PySpark
> Affects Versions: 3.0.0
> Reporter: Xiangrui Meng
> Assignee: Xiangrui Meng
> Priority: Major
> Fix For: 3.0.0
>
> After SPARK-26412, we should document the new SCALAR_ITER Pandas UDF so users can discover the feature and learn how to use it.
[jira] [Created] (SPARK-28087) String Functions: Add support split_part
Yuming Wang created SPARK-28087:
---
Summary: String Functions: Add support split_part
Key: SPARK-28087
URL: https://issues.apache.org/jira/browse/SPARK-28087
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang

||Function||Return Type||Description||Example||Result||
|{{split_part(_string_ text, _delimiter_ text, _field_ int)}}|{{text}}|Split _string_ on _delimiter_ and return the given field (counting from one)|{{split_part('abc~@~def~@~ghi', '~@~', 2)}}|{{def}}|

https://www.postgresql.org/docs/11/functions-string.html
http://prestodb.github.io/docs/current/functions/string.html
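The PostgreSQL semantics in the table above can be sketched in plain Scala (the function name is ours; fields count from one, and asking for a field past the last one yields an empty string as in PostgreSQL, while handling of non-positive fields is simplified here):

```scala
import java.util.regex.Pattern

// split_part('abc~@~def~@~ghi', '~@~', 2) = 'def'. The delimiter is a literal
// string, not a regex, hence Pattern.quote; splitting with limit -1 keeps
// trailing empty fields.
def splitPart(s: String, delimiter: String, field: Int): String = {
  val parts = s.split(Pattern.quote(delimiter), -1)
  if (field >= 1 && field <= parts.length) parts(field - 1) else ""
}
```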
[jira] [Assigned] (SPARK-27666) Do not release lock while TaskContext already completed
[ https://issues.apache.org/jira/browse/SPARK-27666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-27666:
---
Assignee: wuyi

> Do not release lock while TaskContext already completed
> --------------------------------------------------------
>
> Key: SPARK-27666
> URL: https://issues.apache.org/jira/browse/SPARK-27666
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Wenchen Fan
> Assignee: wuyi
> Priority: Major
> Fix For: 3.0.0
>
> {code:java}
> Exception in thread "Thread-14" java.lang.AssertionError: assertion failed: Block rdd_0_0 is not locked for reading
> at scala.Predef$.assert(Predef.scala:223)
> at org.apache.spark.storage.BlockInfoManager.unlock(BlockInfoManager.scala:299)
> at org.apache.spark.storage.BlockManager.releaseLock(BlockManager.scala:1000)
> at org.apache.spark.storage.BlockManager.$anonfun$getLocalValues$5(BlockManager.scala:746)
> at org.apache.spark.util.CompletionIterator$$anon$1.completion(CompletionIterator.scala:47)
> at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:36)
> at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at org.apache.spark.rdd.RDDSuite.$anonfun$new$265(RDDSuite.scala:1185)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> We're facing an issue reported by SPARK-18406 and SPARK-25139. [https://github.com/apache/spark/pull/24542] bypassed the issue by capturing the assertion error to avoid failing the executor. However, when not using pyspark, the issue still exists when a user implements a custom RDD (https://issues.apache.org/jira/browse/SPARK-18406?focusedCommentId=15969384&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15969384) or task (see demo below), which spawns a separate thread to consume an iterator from a cached parent RDD.
> {code:java}
> val rdd0 = sc.parallelize(Range(0, 10), 1).cache()
> rdd0.collect()
> rdd0.mapPartitions { iter =>
>   val t = new Thread(new Runnable {
>     override def run(): Unit = {
>       while (iter.hasNext) {
>         println(iter.next())
>         Thread.sleep(1000)
>       }
>     }
>   })
>   t.setDaemon(false)
>   t.start()
>   Iterator(0)
> }.collect()
> {code}
> We can easily reproduce the issue using the demo above.
> If we could prevent the separate thread from releasing the lock on the block when the TaskContext has already completed, then we won't hit this issue again.
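The proposed guard can be modeled with a toy lock (an illustration of the idea only, not Spark's BlockInfoManager): once the owning task is marked completed, a release coming from a leaked thread becomes a no-op instead of tripping the assertion shown in the stack trace above.

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Toy read lock standing in for a block lock. `release` skips the unlock
// (and the assertion) when the owning task has already completed, which is
// exactly the situation the leaked iterator thread creates.
final class ToyBlockLock {
  private val locked = new AtomicBoolean(false)
  def acquire(): Unit = locked.set(true)
  def release(taskCompleted: Boolean): Unit = {
    if (taskCompleted) return // task finished; its locks were already cleaned up
    assert(locked.get(), "Block is not locked for reading")
    locked.set(false)
  }
}
```

Without the `taskCompleted` check, the second release from the surviving thread would hit the assertion, just like `BlockInfoManager.unlock` does in the reported trace.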
[jira] [Resolved] (SPARK-27666) Do not release lock while TaskContext already completed
[ https://issues.apache.org/jira/browse/SPARK-27666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27666.
--
Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24699
[https://github.com/apache/spark/pull/24699]

> Do not release lock while TaskContext already completed
> --------------------------------------------------------
>
> Key: SPARK-27666
> URL: https://issues.apache.org/jira/browse/SPARK-27666
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Wenchen Fan
> Priority: Major
> Fix For: 3.0.0
[jira] [Updated] (SPARK-27930) List all built-in UDFs have different names
[ https://issues.apache.org/jira/browse/SPARK-27930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27930:
Description:
This ticket lists all built-in UDFs that have different names:
||PostgreSQL||Spark SQL||Note||
|random|rand| |
|format|format_string|Spark's {{format_string}} is based on the implementation of {{java.util.Formatter}}, which means some PostgreSQL formats cannot be supported, such as: {{format_string('>>%-s<<', 'Hello')}}|
|to_hex|hex| |
|strpos|locate/position| |

was:
This ticket lists all built-in UDFs that have different names:
||PostgreSQL||Spark SQL||Note||
|random|rand| |
|format|format_string|Spark's {{format_string}} is based on the implementation of {{java.util.Formatter}}, which means some PostgreSQL formats cannot be supported, such as: {{format_string('>>%-s<<', 'Hello')}}|
|to_hex|hex| |

> List all built-in UDFs that have different names
> -------------------------------------------------
>
> Key: SPARK-27930
> URL: https://issues.apache.org/jira/browse/SPARK-27930
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Yuming Wang
> Priority: Major
>
> This ticket lists all built-in UDFs that have different names:
> ||PostgreSQL||Spark SQL||Note||
> |random|rand| |
> |format|format_string|Spark's {{format_string}} is based on the implementation of {{java.util.Formatter}}, which means some PostgreSQL formats cannot be supported, such as: {{format_string('>>%-s<<', 'Hello')}}|
> |to_hex|hex| |
> |strpos|locate/position| |
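The {{format_string}} limitation in the note above is easy to demonstrate: {{java.util.Formatter}} rejects the {{-}} (left-justify) flag when no explicit width is given, while PostgreSQL's {{format}} accepts {{%-s}}. A quick check in plain Scala (the helper and its Either-based signature are ours, for illustration):

```scala
// java.util.Formatter (which backs Spark's format_string) requires a width
// with the '-' flag, so PostgreSQL's '>>%-s<<' pattern fails, while an
// explicit width such as '%-10s' works fine.
def formats(pattern: String, arg: String): Either[String, String] =
  try Right(String.format(pattern, arg))
  catch { case e: java.util.IllegalFormatException => Left(e.getClass.getSimpleName) }
```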
[jira] [Commented] (SPARK-27969) Non-deterministic expressions in filters or projects can unnecessarily prevent all scan-time column pruning, harming performance
[ https://issues.apache.org/jira/browse/SPARK-27969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16866151#comment-16866151 ] Wenchen Fan commented on SPARK-27969:
-
We should think about what non-deterministic means in Spark, and what the guarantees are. I do agree it's too conservative in this case, but we should think of the big picture before fixing a specific case. AFAIK, expressions in Hive have 2 flags: deterministic and stateful. They have different guarantees. I haven't looked into Presto/Impala though.

> Non-deterministic expressions in filters or projects can unnecessarily prevent all scan-time column pruning, harming performance
> --------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-27969
> URL: https://issues.apache.org/jira/browse/SPARK-27969
> Project: Spark
> Issue Type: Bug
> Components: Optimizer, SQL
> Affects Versions: 2.4.0
> Reporter: Josh Rosen
> Priority: Major
>
> If a scan operator is followed by a projection or filter and those operators contain _any_ non-deterministic expressions then scan column pruning optimizations are completely skipped, harming query performance.
> Here's an example of the problem:
> {code:java}
> import org.apache.spark.sql.functions._
> val df = spark.createDataset(Seq(
>   (1, 2, 3, 4, 5),
>   (1, 2, 3, 4, 5)
> ))
> val tmpPath = java.nio.file.Files.createTempDirectory("column-pruning-bug").toString()
> df.write.parquet(tmpPath)
> val fromParquet = spark.read.parquet(tmpPath){code}
> If all expressions are deterministic then, as expected, column pruning is pushed into the scan:
> {code:java}
> fromParquet.select("_1").explain
> == Physical Plan ==
> *(1) FileScan parquet [_1#68] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[dbfs:/local_disk0/tmp/column-pruning-bug7865798834978814548], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:int>{code}
> However, if we add a non-deterministic filter then no column pruning is performed (even though pruning would be safe!):
> {code:java}
> fromParquet.select("_1").filter(rand() =!= 0).explain
> == Physical Plan ==
> *(1) Project [_1#127]
> +- *(1) Filter NOT (rand(-1515289268025792238) = 0.0)
>    +- *(1) FileScan parquet [_1#127,_2#128,_3#129,_4#130,_5#131] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[dbfs:/local_disk0/tmp/column-pruning-bug4043817424882943496], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:int,_2:int,_3:int,_4:int,_5:int>{code}
> Similarly, a non-deterministic expression in a second projection can end up being collapsed such that it prevents column pruning:
> {code:java}
> fromParquet.select("_1").select($"_1", rand()).explain
> == Physical Plan ==
> *(1) Project [_1#127, rand(1267140591146561563) AS rand(1267140591146561563)#141]
> +- *(1) FileScan parquet [_1#127,_2#128,_3#129,_4#130,_5#131] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[dbfs:/local_disk0/tmp/column-pruning-bug4043817424882943496], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:int,_2:int,_3:int,_4:int,_5:int>
> {code}
> I believe that this is caused by SPARK-10316: the Parquet column pruning code > relies on the [{{PhysicalOperation}} unapply > method|https://github.com/apache/spark/blob/v2.4.3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala#L43] > for extracting projects and filters and this helper purposely fails to match > if _any_ projection or filter is non-deterministic. > It looks like this conservative behavior may have originally been added to > avoid pushdown / re-ordering of non-deterministic filter expressions. > However, in this case I feel that it's _too_ conservative: even though we > can't push down non-deterministic filters we should still be able to perform > column pruning. > /cc [~cloud_fan] and [~marmbrus] (it looks like you [discussed collapsing of > non-deterministic > projects|https://github.com/apache/spark/pull/8486#issuecomment-136036533] in > the SPARK-10316 PR, which is related to why the third example above did not > prune). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
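The pruning-safety argument above can be sketched with a toy model (an editor's illustration in plain Python, not Spark code): a non-deterministic filter that references no pruned column keeps the same rows whether or not the scan prunes, so pruning cannot change the query result.

```python
import random

def scan(rows, columns=None):
    """Toy scan: optionally prunes to the requested columns at read time."""
    if columns is None:
        return [dict(r) for r in rows]
    return [{c: r[c] for c in columns} for r in rows]

def nondeterministic_filter(rows, seed):
    """Keeps a row when rand() != 0; non-deterministic, but references no column."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() != 0.0]

rows = [{"_1": 1, "_2": 2, "_3": 3, "_4": 4, "_5": 5}] * 2

# Unpruned plan: read all five columns, filter, then project _1.
unpruned = [{"_1": r["_1"]} for r in nondeterministic_filter(scan(rows), seed=42)]
# Pruned plan: read only _1 at scan time, then apply the same filter.
pruned = nondeterministic_filter(scan(rows, columns=["_1"]), seed=42)

assert unpruned == pruned  # pruning did not change the result
```

Re-ordering the filter past other operators is what must be avoided; the column set the scan materialises is independent of that concern.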
[jira] [Resolved] (SPARK-24175) improve the Spark 2.4 migration guide
[ https://issues.apache.org/jira/browse/SPARK-24175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-24175. - Resolution: Won't Fix > improve the Spark 2.4 migration guide > - > > Key: SPARK-24175 > URL: https://issues.apache.org/jira/browse/SPARK-24175 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Priority: Major > > The current Spark 2.4 migration guide is not well phrased. We should > 1. State the before behavior > 2. State the after behavior > 3. Add a concrete example with code to illustrate. > For example: > Since Spark 2.4, Spark compares a DATE type with a TIMESTAMP type after > promotes both sides to TIMESTAMP. To set `false` to > `spark.sql.hive.compareDateTimestampInTimestamp` restores the previous > behavior. This option will be removed in Spark 3.0. > --> > In version 2.3 and earlier, Spark implicitly casts a timestamp column to date > type when comparing with a date column. In version 2.4 and later, Spark casts > the date column to timestamp type instead. As an example, "xxx" would result > in ".." in Spark 2.3, and in Spark 2.4, the result would be "..." -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
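The before/after phrasing proposed above can be made concrete with a small sketch (an editor's illustration using Python's datetime, not Spark itself): casting the timestamp down to a date discards the time of day, while promoting the date up to a timestamp keeps it, so the two cast directions can disagree on equality.

```python
from datetime import date, datetime

ts = datetime(2019, 6, 17, 12, 0, 0)   # a timestamp at noon
d = date(2019, 6, 17)                  # the same calendar day

# Spark <= 2.3 behaviour: implicitly cast the timestamp down to a date.
old_result = ts.date() == d            # True: the time-of-day is discarded

# Spark >= 2.4 behaviour: promote the date up to a timestamp (midnight).
new_result = datetime(d.year, d.month, d.day) == ts   # False: midnight != noon

assert old_result is True
assert new_result is False
```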
[jira] [Updated] (SPARK-28067) Incorrect results in decimal aggregation with whole-stage code gen enabled
[ https://issues.apache.org/jira/browse/SPARK-28067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Sirek updated SPARK-28067: --- Description: The following test case involving a join followed by a sum aggregation returns the wrong answer for the sum: {code:java} val df = Seq( (BigDecimal("1000"), 1), (BigDecimal("1000"), 1), (BigDecimal("1000"), 2), (BigDecimal("1000"), 2), (BigDecimal("1000"), 2), (BigDecimal("1000"), 2), (BigDecimal("1000"), 2), (BigDecimal("1000"), 2), (BigDecimal("1000"), 2), (BigDecimal("1000"), 2), (BigDecimal("1000"), 2), (BigDecimal("1000"), 2)).toDF("decNum", "intNum") val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, "intNum").agg(sum("decNum"))

scala> df2.show(40,false)
+-----------+
|sum(decNum)|
+-----------+
|4000.00    |
+-----------+
{code} The result should be 104000.. It appears a partial sum is computed for each join key, as the result returned would be the answer for all rows matching intNum === 1. If only the rows with intNum === 2 are included, the answer given is null: {code:java} scala> val df3 = df.filter($"intNum" === lit(2)) df3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [decNum: decimal(38,18), intNum: int] scala> val df4 = df3.withColumnRenamed("decNum", "decNum2").join(df3, "intNum").agg(sum("decNum")) df4: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]

scala> df4.show(40,false)
+-----------+
|sum(decNum)|
+-----------+
|null       |
+-----------+
{code} The correct answer, 10., doesn't fit in the DataType picked for the result, decimal(38,18), so an overflow occurs, which Spark then converts to null. The first example, which doesn't filter out the intNum === 1 values, should also return null, indicating overflow, but it doesn't. This may mislead the user into thinking a valid sum was computed. If whole-stage code gen is turned off: spark.conf.set("spark.sql.codegen.wholeStage", false) ... 
incorrect results are not returned because the overflow is caught as an exception: java.lang.IllegalArgumentException: requirement failed: Decimal precision 39 exceeds max precision 38 was: The following test case involving a join followed by a sum aggregation returns the wrong answer for the sum: {code:java} val df = Seq( (BigDecimal("1000"), 1), (BigDecimal("1000"), 1), (BigDecimal("1000"), 2), (BigDecimal("1000"), 2), (BigDecimal("1000"), 2), (BigDecimal("1000"), 2), (BigDecimal("1000"), 2), (BigDecimal("1000"), 2), (BigDecimal("1000"), 2), (BigDecimal("1000"), 2), (BigDecimal("1000"), 2), (BigDecimal("1000"), 2)).toDF("decNum", "intNum") val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, "intNum").agg(sum("decNum"))

scala> df2.show(40,false)
+-----------+
|sum(decNum)|
+-----------+
|4000.00    |
+-----------+
{code} The result should be 104000.. It appears a partial sum is computed for each join key, as the result returned would be the answer for all rows matching intNum === 1. If only the rows with intNum === 2 are included, the answer given is null: {code:java} scala> val df3 = df.filter($"intNum" === lit(2)) df3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [decNum: decimal(38,18), intNum: int] scala> val df4 = df3.withColumnRenamed("decNum", "decNum2").join(df3, "intNum").agg(sum("decNum")) df4: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]

scala> df4.show(40,false)
+-----------+
|sum(decNum)|
+-----------+
|null       |
+-----------+
{code} The correct answer, 10., doesn't fit in the DataType picked for the result, decimal(38,18), so the overflow is converted to null. The first example, which doesn't filter out the intNum === 1 values, should also return null, indicating overflow, but it doesn't. This may mislead the user into thinking a valid sum was computed. If whole-stage code gen is turned off: spark.conf.set("spark.sql.codegen.wholeStage", false) ... 
incorrect results are not returned because the overflow is caught as an exception: java.lang.IllegalArgumentException: requirement failed: Decimal precision 39 exceeds max precision 38 > Incorrect results in decimal aggregation with whole-stage code gen enabled
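The precision limit behind that exception can be sketched with Python's decimal module (an editor's illustration; Spark's Decimal is a separate implementation, but the 38-significant-digit budget of decimal(38,18) is the same): a value that needs 39 significant digits cannot be represented exactly at precision 38.

```python
from decimal import Decimal, Context

# decimal(38,18) holds at most 38 significant digits (20 integral + 18 fractional).
ctx38 = Context(prec=38)
ctx39 = Context(prec=39)

# 21 integral digits + 18 fractional digits = 39 significant digits.
needs_39 = Decimal("1" * 21 + "." + "0" * 17 + "1")

assert ctx39.plus(needs_39) == needs_39   # representable at precision 39
assert ctx38.plus(needs_39) != needs_39   # silently rounded at precision 38
```

Whether the engine raises (as with codegen off), returns null (the SQL convention on overflow), or silently rounds is a policy choice; returning a plausible-looking partial sum, as in the first example above, is the one behaviour that cannot be correct.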
[jira] [Resolved] (SPARK-28082) Add a note to DROPMALFORMED mode of CSV for column pruning
[ https://issues.apache.org/jira/browse/SPARK-28082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28082. -- Resolution: Duplicate > Add a note to DROPMALFORMED mode of CSV for column pruning > -- > > Key: SPARK-28082 > URL: https://issues.apache.org/jira/browse/SPARK-28082 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Priority: Trivial > > This is inspired by SPARK-28058. > When using {{DROPMALFORMED}} mode, corrupted records aren't dropped if > malformed columns aren't read. This behavior is due to CSV parser column > pruning. Current doc of {{DROPMALFORMED}} doesn't mention the effect of > column pruning. Users will be confused by the fact that {{DROPMALFORMED}} > mode doesn't work as expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28082) Add a note to DROPMALFORMED mode of CSV for column pruning
[ https://issues.apache.org/jira/browse/SPARK-28082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16866098#comment-16866098 ] Hyukjin Kwon commented on SPARK-28082: -- Sorry, [~viirya] for back and forth. I think your first judgement was correct. > Add a note to DROPMALFORMED mode of CSV for column pruning > -- > > Key: SPARK-28082 > URL: https://issues.apache.org/jira/browse/SPARK-28082 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Priority: Trivial > > This is inspired by SPARK-28058. > When using {{DROPMALFORMED}} mode, corrupted records aren't dropped if > malformed columns aren't read. This behavior is due to CSV parser column > pruning. Current doc of {{DROPMALFORMED}} doesn't mention the effect of > column pruning. Users will be confused by the fact that {{DROPMALFORMED}} > mode doesn't work as expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28058) Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
[ https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16866097#comment-16866097 ] Hyukjin Kwon commented on SPARK-28058: -- Yes, that was what I saw and I thought it's a bug in Univocity. On second thought, yes, it looks like an intended behaviour in Univocity (I guess). Thanks, guys, for the clarification. > Reading csv with DROPMALFORMED sometimes doesn't drop malformed records > --- > > Key: SPARK-28058 > URL: https://issues.apache.org/jira/browse/SPARK-28058 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.1, 2.4.3 >Reporter: Stuart White >Priority: Minor > Labels: CSV, csv, csvparser > > The spark sql csv reader is not dropping malformed records as expected. > Consider this file (fruit.csv). Notice it contains a header record, 3 valid > records, and one malformed record. > {noformat} > fruit,color,price,quantity > apple,red,1,3 > banana,yellow,2,4 > orange,orange,3,5 > xxx > {noformat} > If I read this file using the spark sql csv reader as follows, everything > looks good. The malformed record is dropped. > {noformat}
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").show(truncate=false)
> +------+------+-----+--------+
> |fruit |color |price|quantity|
> +------+------+-----+--------+
> |apple |red   |1    |3       |
> |banana|yellow|2    |4       |
> |orange|orange|3    |5       |
> +------+------+-----+--------+
> {noformat} > However, if I select a subset of the columns, the malformed record is not dropped. The malformed data is placed in the first column, and the remaining > column(s) are filled with nulls. 
> {noformat}
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
> +------+
> |fruit |
> +------+
> |apple |
> |banana|
> |orange|
> |xxx   |
> +------+
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
> +------+------+
> |fruit |color |
> +------+------+
> |apple |red   |
> |banana|yellow|
> |orange|orange|
> |xxx   |null  |
> +------+------+
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price).show(truncate=false)
> +------+------+-----+
> |fruit |color |price|
> +------+------+-----+
> |apple |red   |1    |
> |banana|yellow|2    |
> |orange|orange|3    |
> |xxx   |null  |null |
> +------+------+-----+
> {noformat} > And finally, if I manually select all of the columns, the malformed record is > once again dropped. > {noformat}
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price, 'quantity).show(truncate=false)
> +------+------+-----+--------+
> |fruit |color |price|quantity|
> +------+------+-----+--------+
> |apple |red   |1    |3       |
> |banana|yellow|2    |4       |
> |orange|orange|3    |5       |
> +------+------+-----+--------+
> {noformat} > I would expect the malformed record(s) to be dropped regardless of which > columns are being selected from the file.
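A minimal sketch of the suspected interaction (an editor's illustration; the real parser is Univocity inside Spark's CSV datasource, and `parse_csv` is a hypothetical stand-in): if a record can only be detected as malformed by counting the columns the parser actually reads, column pruning hides short rows.

```python
def parse_csv(lines, header, selected):
    """Toy CSV reader: a row is malformed only if a *read* column check fails.

    With no pruning, short rows are detected and dropped (DROPMALFORMED);
    with pruned columns, the short row slips through padded with nulls,
    mirroring the behaviour reported above."""
    rows = []
    for line in lines:
        values = line.split(",")
        if len(selected) == len(header):        # no pruning: full-width check
            if len(values) != len(header):
                continue                        # drop the malformed record
        row = dict(zip(header, values))
        rows.append({c: row.get(c) for c in selected})
    return rows

header = ["fruit", "color", "price", "quantity"]
data = ["apple,red,1,3", "banana,yellow,2,4", "orange,orange,3,5", "xxx"]

assert len(parse_csv(data, header, header)) == 3           # "xxx" is dropped
assert {"fruit": "xxx"} in parse_csv(data, header, ["fruit"])  # slips through
```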
[jira] [Updated] (SPARK-27930) List all built-in UDFs have different names
[ https://issues.apache.org/jira/browse/SPARK-27930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27930: Description: This ticket lists all built-in UDFs that have different names: ||PostgreSQL||Spark SQL||Note|| |random|rand| | |format|format_string|Spark's {{format_string}} is based on the implementation of {{java.util.Formatter}}, which makes some PostgreSQL formats unsupported, such as: {{format_string('>>%-s<<', 'Hello')}}| |to_hex|hex| | was: This ticket lists all built-in UDFs that have different names: |PostgreSQL|Spark SQL|Note| |random|rand| | |format|format_string|Spark's {{format_string}} is based on the implementation of {{java.util.Formatter}}, which makes some PostgreSQL formats unsupported, such as: {{format_string('>>%-s<<', 'Hello')}}| > List all built-in UDFs that have different names > --- > > Key: SPARK-27930 > URL: https://issues.apache.org/jira/browse/SPARK-27930 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > This ticket lists all built-in UDFs that have different names: > ||PostgreSQL||Spark SQL||Note|| > |random|rand| | > |format|format_string|Spark's {{format_string}} is based on the > implementation of {{java.util.Formatter}}, > which makes some PostgreSQL formats unsupported, such as: > {{format_string('>>%-s<<', 'Hello')}}| > |to_hex|hex| |
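The mapping in the table above can be captured as a small lookup (an editor's sketch; `spark_name` is a hypothetical helper, not a Spark or PostgreSQL API):

```python
# PostgreSQL built-in name -> Spark SQL counterpart, from the table above.
PG_TO_SPARK = {
    "random": "rand",
    "format": "format_string",  # java.util.Formatter semantics, not PG's format()
    "to_hex": "hex",
    "strpos": "locate",         # "position" is an equivalent alternative
}

def spark_name(pg_name):
    """Translate a PostgreSQL function name; names not listed are shared."""
    return PG_TO_SPARK.get(pg_name, pg_name)

assert spark_name("random") == "rand"
assert spark_name("abs") == "abs"   # fall-through for identically named UDFs
```

Note the semantic caveat is not captured by a rename: even after mapping `format` to `format_string`, format strings such as `'%-s'` that PostgreSQL accepts are rejected by `java.util.Formatter`.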
[jira] [Resolved] (SPARK-28041) Increase the minimum pandas version to 0.23.2
[ https://issues.apache.org/jira/browse/SPARK-28041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28041. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24867 [https://github.com/apache/spark/pull/24867] > Increase the minimum pandas version to 0.23.2 > - > > Key: SPARK-28041 > URL: https://issues.apache.org/jira/browse/SPARK-28041 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > > > Currently, the minimum supported Pandas version is 0.19.2. We bumped up > testing in the Jenkins env to 0.23.2 and since 0.19.2 was released nearly 3 > years ago, it is not always compatible with other Python libraries. > Increasing the version to 0.23.2 will also allow some workarounds to be > removed and make maintenance easier. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28041) Increase the minimum pandas version to 0.23.2
[ https://issues.apache.org/jira/browse/SPARK-28041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-28041: Assignee: Hyukjin Kwon > Increase the minimum pandas version to 0.23.2 > - > > Key: SPARK-28041 > URL: https://issues.apache.org/jira/browse/SPARK-28041 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Assignee: Hyukjin Kwon >Priority: Major > > Currently, the minimum supported Pandas version is 0.19.2. We bumped up > testing in the Jenkins env to 0.23.2 and since 0.19.2 was released nearly 3 > years ago, it is not always compatible with other Python libraries. > Increasing the version to 0.23.2 will also allow some workarounds to be > removed and make maintenance easier. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28041) Increase the minimum pandas version to 0.23.2
[ https://issues.apache.org/jira/browse/SPARK-28041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-28041: Assignee: Bryan Cutler (was: Hyukjin Kwon) > Increase the minimum pandas version to 0.23.2 > - > > Key: SPARK-28041 > URL: https://issues.apache.org/jira/browse/SPARK-28041 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Fix For: 3.0.0 > > > Currently, the minimum supported Pandas version is 0.19.2. We bumped up > testing in the Jenkins env to 0.23.2 and since 0.19.2 was released nearly 3 > years ago, it is not always compatible with other Python libraries. > Increasing the version to 0.23.2 will also allow some workarounds to be > removed and make maintenance easier. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28056) Document SCALAR_ITER Pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-28056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-28056: - Assignee: Xiangrui Meng > Document SCALAR_ITER Pandas UDF > --- > > Key: SPARK-28056 > URL: https://issues.apache.org/jira/browse/SPARK-28056 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > > After SPARK-26412, we should document the new SCALAR_ITER Pandas UDF so user > can discover the feature and learn how to use it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28006) User-defined grouped transform pandas_udf for window operations
[ https://issues.apache.org/jira/browse/SPARK-28006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28006: Assignee: Apache Spark > User-defined grouped transform pandas_udf for window operations > --- > > Key: SPARK-28006 > URL: https://issues.apache.org/jira/browse/SPARK-28006 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 2.4.3 >Reporter: Li Jin >Assignee: Apache Spark >Priority: Major > > Currently, pandas_udf supports a "grouped aggregate" type that can be used with > bounded and unbounded windows. There is another set of use cases that can > benefit from a "grouped transform" type pandas_udf. > Grouped transform is defined as an N -> N mapping over a group. For example, > "compute zscore for values in the group using the grouped mean and grouped > stdev", or "rank the values in the group". > Currently, in order to do this, the user needs to use "grouped apply", for > example: > {code:java} > @pandas_udf(schema, GROUPED_MAP) > def subtract_mean(pdf): > v = pdf['v'] > pdf['v'] = v - v.mean() > return pdf > df.groupby('id').apply(subtract_mean)
> # +---+----+
> # | id|   v|
> # +---+----+
> # |  1|-0.5|
> # |  1| 0.5|
> # |  2|-3.0|
> # |  2|-1.0|
> # |  2| 4.0|
> # +---+----+{code} > This approach has a few downsides: > * Specifying the full return schema is complicated for the user although the > function only changes one column. > * The column name 'v' being hard-coded inside the udf makes the udf less reusable. > * The entire dataframe is serialized to pass to Python although only one > column is needed. 
> Here we propose a new type of pandas_udf to work with these types of use > cases: > {code:java} > df = spark.createDataFrame( > [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], > ("id", "v")) > @pandas_udf('double', GROUPED_XFORM) > def subtract_mean(v): > return v - v.mean() > w = Window.partitionBy('id') > df = df.withColumn('v', subtract_mean(df['v']).over(w))
> # +---+----+
> # | id|   v|
> # +---+----+
> # |  1|-0.5|
> # |  1| 0.5|
> # |  2|-3.0|
> # |  2|-1.0|
> # |  2| 4.0|
> # +---+----+{code} > Which addresses the above downsides: > * The user only needs to specify the output type of a single column. > * The column being zscored is decoupled from the udf implementation. > * We only need to send one column to the Python worker and concat the result > with the original dataframe (this is what grouped aggregate is doing already). > >
[jira] [Assigned] (SPARK-28006) User-defined grouped transform pandas_udf for window operations
[ https://issues.apache.org/jira/browse/SPARK-28006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28006: Assignee: (was: Apache Spark) > User-defined grouped transform pandas_udf for window operations > --- > > Key: SPARK-28006 > URL: https://issues.apache.org/jira/browse/SPARK-28006 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 2.4.3 >Reporter: Li Jin >Priority: Major > > Currently, pandas_udf supports a "grouped aggregate" type that can be used with > bounded and unbounded windows. There is another set of use cases that can > benefit from a "grouped transform" type pandas_udf. > Grouped transform is defined as an N -> N mapping over a group. For example, > "compute zscore for values in the group using the grouped mean and grouped > stdev", or "rank the values in the group". > Currently, in order to do this, the user needs to use "grouped apply", for > example: > {code:java} > @pandas_udf(schema, GROUPED_MAP) > def subtract_mean(pdf): > v = pdf['v'] > pdf['v'] = v - v.mean() > return pdf > df.groupby('id').apply(subtract_mean)
> # +---+----+
> # | id|   v|
> # +---+----+
> # |  1|-0.5|
> # |  1| 0.5|
> # |  2|-3.0|
> # |  2|-1.0|
> # |  2| 4.0|
> # +---+----+{code} > This approach has a few downsides: > * Specifying the full return schema is complicated for the user although the > function only changes one column. > * The column name 'v' being hard-coded inside the udf makes the udf less reusable. > * The entire dataframe is serialized to pass to Python although only one > column is needed. 
> Here we propose a new type of pandas_udf to work with these types of use > cases: > {code:java} > df = spark.createDataFrame( > [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], > ("id", "v")) > @pandas_udf('double', GROUPED_XFORM) > def subtract_mean(v): > return v - v.mean() > w = Window.partitionBy('id') > df = df.withColumn('v', subtract_mean(df['v']).over(w))
> # +---+----+
> # | id|   v|
> # +---+----+
> # |  1|-0.5|
> # |  1| 0.5|
> # |  2|-3.0|
> # |  2|-1.0|
> # |  2| 4.0|
> # +---+----+{code} > Which addresses the above downsides: > * The user only needs to specify the output type of a single column. > * The column being zscored is decoupled from the udf implementation. > * We only need to send one column to the Python worker and concat the result > with the original dataframe (this is what grouped aggregate is doing already). > >
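The proposed GROUPED_XFORM semantics, an N -> N transform applied per group and written back in row order, can be sketched without Spark (an editor's illustration; `grouped_transform` is a hypothetical stand-in for the proposed machinery, not a PySpark API):

```python
from collections import defaultdict

def grouped_transform(rows, key, col, fn):
    """Collect `col` per `key`, apply `fn` to each whole group column,
    and write the transformed values back in the original row order."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r[col])
    transformed = {k: iter(fn(vals)) for k, vals in groups.items()}
    return [{**r, col: next(transformed[r[key]])} for r in rows]

def subtract_mean(v):
    m = sum(v) / len(v)
    return [x - m for x in v]

rows = [{"id": 1, "v": 1.0}, {"id": 1, "v": 2.0},
        {"id": 2, "v": 3.0}, {"id": 2, "v": 5.0}, {"id": 2, "v": 10.0}]

result = grouped_transform(rows, "id", "v", subtract_mean)
assert [r["v"] for r in result] == [-0.5, 0.5, -3.0, -1.0, 4.0]
```

This mirrors the ticket's example table: only the `v` column crosses the group boundary, and the output has the same cardinality as the input.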
[jira] [Comment Edited] (SPARK-27767) Built-in function: generate_series
[ https://issues.apache.org/jira/browse/SPARK-27767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865912#comment-16865912 ] Dylan Guedes edited comment on SPARK-27767 at 6/17/19 8:02 PM: --- [~smilegator] by the way, I just checked and there is a (minor) difference: when you use `range()` and define it as a sub-query called `x`, for instance, the default name for the column becomes `x.id`, instead of just `x`, which is the behaviour in Postgres. For instance: {code:sql} from range(-32766, -32764) x; {code} In Spark, it looks like you must refer to these values as `x.id`. Meanwhile, in Postgres you can refer to them through just `x`. EDIT: Btw, this call also does not work: {code:sql} SELECT range(1, 100) OVER () FROM empsalary {code} was (Author: dylanguedes): [~smilegator] by the way, I just checked and there is a (minor) difference: when you use `range()` and define it as a sub-query called `x`, for instance, the default name for the column becomes `x.id`, instead of just `x`, which is the behaviour in Postgres. For instance: {code:sql} from range(-32766, -32764) x; {code} In Spark, it looks like you must refer to these values as `x.id`. Meanwhile, in Postgres you can refer to them through just `x`. > Built-in function: generate_series > -- > > Key: SPARK-27767 > URL: https://issues.apache.org/jira/browse/SPARK-27767 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > [https://www.postgresql.org/docs/9.1/functions-srf.html] > generate_series(start, stop): Generate a series of values, from start to stop > with a step size of one >
[jira] [Commented] (SPARK-27767) Built-in function: generate_series
[ https://issues.apache.org/jira/browse/SPARK-27767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865912#comment-16865912 ] Dylan Guedes commented on SPARK-27767: -- [~smilegator] by the way, I just checked and there is a (minor) difference: when you use `range()` and define it as a sub-query called `x`, for instance, the default name for the column becomes `x.id`, instead of just `x`, which is the behaviour in Postgres. For instance: {code:sql} from range(-32766, -32764) x; {code} In Spark, it looks like you must refer to these values as `x.id`. Meanwhile, in Postgres you can refer to them through just `x`. > Built-in function: generate_series > -- > > Key: SPARK-27767 > URL: https://issues.apache.org/jira/browse/SPARK-27767 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > [https://www.postgresql.org/docs/9.1/functions-srf.html] > generate_series(start, stop): Generate a series of values, from start to stop > with a step size of one >
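The naming and bounds differences discussed above can be sketched in plain Python (an editor's illustration; these helpers model the two functions rather than call them): PostgreSQL's generate_series is stop-inclusive and the subquery alias names the value, while Spark's range is end-exclusive and always emits a column named `id`.

```python
def generate_series(start, stop, step=1):
    """PostgreSQL-style generate_series: the stop value is inclusive."""
    out = []
    v = start
    while (step > 0 and v <= stop) or (step < 0 and v >= stop):
        out.append(v)
        v += step
    return out

def spark_range(start, end):
    """Spark-style range: end-exclusive, and every row carries a column
    named `id`, which is why an alias `x` must be dereferenced as `x.id`."""
    return [{"id": v} for v in range(start, end)]

assert generate_series(-32766, -32764) == [-32766, -32765, -32764]
assert spark_range(-32766, -32764) == [{"id": -32766}, {"id": -32765}]
```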
[jira] [Created] (SPARK-28086) Adds `random()` sql function
Dylan Guedes created SPARK-28086: Summary: Adds `random()` sql function Key: SPARK-28086 URL: https://issues.apache.org/jira/browse/SPARK-28086 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Dylan Guedes Currently, Spark does not have a `random()` function. Postgres, however, does. For instance, this one is not valid: {code:sql} SELECT rank() OVER (ORDER BY rank() OVER (ORDER BY random())) {code} Because of the `random()` call. On the other hand, [Postgres has it.|https://www.postgresql.org/docs/8.2/functions-math.html] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28085) Spark Scala API documentation URLs not working properly in Chrome
Andrew Leverentz created SPARK-28085: Summary: Spark Scala API documentation URLs not working properly in Chrome Key: SPARK-28085 URL: https://issues.apache.org/jira/browse/SPARK-28085 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 2.4.3 Reporter: Andrew Leverentz In Chrome version 75, URLs in the Scala API documentation are not working properly, which makes them difficult to bookmark. For example, URLs like the following get redirected to a generic "root" package page: [https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html] [https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset] Here's the URL that I get : [https://spark.apache.org/docs/latest/api/scala/index.html#package] This issue seems to have appeared between versions 74 and 75 of Chrome, but the documentation URLs still work in Safari. I suspect that this has something to do with security-related changes to how Chrome 75 handles frames and/or redirects. I've reported this issue to the Chrome team via the in-browser help menu, but I don't have any visibility into their response, so it's not clear whether they'll consider this a bug or "working as intended". -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27989) Add retries on the connection to the driver
[ https://issues.apache.org/jira/browse/SPARK-27989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jose Luis Pedrosa updated SPARK-27989: -- Description: Any failure in the executor when trying to connect to the driver will make a connection from that process impossible, which will trigger the scheduling of another executor. was: Due to Java's caching of negative DNS resolutions (failed requests are never retried), any DNS failure when trying to connect to the driver will make a connection from that process impossible. This happens especially in Kubernetes, where network setup of pods can take some time. > Add retries on the connection to the driver > --- > > Key: SPARK-27989 > URL: https://issues.apache.org/jira/browse/SPARK-27989 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.3 >Reporter: Jose Luis Pedrosa >Priority: Minor > > > Any failure in the executor when trying to connect to the driver will make > a connection from that process impossible, which will trigger the scheduling of > another executor. >
[jira] [Commented] (SPARK-28084) LOAD DATA command resolving the partition column name considering case sensitive manner
[ https://issues.apache.org/jira/browse/SPARK-28084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865758#comment-16865758 ] Sujith Chacko commented on SPARK-28084: --- In the insert command the partition column is resolved using the resolver, which resolves the names based on the spark.sql.caseSensitive property. The same logic can be applied for resolving the partition column names in the LOAD DATA command. I am working on this and will raise a PR soon. > LOAD DATA command resolving the partition column name considering case > sensitive manner > --- > > Key: SPARK-28084 > URL: https://issues.apache.org/jira/browse/SPARK-28084 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Sujith Chacko >Priority: Major > Attachments: parition_casesensitive.PNG > > > The LOAD DATA command resolves the partition column name in a case-sensitive > manner, whereas the insert command resolves it case-insensitively. > Refer to the snapshot for more details. > !image-2019-06-18-00-04-22-475.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
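The resolution logic described in the comment above, a single resolver that honors spark.sql.caseSensitive for both INSERT and LOAD DATA, can be illustrated with a small Python sketch. The function is a hypothetical stand-in for Catalyst's resolver, not actual Spark code.

```python
def resolve_column(name, table_columns, case_sensitive=False):
    """Resolve `name` against `table_columns` the way a resolver honoring
    spark.sql.caseSensitive would: exact match when case-sensitive,
    case-insensitive match otherwise. Returns the canonical column name,
    or None if there is no match."""
    for col in table_columns:
        matched = (name == col) if case_sensitive else (name.lower() == col.lower())
        if matched:
            return col
    return None

partitions = ["Year", "Month"]  # hypothetical partition columns

# Case-insensitive (the Spark default): 'year' resolves to 'Year'.
ci = resolve_column("year", partitions)
print(ci)  # Year

# Case-sensitive: 'year' no longer matches 'Year'.
cs = resolve_column("year", partitions, case_sensitive=True)
print(cs)  # None
```

Applying the same resolver to LOAD DATA would make its partition-column matching consistent with INSERT, which is the fix the comment proposes.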
[jira] [Assigned] (SPARK-24898) Adding spark.checkpoint.compress to the docs
[ https://issues.apache.org/jira/browse/SPARK-24898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-24898: - Assignee: Sandeep > Adding spark.checkpoint.compress to the docs > > > Key: SPARK-24898 > URL: https://issues.apache.org/jira/browse/SPARK-24898 > Project: Spark > Issue Type: Task > Components: Documentation >Affects Versions: 2.2.0 >Reporter: Riccardo Corbella >Assignee: Sandeep >Priority: Trivial > Fix For: 2.3.4, 2.4.4, 3.0.0 > > > Parameter *spark.checkpoint.compress* is not listed under configuration > properties. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24898) Adding spark.checkpoint.compress to the docs
[ https://issues.apache.org/jira/browse/SPARK-24898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24898: -- Issue Type: Task (was: Bug) > Adding spark.checkpoint.compress to the docs > > > Key: SPARK-24898 > URL: https://issues.apache.org/jira/browse/SPARK-24898 > Project: Spark > Issue Type: Task > Components: Documentation >Affects Versions: 2.2.0 >Reporter: Riccardo Corbella >Priority: Trivial > Fix For: 2.3.4, 2.4.4, 3.0.0 > > > Parameter *spark.checkpoint.compress* is not listed under configuration > properties. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28084) LOAD DATA command resolving the partition column name considering case sensitive manner
[ https://issues.apache.org/jira/browse/SPARK-28084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sujith Chacko updated SPARK-28084: -- Attachment: parition_casesensitive.PNG > LOAD DATA command resolving the partition column name considering case > sensitive manner > --- > > Key: SPARK-28084 > URL: https://issues.apache.org/jira/browse/SPARK-28084 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Sujith Chacko >Priority: Major > Attachments: parition_casesensitive.PNG > > > The LOAD DATA command resolves the partition column name in a case-sensitive > manner, whereas the insert command resolves it case-insensitively. > Refer to the snapshot for more details. > !image-2019-06-18-00-04-22-475.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28084) LOAD DATA command resolving the partition column name considering case sensitive manner
Sujith Chacko created SPARK-28084: - Summary: LOAD DATA command resolving the partition column name considering case sensitive manner Key: SPARK-28084 URL: https://issues.apache.org/jira/browse/SPARK-28084 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.3 Reporter: Sujith Chacko Attachments: parition_casesensitive.PNG The LOAD DATA command resolves the partition column name in a case-sensitive manner, whereas the insert command resolves it case-insensitively. Refer to the snapshot for more details. !image-2019-06-18-00-04-22-475.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-28058) Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
[ https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865714#comment-16865714 ] Liang-Chi Hsieh edited comment on SPARK-28058 at 6/17/19 3:59 PM: -- [~hyukjin.kwon] Do you mean this is suspected to be a bug: {code} scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false) +--+--+ |fruit |color | +--+--+ |apple |red | |banana|yellow| |orange|orange| |xxx |null | +--+--+ {code} In this case, the reader should read two columns, but the corrupted record has only one column, so it should reasonably be dropped as malformed. Instead, we see the missing column filled with null. This seems to be inherited from the Univocity parser, when we use {{CsvParserSettings.selectIndexes}} to do field selection. In the above case, the parser returns two tokens, where the second token is just null. I'm not sure whether this is known behavior of the Univocity parser, or a bug in the Univocity parser. was (Author: viirya): [~hyukjin.kwon] Do you mean this is suspected to be a bug: {code} scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false) +--+--+ |fruit |color | +--+--+ |apple |red | |banana|yellow| |orange|orange| |xxx |null | +--+--+ {code} In this case, the reader should read two columns. But the corrupted record has only one column. Reasonably, it should be dropped as a malformed one. But we see the missing column is filled with null. > Reading csv with DROPMALFORMED sometimes doesn't drop malformed records > --- > > Key: SPARK-28058 > URL: https://issues.apache.org/jira/browse/SPARK-28058 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.1, 2.4.3 >Reporter: Stuart White >Priority: Minor > Labels: CSV, csv, csvparser > > The spark sql csv reader is not dropping malformed records as expected. > Consider this file (fruit.csv). 
Notice it contains a header record, 3 valid > records, and one malformed record. > {noformat} > fruit,color,price,quantity > apple,red,1,3 > banana,yellow,2,4 > orange,orange,3,5 > xxx > {noformat} > If I read this file using the spark sql csv reader as follows, everything > looks good. The malformed record is dropped. > {noformat} > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").show(truncate=false) > +--+--+-++ > > |fruit |color |price|quantity| > +--+--+-++ > |apple |red |1|3 | > |banana|yellow|2|4 | > |orange|orange|3|5 | > +--+--+-++ > {noformat} > However, if I select a subset of the columns, the malformed record is not > dropped. The malformed data is placed in the first column, and the remaining > column(s) are filled with nulls. > {noformat} > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false) > +--+ > |fruit | > +--+ > |apple | > |banana| > |orange| > |xxx | > +--+ > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false) > +--+--+ > |fruit |color | > +--+--+ > |apple |red | > |banana|yellow| > |orange|orange| > |xxx |null | > +--+--+ > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, > 'price).show(truncate=false) > +--+--+-+ > |fruit |color |price| > +--+--+-+ > |apple |red |1| > |banana|yellow|2| > |orange|orange|3| > |xxx |null |null | > +--+--+-+ > {noformat} > And finally, if I manually select all of the columns, the malformed record is > once again dropped. 
> {noformat} > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price, > 'quantity).show(truncate=false) > +--+--+-++ > |fruit |color |price|quantity| > +--+--+-++ > |apple |red |1|3 | > |banana|yellow|2|4 | > |orange|orange|3|5 | > +--+--+-++ > {noformat} > I would expect the malformed record(s) to be dropped regardless of which > columns are being selected from the file. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mai
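The behavior reported in SPARK-28058 can be mimicked outside Spark with a toy parser: when malformed-row detection only examines the selected (pruned) columns, a short row that happens to cover those columns is no longer flagged. This is a simplified Python illustration of the column-pruning effect, not Univocity's or Spark's actual parsing code, and the pruning heuristic here is an assumption for demonstration.

```python
import csv
import io

# The same fruit.csv from the report: three valid rows plus one short row.
DATA = """fruit,color,price,quantity
apple,red,1,3
banana,yellow,2,4
orange,orange,3,5
xxx
"""

def read_dropmalformed(text, selected):
    """Toy DROPMALFORMED reader. When all columns are read, a row whose
    token count differs from the header is malformed and dropped. When
    only a subset is selected (column pruning), missing selected values
    surface as None instead of marking the row malformed."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    idx = [header.index(c) for c in selected]
    prune = len(selected) < len(header)  # prune when not all columns are read
    out = []
    for row in body:
        if not prune and len(row) != len(header):
            continue  # full-width check: the short 'xxx' row is dropped
        out.append([row[i] if i < len(row) else None for i in idx])
    return out

# Subset of columns: the short row survives, padded with None (the nulls above).
subset = read_dropmalformed(DATA, ["fruit", "color"])
print(subset)

# All columns: the short row is detected as malformed and dropped.
full = read_dropmalformed(DATA, ["fruit", "color", "price", "quantity"])
print(full)
```

This reproduces the report's pattern in miniature: `['xxx', None]` appears in the subset read but no `xxx` row survives the full read, which is why selecting all columns restores the expected DROPMALFORMED behavior.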
[jira] [Updated] (SPARK-28082) Add a note to DROPMALFORMED mode of CSV for column pruning
[ https://issues.apache.org/jira/browse/SPARK-28082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28082: -- Component/s: Documentation > Add a note to DROPMALFORMED mode of CSV for column pruning > -- > > Key: SPARK-28082 > URL: https://issues.apache.org/jira/browse/SPARK-28082 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Priority: Trivial > > This is inspired by SPARK-28058. > When using {{DROPMALFORMED}} mode, corrupted records aren't dropped if > malformed columns aren't read. This behavior is due to CSV parser column > pruning. Current doc of {{DROPMALFORMED}} doesn't mention the effect of > column pruning. Users will be confused by the fact that {{DROPMALFORMED}} > mode doesn't work as expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25341) Support rolling back a shuffle map stage and re-generate the shuffle files
[ https://issues.apache.org/jira/browse/SPARK-25341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25341: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Support rolling back a shuffle map stage and re-generate the shuffle files > -- > > Key: SPARK-25341 > URL: https://issues.apache.org/jira/browse/SPARK-25341 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > This is a follow up of https://issues.apache.org/jira/browse/SPARK-23243 > To completely fix that problem, Spark needs to be able to rollback a shuffle > map stage and rerun all the map tasks. > According to https://github.com/apache/spark/pull/9214 , Spark doesn't > support it currently, as in shuffle writing "first write wins". > Since overwriting shuffle files is hard, we can extend the shuffle id to > include a "shuffle generation number". Then the reduce task can specify which > generation of shuffle it wants to read. > https://github.com/apache/spark/pull/6648 seems in the right direction. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
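The "shuffle generation number" idea in SPARK-25341 can be sketched with a tiny Python model: shuffle outputs are keyed by (shuffle id, generation), so a rolled-back map stage writes a fresh generation instead of overwriting existing files, and reducers name the generation they want to read. All class and method names here are illustrative assumptions, not Spark's shuffle code.

```python
class ShuffleRegistry:
    """Toy model: map (shuffle_id, generation) -> map outputs."""

    def __init__(self):
        self._outputs = {}
        self._generation = {}  # current generation per shuffle id

    def write(self, shuffle_id, data):
        gen = self._generation.get(shuffle_id, 0)
        # "first write wins" within a generation: never overwrite an output
        self._outputs.setdefault((shuffle_id, gen), data)
        return gen

    def rollback(self, shuffle_id):
        # Re-running the map stage bumps the generation; old files stay intact,
        # sidestepping the hard problem of overwriting shuffle files.
        self._generation[shuffle_id] = self._generation.get(shuffle_id, 0) + 1
        return self._generation[shuffle_id]

    def read(self, shuffle_id, generation):
        return self._outputs[(shuffle_id, generation)]

reg = ShuffleRegistry()
reg.write(7, ["old map output"])
gen = reg.rollback(7)                    # e.g. after a fetch failure
reg.write(7, ["regenerated map output"])
print(reg.read(7, gen))                  # reducers request the new generation
```

Because reads are addressed by generation, in-flight reducers of the old generation and restarted reducers of the new one never see mixed outputs.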
[jira] [Updated] (SPARK-28074) [SS] Document caveats on using multiple stateful operations in single query
[ https://issues.apache.org/jira/browse/SPARK-28074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28074: -- Component/s: Documentation > [SS] Document caveats on using multiple stateful operations in single query > --- > > Key: SPARK-28074 > URL: https://issues.apache.org/jira/browse/SPARK-28074 > Project: Spark > Issue Type: Documentation > Components: Documentation, Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Major > > Please refer to > [https://lists.apache.org/thread.html/cc6489a19316e7382661d305fabd8c21915e5faf6a928b4869ac2b4a@%3Cdev.spark.apache.org%3E] > for the rationale behind this issue. > tl;dr. Using multiple stateful operators without fully understanding how the > global watermark affects the query can cause correctness issues. > While we may not want to fully block these operations, since some users could > leverage them with an understanding of the global watermark, it would be > necessary to warn end users about the possibility of broken correctness and > provide alternative(s). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28066) Optimize UTF8String.trim() for common case of no whitespace
[ https://issues.apache.org/jira/browse/SPARK-28066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28066. --- Resolution: Fixed Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/24884 > Optimize UTF8String.trim() for common case of no whitespace > --- > > Key: SPARK-28066 > URL: https://issues.apache.org/jira/browse/SPARK-28066 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.3 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > Fix For: 3.0.0 > > > {{UTF8String.trim()}} allocates a new object even if the string has no > whitespace, when it can just return itself. A simple check for this case > makes the method about 3x faster in the common case. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
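The optimization in SPARK-28066 (skip allocation when trim() would be a no-op) translates directly to other languages: check the first and last characters before building a new string. A hedged Python sketch of the idea; the real UTF8String code operates on UTF-8 bytes rather than str objects.

```python
def trim_fast(s):
    """Return `s` itself when it has no leading/trailing spaces, avoiding
    any new allocation; otherwise fall back to a real trim. Mirrors the
    fast path described for UTF8String.trim()."""
    if s and s[0] != ' ' and s[-1] != ' ':
        return s  # common case: nothing to strip, return the same object
    return s.strip(' ')

word = "no_whitespace"
assert trim_fast(word) is word  # identity: no copy was made
print(trim_fast("  padded  "))  # padded
```

The win comes from the common case: most strings passed to trim have no surrounding whitespace, so the cheap boundary check avoids an allocation on nearly every call.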
[jira] [Commented] (SPARK-28058) Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
[ https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865714#comment-16865714 ] Liang-Chi Hsieh commented on SPARK-28058: - [~hyukjin.kwon] Do you mean this is suspected to be a bug: {code} scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false) +--+--+ |fruit |color | +--+--+ |apple |red | |banana|yellow| |orange|orange| |xxx |null | +--+--+ {code} In this case, the reader should read two columns, but the corrupted record has only one column, so it should reasonably be dropped as malformed. Instead, we see the missing column filled with null. > Reading csv with DROPMALFORMED sometimes doesn't drop malformed records > --- > > Key: SPARK-28058 > URL: https://issues.apache.org/jira/browse/SPARK-28058 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.1, 2.4.3 >Reporter: Stuart White >Priority: Minor > Labels: CSV, csv, csvparser > > The spark sql csv reader is not dropping malformed records as expected. > Consider this file (fruit.csv). Notice it contains a header record, 3 valid > records, and one malformed record. > {noformat} > fruit,color,price,quantity > apple,red,1,3 > banana,yellow,2,4 > orange,orange,3,5 > xxx > {noformat} > If I read this file using the spark sql csv reader as follows, everything > looks good. The malformed record is dropped. > {noformat} > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").show(truncate=false) > +--+--+-++ > > |fruit |color |price|quantity| > +--+--+-++ > |apple |red |1|3 | > |banana|yellow|2|4 | > |orange|orange|3|5 | > +--+--+-++ > {noformat} > However, if I select a subset of the columns, the malformed record is not > dropped. The malformed data is placed in the first column, and the remaining > column(s) are filled with nulls. 
> {noformat} > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false) > +--+ > |fruit | > +--+ > |apple | > |banana| > |orange| > |xxx | > +--+ > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false) > +--+--+ > |fruit |color | > +--+--+ > |apple |red | > |banana|yellow| > |orange|orange| > |xxx |null | > +--+--+ > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, > 'price).show(truncate=false) > +--+--+-+ > |fruit |color |price| > +--+--+-+ > |apple |red |1| > |banana|yellow|2| > |orange|orange|3| > |xxx |null |null | > +--+--+-+ > {noformat} > And finally, if I manually select all of the columns, the malformed record is > once again dropped. > {noformat} > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price, > 'quantity).show(truncate=false) > +--+--+-++ > |fruit |color |price|quantity| > +--+--+-++ > |apple |red |1|3 | > |banana|yellow|2|4 | > |orange|orange|3|5 | > +--+--+-++ > {noformat} > I would expect the malformed record(s) to be dropped regardless of which > columns are being selected from the file. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28083) ANSI SQL: LIKE predicate: ESCAPE clause
Yuming Wang created SPARK-28083: --- Summary: ANSI SQL: LIKE predicate: ESCAPE clause Key: SPARK-28083 URL: https://issues.apache.org/jira/browse/SPARK-28083 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang Format: {noformat} ::= | ::= ::= [ NOT ] LIKE [ ESCAPE ] ::= ::= ::= ::= [ NOT ] LIKE [ ESCAPE ] ::= ::= {noformat} [https://www.postgresql.org/docs/11/functions-matching.html] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
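A LIKE predicate with an ESCAPE clause, as specified in the grammar above, can be prototyped by translating the SQL pattern into a regular expression: `%` and `_` become `.*` and `.` unless preceded by the escape character. This is a hypothetical Python sketch to show the intended semantics, not Spark's eventual implementation.

```python
import re

def sql_like(text, pattern, escape=None):
    """Evaluate `text LIKE pattern [ESCAPE escape]` by compiling the SQL
    pattern into an anchored regex. A wildcard preceded by the escape
    character matches literally."""
    regex, i = [], 0
    while i < len(pattern):
        ch = pattern[i]
        if escape is not None and ch == escape and i + 1 < len(pattern):
            regex.append(re.escape(pattern[i + 1]))  # escaped char is literal
            i += 2
            continue
        if ch == '%':
            regex.append('.*')   # % matches any sequence, including empty
        elif ch == '_':
            regex.append('.')    # _ matches exactly one character
        else:
            regex.append(re.escape(ch))
        i += 1
    return re.fullmatch(''.join(regex), text, flags=re.DOTALL) is not None

print(sql_like("100%", "100\\%", escape="\\"))  # True: \% is a literal percent
print(sql_like("100x", "100\\%", escape="\\"))  # False: % no longer a wildcard
print(sql_like("abc", "a_c"))                   # True: _ matches one character
```

The escape clause is what lets a query match literal `%` or `_` characters in data, which plain LIKE patterns cannot express.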
[jira] [Assigned] (SPARK-28082) Add a note to DROPMALFORMED mode of CSV for column pruning
[ https://issues.apache.org/jira/browse/SPARK-28082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28082: Assignee: Apache Spark > Add a note to DROPMALFORMED mode of CSV for column pruning > -- > > Key: SPARK-28082 > URL: https://issues.apache.org/jira/browse/SPARK-28082 > Project: Spark > Issue Type: Documentation > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark >Priority: Trivial > > This is inspired by SPARK-28058. > When using {{DROPMALFORMED}} mode, corrupted records aren't dropped if > malformed columns aren't read. This behavior is due to CSV parser column > pruning. Current doc of {{DROPMALFORMED}} doesn't mention the effect of > column pruning. Users will be confused by the fact that {{DROPMALFORMED}} > mode doesn't work as expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28082) Add a note to DROPMALFORMED mode of CSV for column pruning
[ https://issues.apache.org/jira/browse/SPARK-28082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28082: Assignee: (was: Apache Spark) > Add a note to DROPMALFORMED mode of CSV for column pruning > -- > > Key: SPARK-28082 > URL: https://issues.apache.org/jira/browse/SPARK-28082 > Project: Spark > Issue Type: Documentation > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Priority: Trivial > > This is inspired by SPARK-28058. > When using {{DROPMALFORMED}} mode, corrupted records aren't dropped if > malformed columns aren't read. This behavior is due to CSV parser column > pruning. Current doc of {{DROPMALFORMED}} doesn't mention the effect of > column pruning. Users will be confused by the fact that {{DROPMALFORMED}} > mode doesn't work as expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28082) Add a note to DROPMALFORMED mode of CSV for column pruning
[ https://issues.apache.org/jira/browse/SPARK-28082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865694#comment-16865694 ] Apache Spark commented on SPARK-28082: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/24894 > Add a note to DROPMALFORMED mode of CSV for column pruning > -- > > Key: SPARK-28082 > URL: https://issues.apache.org/jira/browse/SPARK-28082 > Project: Spark > Issue Type: Documentation > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Priority: Trivial > > This is inspired by SPARK-28058. > When using {{DROPMALFORMED}} mode, corrupted records aren't dropped if > malformed columns aren't read. This behavior is due to CSV parser column > pruning. Current doc of {{DROPMALFORMED}} doesn't mention the effect of > column pruning. Users will be confused by the fact that {{DROPMALFORMED}} > mode doesn't work as expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28058) Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
[ https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865695#comment-16865695 ] Liang-Chi Hsieh commented on SPARK-28058: - [~stwhit] Thanks for letting us know that! Although it is in the migration guide, for new users it should be good to have the note in {{DROPMALFORMED}} doc. I filed SPARK-28082 to track the doc improvement. > Reading csv with DROPMALFORMED sometimes doesn't drop malformed records > --- > > Key: SPARK-28058 > URL: https://issues.apache.org/jira/browse/SPARK-28058 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.1, 2.4.3 >Reporter: Stuart White >Priority: Minor > Labels: CSV, csv, csvparser > > The spark sql csv reader is not dropping malformed records as expected. > Consider this file (fruit.csv). Notice it contains a header record, 3 valid > records, and one malformed record. > {noformat} > fruit,color,price,quantity > apple,red,1,3 > banana,yellow,2,4 > orange,orange,3,5 > xxx > {noformat} > If I read this file using the spark sql csv reader as follows, everything > looks good. The malformed record is dropped. > {noformat} > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").show(truncate=false) > +--+--+-++ > > |fruit |color |price|quantity| > +--+--+-++ > |apple |red |1|3 | > |banana|yellow|2|4 | > |orange|orange|3|5 | > +--+--+-++ > {noformat} > However, if I select a subset of the columns, the malformed record is not > dropped. The malformed data is placed in the first column, and the remaining > column(s) are filled with nulls. 
> {noformat} > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false) > +--+ > |fruit | > +--+ > |apple | > |banana| > |orange| > |xxx | > +--+ > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false) > +--+--+ > |fruit |color | > +--+--+ > |apple |red | > |banana|yellow| > |orange|orange| > |xxx |null | > +--+--+ > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, > 'price).show(truncate=false) > +--+--+-+ > |fruit |color |price| > +--+--+-+ > |apple |red |1| > |banana|yellow|2| > |orange|orange|3| > |xxx |null |null | > +--+--+-+ > {noformat} > And finally, if I manually select all of the columns, the malformed record is > once again dropped. > {noformat} > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price, > 'quantity).show(truncate=false) > +--+--+-++ > |fruit |color |price|quantity| > +--+--+-++ > |apple |red |1|3 | > |banana|yellow|2|4 | > |orange|orange|3|5 | > +--+--+-++ > {noformat} > I would expect the malformed record(s) to be dropped regardless of which > columns are being selected from the file. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28082) Add a note to DROPMALFORMED mode of CSV for column pruning
Liang-Chi Hsieh created SPARK-28082: --- Summary: Add a note to DROPMALFORMED mode of CSV for column pruning Key: SPARK-28082 URL: https://issues.apache.org/jira/browse/SPARK-28082 Project: Spark Issue Type: Documentation Components: PySpark, SQL Affects Versions: 3.0.0 Reporter: Liang-Chi Hsieh This is inspired by SPARK-28058. When using {{DROPMALFORMED}} mode, corrupted records aren't dropped if malformed columns aren't read. This behavior is due to CSV parser column pruning. Current doc of {{DROPMALFORMED}} doesn't mention the effect of column pruning. Users will be confused by the fact that {{DROPMALFORMED}} mode doesn't work as expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27989) Add retries on the connection to the driver
[ https://issues.apache.org/jira/browse/SPARK-27989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jose Luis Pedrosa updated SPARK-27989: -- Component/s: (was: Kubernetes) > Add retries on the connection to the driver > --- > > Key: SPARK-27989 > URL: https://issues.apache.org/jira/browse/SPARK-27989 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.3 >Reporter: Jose Luis Pedrosa >Priority: Minor > > Due to Java's caching of negative DNS resolution (failed requests are never > retried), any DNS failure when trying to connect to the driver makes a > connection from that process impossible. > This happens especially in Kubernetes, where network setup of pods can take > some time. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-28058) Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
[ https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865679#comment-16865679 ] Stuart White edited comment on SPARK-28058 at 6/17/19 3:08 PM: --- Thank you both for your responses. I now see that at the [Spark SQL Upgrading Guide|https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html], under the [Upgrading From Spark SQL 2.3 to 2.4|https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html#upgrading-from-spark-sql-23-to-24] section, it states: {noformat} In version 2.3 and earlier, CSV rows are considered as malformed if at least one column value in the row is malformed. CSV parser dropped such rows in the DROPMALFORMED mode or outputs an error in the FAILFAST mode. Since Spark 2.4, CSV row is considered as malformed only when it contains malformed column values requested from CSV datasource, other values can be ignored. As an example, CSV file contains the “id,name” header and one row “1234”. In Spark 2.4, selection of the id column consists of a row with one column value 1234 but in Spark 2.3 and earlier it is empty in the DROPMALFORMED mode. To restore the previous behavior, set spark.sql.csv.parser.columnPruning.enabled to false. {noformat} I had not noticed that until you called the {{spark.sql.csv.parser.columnPruning.enabled}} option to my attention. Thanks again for the help! was (Author: stwhit): Thank you both for your responses. I now see that at the [Spark SQL Upgrading Guide|https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html], under the [Upgrading From Spark SQL 2.3 to 2.4|https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html#upgrading-from-spark-sql-23-to-24] section, it states: {noformat} In version 2.3 and earlier, CSV rows are considered as malformed if at least one column value in the row is malformed. CSV parser dropped such rows in the DROPMALFORMED mode or outputs an error in the FAILFAST mode. 
Since Spark 2.4, CSV row is considered as malformed only when it contains malformed column values requested from CSV datasource, other values can be ignored. As an example, CSV file contains the “id,name” header and one row “1234”. In Spark 2.4, selection of the id column consists of a row with one column value 1234 but in Spark 2.3 and earlier it is empty in the DROPMALFORMED mode. To restore the previous behavior, set spark.sql.csv.parser.columnPruning.enabled to false. {noformat} I had not noticed that until you called the {{spark.sql.csv.parser.columnPruning.enabled}} option to my attention. Thanks again for the help! > Reading csv with DROPMALFORMED sometimes doesn't drop malformed records > --- > > Key: SPARK-28058 > URL: https://issues.apache.org/jira/browse/SPARK-28058 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.1, 2.4.3 >Reporter: Stuart White >Priority: Minor > Labels: CSV, csv, csvparser > > The spark sql csv reader is not dropping malformed records as expected. > Consider this file (fruit.csv). Notice it contains a header record, 3 valid > records, and one malformed record. > {noformat} > fruit,color,price,quantity > apple,red,1,3 > banana,yellow,2,4 > orange,orange,3,5 > xxx > {noformat} > If I read this file using the spark sql csv reader as follows, everything > looks good. The malformed record is dropped. > {noformat} > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").show(truncate=false) > +--+--+-++ > > |fruit |color |price|quantity| > +--+--+-++ > |apple |red |1|3 | > |banana|yellow|2|4 | > |orange|orange|3|5 | > +--+--+-++ > {noformat} > However, if I select a subset of the columns, the malformed record is not > dropped. The malformed data is placed in the first column, and the remaining > column(s) are filled with nulls. 
> {noformat} > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false) > +--+ > |fruit | > +--+ > |apple | > |banana| > |orange| > |xxx | > +--+ > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false) > +--+--+ > |fruit |color | > +--+--+ > |apple |red | > |banana|yellow| > |orange|orange| > |xxx |null | > +--+--+ > scala> spark.read.option("header", "true").option("mode", > "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, > 'price).show(truncate=false) > +--+--+-+ > |fruit |color |price| > +
[jira] [Commented] (SPARK-28058) Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
[ https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865679#comment-16865679 ]

Stuart White commented on SPARK-28058:
--------------------------------------

Thank you both for your responses. I now see that at the [Spark SQL Upgrading Guide|https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html], under the [Upgrading From Spark SQL 2.3 to 2.4|https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html#upgrading-from-spark-sql-23-to-24] section, it states:

{noformat}
In version 2.3 and earlier, CSV rows are considered as malformed if at least one column value in the row is malformed. CSV parser dropped such rows in the DROPMALFORMED mode or outputs an error in the FAILFAST mode. Since Spark 2.4, CSV row is considered as malformed only when it contains malformed column values requested from CSV datasource, other values can be ignored. As an example, CSV file contains the "id,name" header and one row "1234". In Spark 2.4, selection of the id column consists of a row with one column value 1234 but in Spark 2.3 and earlier it is empty in the DROPMALFORMED mode. To restore the previous behavior, set spark.sql.csv.parser.columnPruning.enabled to false.
{noformat}

I had not noticed that until you called the {{spark.sql.csv.parser.columnPruning.enabled}} option to my attention. Thanks again for the help!

> Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
> ------------------------------------------------------------------------
>
>                 Key: SPARK-28058
>                 URL: https://issues.apache.org/jira/browse/SPARK-28058
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.1, 2.4.3
>            Reporter: Stuart White
>            Priority: Minor
>              Labels: CSV, csv, csvparser
>
> The spark sql csv reader is not dropping malformed records as expected.
> Consider this file (fruit.csv). Notice it contains a header record, 3 valid records, and one malformed record.
>
> {noformat}
> fruit,color,price,quantity
> apple,red,1,3
> banana,yellow,2,4
> orange,orange,3,5
> xxx
> {noformat}
>
> If I read this file using the spark sql csv reader as follows, everything looks good. The malformed record is dropped.
>
> {noformat}
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").show(truncate=false)
> +------+------+-----+--------+
> |fruit |color |price|quantity|
> +------+------+-----+--------+
> |apple |red   |1    |3       |
> |banana|yellow|2    |4       |
> |orange|orange|3    |5       |
> +------+------+-----+--------+
> {noformat}
>
> However, if I select a subset of the columns, the malformed record is not dropped. The malformed data is placed in the first column, and the remaining column(s) are filled with nulls.
>
> {noformat}
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
> +------+
> |fruit |
> +------+
> |apple |
> |banana|
> |orange|
> |xxx   |
> +------+
>
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
> +------+------+
> |fruit |color |
> +------+------+
> |apple |red   |
> |banana|yellow|
> |orange|orange|
> |xxx   |null  |
> +------+------+
>
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price).show(truncate=false)
> +------+------+-----+
> |fruit |color |price|
> +------+------+-----+
> |apple |red   |1    |
> |banana|yellow|2    |
> |orange|orange|3    |
> |xxx   |null  |null |
> +------+------+-----+
> {noformat}
>
> And finally, if I manually select all of the columns, the malformed record is once again dropped.
>
> {noformat}
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price, 'quantity).show(truncate=false)
> +------+------+-----+--------+
> |fruit |color |price|quantity|
> +------+------+-----+--------+
> |apple |red   |1    |3       |
> |banana|yellow|2    |4       |
> |orange|orange|3    |5       |
> +------+------+-----+--------+
> {noformat}
>
> I would expect the malformed record(s) to be dropped regardless of which columns are being selected from the file.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28058) Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
[ https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-28058:
------------------------------------

    Assignee: Apache Spark
[jira] [Assigned] (SPARK-28058) Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
[ https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-28058:
------------------------------------

    Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-28081) word2vec 'large' count value too low for very large corpora
[ https://issues.apache.org/jira/browse/SPARK-28081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-28081:
------------------------------------

    Assignee: Sean Owen  (was: Apache Spark)

> word2vec 'large' count value too low for very large corpora
> -----------------------------------------------------------
>
>                 Key: SPARK-28081
>                 URL: https://issues.apache.org/jira/browse/SPARK-28081
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.4.3
>            Reporter: Sean Owen
>            Assignee: Sean Owen
>            Priority: Minor
>
> The word2vec implementation operates on word counts, and uses a hard-coded value of 1e9 to mean "a very large count, larger than any actual count". However, this causes the logic to fail if, in fact, a large corpus has some words that really do occur more than this many times. We can probably improve the implementation to better handle very large counts in general.
[jira] [Assigned] (SPARK-28081) word2vec 'large' count value too low for very large corpora
[ https://issues.apache.org/jira/browse/SPARK-28081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-28081:
------------------------------------

    Assignee: Apache Spark  (was: Sean Owen)
[jira] [Commented] (SPARK-28058) Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
[ https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865664#comment-16865664 ]

Liang-Chi Hsieh commented on SPARK-28058:
-----------------------------------------

Although this isn't a bug, I think it might be worth adding a note to current doc to explain/clarify it.
[jira] [Commented] (SPARK-28058) Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
[ https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865650#comment-16865650 ]

Liang-Chi Hsieh commented on SPARK-28058:
-----------------------------------------

This is due to CSV parser column pruning. You can disable it, and the behavior will be what you expect:

{code:java}
scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
+------+
|fruit |
+------+
|apple |
|banana|
|orange|
|xxx   |
+------+

scala> spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", false)

scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
+------+
|fruit |
+------+
|apple |
|banana|
|orange|
+------+
{code}
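The effect of column pruning described in this thread can be modeled without Spark at all. The sketch below is an illustrative pure-Python model of the *observed* behavior, not Spark's actual code path: with pruning on, a short row only triggers the token-count mismatch that marks it malformed when every column is requested.

```python
import csv
import io

DATA = ("fruit,color,price,quantity\n"
        "apple,red,1,3\n"
        "banana,yellow,2,4\n"
        "orange,orange,3,5\n"
        "xxx\n")

def read_dropmalformed(text, columns, prune=True):
    """Toy model of DROPMALFORMED under spark.sql.csv.parser.columnPruning.

    With pruning (the Spark 2.4 default), only the requested columns are
    materialized, so a short row like 'xxx' never produces a token-count
    mismatch unless all columns are requested; missing fields become None.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    pruned = prune and len(columns) < len(header)
    rows = []
    for raw in reader:
        if not pruned and len(raw) != len(header):
            continue  # token-count mismatch -> malformed -> dropped
        padded = raw + [None] * (len(header) - len(raw))
        record = dict(zip(header, padded))
        rows.append([record[c] for c in columns])
    return rows

# Pruning on: the malformed row survives a narrow projection...
print(read_dropmalformed(DATA, ["fruit"]))
# ...but is dropped when every column is requested, or when pruning is off.
print(read_dropmalformed(DATA, ["fruit"], prune=False))
```

This reproduces all four cases from the bug report: `xxx` survives any narrow projection under pruning (with nulls filling the unselected fields), and is dropped both for the full-width select and whenever pruning is disabled.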
[jira] [Created] (SPARK-28081) word2vec 'large' count value too low for very large corpora
Sean Owen created SPARK-28081:
---------------------------------

             Summary: word2vec 'large' count value too low for very large corpora
                 Key: SPARK-28081
                 URL: https://issues.apache.org/jira/browse/SPARK-28081
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.4.3
            Reporter: Sean Owen
            Assignee: Sean Owen

The word2vec implementation operates on word counts, and uses a hard-coded value of 1e9 to mean "a very large count, larger than any actual count". However, this causes the logic to fail if, in fact, a large corpus has some words that really do occur more than this many times. We can probably improve the implementation to better handle very large counts in general.
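The failure mode is easy to see in miniature: any finite sentinel intended to be "larger than any real count" silently clamps words once a real count reaches it. A small illustrative Python sketch follows (hypothetical logic, not the actual MLlib code):

```python
SENTINEL = int(1e9)  # hard-coded "larger than any actual count" value

def effective_count(count, cap=SENTINEL):
    # Any logic written under the assumption `count < cap` breaks once a
    # real count reaches the sentinel: a genuinely frequent word is
    # clamped and becomes indistinguishable from the sentinel itself.
    return min(count, cap)

print(effective_count(5_000))          # a normal word is unaffected
print(effective_count(3_000_000_000))  # a web-scale count collapses to the sentinel
```

On a corpus where a stopword really does occur more than 1e9 times, the clamped value no longer reflects the true count, which is the class of bug this issue describes.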
[jira] [Commented] (SPARK-28076) String Functions: SUBSTRING support regular expression
[ https://issues.apache.org/jira/browse/SPARK-28076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865608#comment-16865608 ]

Yuming Wang commented on SPARK-28076:
-------------------------------------

cc [~lipzhu] Would you like to pick this up?

> String Functions: SUBSTRING support regular expression
> ------------------------------------------------------
>
>                 Key: SPARK-28076
>                 URL: https://issues.apache.org/jira/browse/SPARK-28076
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>
> ||Function||Return Type||Description||Example||Result||
> |{{substring(_string_ from _pattern_)}}|{{text}}|Extract substring matching POSIX regular expression. See [Section 9.7|https://www.postgresql.org/docs/11/functions-matching.html] for more information on pattern matching.|{{substring('Thomas' from '...$')}}|{{mas}}|
> |{{substring(_string_ from _pattern_ for _escape_)}}|{{text}}|Extract substring matching SQL regular expression. See [Section 9.7|https://www.postgresql.org/docs/11/functions-matching.html] for more information on pattern matching.|{{substring('Thomas' from '%#"o_a#"_' for '#')}}|{{oma}}|
>
> For example:
> {code:sql}
> -- T581 regular expression substring (with SQL's bizarre regexp syntax)
> SELECT SUBSTRING('abcdefg' FROM 'a#"(b_d)#"%' FOR '#') AS "bcd";
> {code}
>
> https://www.postgresql.org/docs/11/functions-string.html
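For readers unfamiliar with the POSIX-regex form of PostgreSQL's {{substring}}, the semantics can be approximated in a few lines of Python (an illustrative analogue, not Spark's eventual API): the pattern is matched anywhere in the string, and if it contains a parenthesized subexpression, the first capture group is returned instead of the whole match.

```python
import re

def substring_from(s, pattern):
    """Rough analogue of PostgreSQL's substring(string FROM pattern).

    Returns the text matched by the regex; if the pattern contains a
    capture group, the first group is returned instead of the full match.
    Returns None when nothing matches.
    """
    m = re.search(pattern, s)
    if m is None:
        return None
    return m.group(1) if m.groups() else m.group(0)

print(substring_from("Thomas", "...$"))   # last three characters
print(substring_from("foobar", "o(b.)"))  # first capture group only
```

The first call reproduces the `substring('Thomas' from '...$')` example from the table above; the `for escape` variant uses SQL `SIMILAR TO` syntax and is not modeled here.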
[jira] [Updated] (SPARK-28012) Hive UDF supports struct type foldable expression
[ https://issues.apache.org/jira/browse/SPARK-28012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dzcxzl updated SPARK-28012:
---------------------------

    Summary: Hive UDF supports struct type foldable expression  (was: Hive UDF supports literal struct type)

> Hive UDF supports struct type foldable expression
> -------------------------------------------------
>
>                 Key: SPARK-28012
>                 URL: https://issues.apache.org/jira/browse/SPARK-28012
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: dzcxzl
>            Priority: Trivial
>
> Currently, when a Hive UDF is called with a struct-type parameter, an exception is thrown:
>
> No handler for Hive UDF 'xxxUDF': java.lang.RuntimeException: Hive doesn't support the constant type [StructType(StructField(name,StringType,true), StructField(value,DecimalType(3,1),true))]
[jira] [Updated] (SPARK-28012) Hive UDF supports literal struct type
[ https://issues.apache.org/jira/browse/SPARK-28012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dzcxzl updated SPARK-28012:
---------------------------

    Description: 
Currently using hive udf, the parameter is struct type, there will be an exception thrown.

No handler for Hive UDF 'xxxUDF': java.lang.RuntimeException: Hive doesn't support the constant type [StructType(StructField(name,StringType,true), StructField(value,DecimalType(3,1),true))]

  was:
Currently using hive udf, the parameter is literal struct type, will report an error.

No handler for Hive UDF 'xxxUDF': java.lang.RuntimeException: Hive doesn't support the constant type [StructType(StructField(name,StringType,true), StructField(value,DecimalType(3,1),true))]
[jira] [Commented] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source
[ https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865592#comment-16865592 ]

Dominic Ricard commented on SPARK-21067:
----------------------------------------

In 2.4, the problem also affects Parquet table creation, so we had to move all our CTAS to a different process which does not use the thrift server.

> Thrift Server - CTAS fail with Unable to move source
> ----------------------------------------------------
>
>                 Key: SPARK-21067
>                 URL: https://issues.apache.org/jira/browse/SPARK-21067
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.1, 2.2.0, 2.4.0, 2.4.3
>         Environment: Yarn
> Hive MetaStore
> HDFS (HA)
>            Reporter: Dominic Ricard
>            Priority: Major
>         Attachments: SPARK-21067.patch
>
> After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS would fail, sometimes. Most of the time, the CTAS would work only once, after starting the thrift server. After that, dropping the table and re-issuing the same CTAS would fail with the following message (sometimes it fails right away, sometimes it works for a long period of time):
>
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0 to destination hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; (state=,code=0)
> {noformat}
>
> We have already found the following Jira (https://issues.apache.org/jira/browse/SPARK-11021), which states that {{hive.exec.stagingdir}} had to be added in order for Spark to be able to handle CREATE TABLE properly as of 2.0. As you can see in the error, we have ours set to "/tmp/hive-staging/\{user.name\}".
>
> Same issue with INSERT statements:
>
> {noformat}
> CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE dricard.test SELECT 1;
> Error: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0 to destination hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; (state=,code=0)
> {noformat}
>
> This worked fine in 1.6.2, which we currently run in our Production Environment, but since 2.0+ we haven't been able to CREATE TABLE consistently on the cluster.
>
> SQL to reproduce the issue:
>
> {noformat}
> DROP SCHEMA IF EXISTS dricard CASCADE;
> CREATE SCHEMA dricard;
> CREATE TABLE dricard.test (col1 int);
> INSERT INTO TABLE dricard.test SELECT 1;
> SELECT * from dricard.test;
> DROP TABLE dricard.test;
> CREATE TABLE dricard.test AS select 1 as `col1`;
> SELECT * from dricard.test
> {noformat}
>
> The Thrift server usually fails at INSERT. We tried the same procedure in a spark context using spark.sql() and didn't encounter the same issue.
>
> Full stack trace:
>
> {noformat}
> 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error executing query, currentState RUNNING, org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0 to destination hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0;
>         at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
>         at org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766)
>         at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374)
>         at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221)
>         at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>         at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>         at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>         at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>         at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
>         at org.apache.spark.sql.execution.QueryExecution.toRdd
[jira] [Created] (SPARK-28080) There is a problem to download and watch offline the history of an application with multiple attempts due to UI inconsistency
Gal Weiss created SPARK-28080:
---------------------------------

             Summary: There is a problem to download and watch offline the history of an application with multiple attempts due to UI inconsistency
                 Key: SPARK-28080
                 URL: https://issues.apache.org/jira/browse/SPARK-28080
             Project: Spark
          Issue Type: Bug
          Components: Web UI
    Affects Versions: 2.3.1
         Environment: I used the spark-2.4.3-bin-hadoop2.7 and spark-2.3.1-bin-hadoop2.7 packages from [https://spark.apache.org/downloads.html], running the history server locally as-is (using default values) on Ubuntu 16.04.4 under WSL (Windows Subsystem for Linux) on my Windows 10 machine. Browser used is Firefox 67.0.2 (64-bit) for Windows.
            Reporter: Gal Weiss

Overview: If you try to view a Spark application's attempt history locally, viewing the history of the first attempt (or any attempt but the last one) fails because of a UI inconsistency. In the Spark history UI, the "App ID" column is clickable and always takes you to the application's *last attempt*, so if you downloaded only the first attempt, you get an "application not found" error.

How to reproduce:
# Open any Spark history server (if using Azure HDInsight, the address would be https://.azurehdinsight.net/sparkhistory/).
# Look for an application that has multiple attempts (i.e., attempt ID > 1).
# Find the *first* attempt of this application and download it using the "download" button in the event column. Save it in your local Spark history folder (default: /tmp/spark-events).
# Start a local Spark history server (typically using the start-history-server.sh script).
# Browse to the local history server and look for the application whose history you downloaded.
# Click the application name in the "App ID" column, and you get the following error: "Application not found."

Why?
Because the remote history server assumes that all attempts' history files are present, the "App ID" column points to the latest attempt of the app, while the "Attempt ID" column points to the specific attempt. So if an application has two attempts and we only want to research the first one, we download it locally, open it with the local history server, and intuitively click the link in the "App ID" column; that link actually points to the second attempt, which we haven't even downloaded.
[jira] [Updated] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source
[ https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dominic Ricard updated SPARK-21067:
-----------------------------------

    Affects Version/s: 2.4.3
[jira] [Updated] (SPARK-28078) String Functions: Add support other 4 REGEXP functions
[ https://issues.apache.org/jira/browse/SPARK-28078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28078: Summary: String Functions: Add support other 4 REGEXP functions (was: String Functions: Add support other 4 REGEXP_ functions) > String Functions: Add support other 4 REGEXP functions > -- > > Key: SPARK-28078 > URL: https://issues.apache.org/jira/browse/SPARK-28078 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > {{regexp_match}}, {{regexp_matches}}, {{regexp_split_to_array}} and > {{regexp_split_to_table}} > [https://www.postgresql.org/docs/11/functions-string.html] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28079) CSV fails to detect corrupt record unless "columnNameOfCorruptRecord" is manually added to the schema
F Jimenez created SPARK-28079: - Summary: CSV fails to detect corrupt record unless "columnNameOfCorruptRecord" is manually added to the schema Key: SPARK-28079 URL: https://issues.apache.org/jira/browse/SPARK-28079 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.3, 2.3.2 Reporter: F Jimenez When reading a CSV with mode = "PERMISSIVE", corrupt records are not flagged as such and are read in. The only way to get them flagged is to manually set "columnNameOfCorruptRecord" AND manually set the schema including this column. Example:
{code:java}
// Second row has a 4th column that is not declared in the header/schema
val csvText = s"""
  | FieldA, FieldB, FieldC
  | a1,b1,c1
  | a2,b2,c2,d*""".stripMargin
val csvFile = new File("/tmp/file.csv")
FileUtils.write(csvFile, csvText)
val reader = sqlContext.read
  .format("csv")
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "corrupt")
  .schema("corrupt STRING, fieldA STRING, fieldB STRING, fieldC STRING")
reader.load(csvFile.getAbsolutePath).show(truncate = false)
{code}
This produces the correct result:
{code:java}
+------------+------+------+------+
|corrupt     |fieldA|fieldB|fieldC|
+------------+------+------+------+
|null        |a1    |b1    |c1    |
|a2,b2,c2,d* |a2    |b2    |c2    |
+------------+------+------+------+
{code}
However, removing the "schema" option and going:
{code:java}
val reader = sqlContext.read
  .format("csv")
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "corrupt")
reader.load(csvFile.getAbsolutePath).show(truncate = false)
{code}
yields:
{code:java}
+------+------+------+
|FieldA|FieldB|FieldC|
+------+------+------+
|a1    |b1    |c1    |
|a2    |b2    |c2    |
+------+------+------+
{code}
The fourth value "d*" in the second row has been removed and the row has not been marked as corrupt.
[jira] [Created] (SPARK-28078) String Functions: Add support other 4 REGEXP_ functions
Yuming Wang created SPARK-28078: --- Summary: String Functions: Add support other 4 REGEXP_ functions Key: SPARK-28078 URL: https://issues.apache.org/jira/browse/SPARK-28078 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang {{regexp_match}}, {{regexp_matches}}, {{regexp_split_to_array}} and {{regexp_split_to_table}} [https://www.postgresql.org/docs/11/functions-string.html]
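For readers unfamiliar with the PostgreSQL functions this ticket names, here is a minimal Python sketch (illustrative only, not Spark or PostgreSQL code) of the semantics of {{regexp_match}} and {{regexp_split_to_array}}; the function names simply mirror the PostgreSQL ones:

```python
import re

def regexp_match(string, pattern):
    """First match only: return the list of capture groups (or the whole
    match if the pattern has no groups), like PostgreSQL's regexp_match;
    None when nothing matches."""
    m = re.search(pattern, string)
    if m is None:
        return None
    return list(m.groups()) if m.groups() else [m.group(0)]

def regexp_split_to_array(string, pattern):
    """Split on every occurrence of the pattern, like regexp_split_to_array."""
    return re.split(pattern, string)

print(regexp_match("foobarbequebaz", "bar.*que"))   # ['barbeque']
print(regexp_split_to_array("hello world", r"\s+")) # ['hello', 'world']
```

{{regexp_matches}} and {{regexp_split_to_table}} are the set-returning variants of the same two operations.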
[jira] [Updated] (SPARK-28075) String Functions: Enhance TRIM function
[ https://issues.apache.org/jira/browse/SPARK-28075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28075: Summary: String Functions: Enhance TRIM function (was: Enhance TRIM function) > String Functions: Enhance TRIM function > --- > > Key: SPARK-28075 > URL: https://issues.apache.org/jira/browse/SPARK-28075 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Add support for the {{TRIM(BOTH/LEADING/TRAILING FROM str)}} format.
[jira] [Created] (SPARK-28077) String Functions: Add support OVERLAY
Yuming Wang created SPARK-28077: --- Summary: String Functions: Add support OVERLAY Key: SPARK-28077 URL: https://issues.apache.org/jira/browse/SPARK-28077 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang
||Function||Return Type||Description||Example||Result||
|{{overlay(_string_ placing _string_ from _int_ [for _int_])}}|{{text}}|Replace substring|{{overlay('Txxxxas' placing 'hom' from 2 for 4)}}|{{Thomas}}|
For example:
{code:sql}
SELECT OVERLAY('abcdef' PLACING '45' FROM 4) AS "abc45f";
SELECT OVERLAY('yabadoo' PLACING 'daba' FROM 5) AS "yabadaba";
SELECT OVERLAY('yabadoo' PLACING 'daba' FROM 5 FOR 0) AS "yabadabadoo";
SELECT OVERLAY('babosa' PLACING 'ubb' FROM 2 FOR 4) AS "bubba";
{code}
https://www.postgresql.org/docs/11/functions-string.html
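As a quick illustration of the requested semantics, here is a Python sketch (not Spark code) assuming the behavior follows the PostgreSQL examples above: positions are 1-based, and FOR defaults to the length of the replacement string.

```python
def overlay(s, placing, start, length=None):
    """Sketch of SQL OVERLAY(s PLACING placing FROM start [FOR length])."""
    if length is None:
        length = len(placing)
    # SQL string positions are 1-based, hence the start - 1 offsets
    return s[:start - 1] + placing + s[start - 1 + length:]

print(overlay("abcdef", "45", 4))        # abc45f
print(overlay("yabadoo", "daba", 5, 0))  # yabadabadoo
```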
[jira] [Updated] (SPARK-28076) String Functions: Support regular expression substring
[ https://issues.apache.org/jira/browse/SPARK-28076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28076: Summary: String Functions: Support regular expression substring (was: Support regular expression substring) > String Functions: Support regular expression substring > -- > > Key: SPARK-28076 > URL: https://issues.apache.org/jira/browse/SPARK-28076 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > >
> ||Function||Return Type||Description||Example||Result||
> |{{substring(_string_ from _pattern_)}}|{{text}}|Extract substring matching POSIX regular expression. See [Section 9.7|https://www.postgresql.org/docs/11/functions-matching.html] for more information on pattern matching.|{{substring('Thomas' from '...$')}}|{{mas}}|
> |{{substring(_string_ from _pattern_ for _escape_)}}|{{text}}|Extract substring matching SQL regular expression. See [Section 9.7|https://www.postgresql.org/docs/11/functions-matching.html] for more information on pattern matching.|{{substring('Thomas' from '%#"o_a#"_' for '#')}}|{{oma}}|
> For example:
> {code:sql}
> -- T581 regular expression substring (with SQL's bizarre regexp syntax)
> SELECT SUBSTRING('abcdefg' FROM 'a#"(b_d)#"%' FOR '#') AS "bcd";
> {code}
> https://www.postgresql.org/docs/11/functions-string.html
[jira] [Updated] (SPARK-28076) String Functions: SUBSTRING support regular expression
[ https://issues.apache.org/jira/browse/SPARK-28076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28076: Summary: String Functions: SUBSTRING support regular expression (was: String Functions: Support regular expression substring) > String Functions: SUBSTRING support regular expression > -- > > Key: SPARK-28076 > URL: https://issues.apache.org/jira/browse/SPARK-28076 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > >
> ||Function||Return Type||Description||Example||Result||
> |{{substring(_string_ from _pattern_)}}|{{text}}|Extract substring matching POSIX regular expression. See [Section 9.7|https://www.postgresql.org/docs/11/functions-matching.html] for more information on pattern matching.|{{substring('Thomas' from '...$')}}|{{mas}}|
> |{{substring(_string_ from _pattern_ for _escape_)}}|{{text}}|Extract substring matching SQL regular expression. See [Section 9.7|https://www.postgresql.org/docs/11/functions-matching.html] for more information on pattern matching.|{{substring('Thomas' from '%#"o_a#"_' for '#')}}|{{oma}}|
> For example:
> {code:sql}
> -- T581 regular expression substring (with SQL's bizarre regexp syntax)
> SELECT SUBSTRING('abcdefg' FROM 'a#"(b_d)#"%' FOR '#') AS "bcd";
> {code}
> https://www.postgresql.org/docs/11/functions-string.html
[jira] [Updated] (SPARK-28076) Support regular expression substring
[ https://issues.apache.org/jira/browse/SPARK-28076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28076: Description:
||Function||Return Type||Description||Example||Result||
|{{substring(_string_ from _pattern_)}}|{{text}}|Extract substring matching POSIX regular expression. See [Section 9.7|https://www.postgresql.org/docs/11/functions-matching.html] for more information on pattern matching.|{{substring('Thomas' from '...$')}}|{{mas}}|
|{{substring(_string_ from _pattern_ for _escape_)}}|{{text}}|Extract substring matching SQL regular expression. See [Section 9.7|https://www.postgresql.org/docs/11/functions-matching.html] for more information on pattern matching.|{{substring('Thomas' from '%#"o_a#"_' for '#')}}|{{oma}}|
For example:
{code:sql}
-- T581 regular expression substring (with SQL's bizarre regexp syntax)
SELECT SUBSTRING('abcdefg' FROM 'a#"(b_d)#"%' FOR '#') AS "bcd";
{code}
https://www.postgresql.org/docs/11/functions-string.html
was: For example:
{code:sql}
-- T581 regular expression substring (with SQL's bizarre regexp syntax)
SELECT SUBSTRING('abcdefg' FROM 'a#"(b_d)#"%' FOR '#') AS "bcd";
{code}
> Support regular expression substring > > > Key: SPARK-28076 > URL: https://issues.apache.org/jira/browse/SPARK-28076 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > >
> ||Function||Return Type||Description||Example||Result||
> |{{substring(_string_ from _pattern_)}}|{{text}}|Extract substring matching POSIX regular expression. See [Section 9.7|https://www.postgresql.org/docs/11/functions-matching.html] for more information on pattern matching.|{{substring('Thomas' from '...$')}}|{{mas}}|
> |{{substring(_string_ from _pattern_ for _escape_)}}|{{text}}|Extract substring matching SQL regular expression. See [Section 9.7|https://www.postgresql.org/docs/11/functions-matching.html] for more information on pattern matching.|{{substring('Thomas' from '%#"o_a#"_' for '#')}}|{{oma}}|
> For example:
> {code:sql}
> -- T581 regular expression substring (with SQL's bizarre regexp syntax)
> SELECT SUBSTRING('abcdefg' FROM 'a#"(b_d)#"%' FOR '#') AS "bcd";
> {code}
> https://www.postgresql.org/docs/11/functions-string.html
[jira] [Created] (SPARK-28076) Support regular expression substring
Yuming Wang created SPARK-28076: --- Summary: Support regular expression substring Key: SPARK-28076 URL: https://issues.apache.org/jira/browse/SPARK-28076 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang For example: {code:sql} -- T581 regular expression substring (with SQL's bizarre regexp syntax) SELECT SUBSTRING('abcdefg' FROM 'a#"(b_d)#"%' FOR '#') AS "bcd"; {code}
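The POSIX-regex form of this function is easy to picture; here is an illustrative Python sketch (not Spark code, and deliberately not covering the SIMILAR-TO escape form shown in the SQL example above):

```python
import re

def substring_posix(s, pattern):
    """Sketch of SQL substring(string FROM pattern), POSIX-regex form:
    return the first capture group if the pattern has one, otherwise the
    whole match; None when there is no match."""
    m = re.search(pattern, s)
    if m is None:
        return None
    return m.group(1) if m.groups() else m.group(0)

print(substring_posix("Thomas", "...$"))     # mas
print(substring_posix("abcdefg", "a(b.d)"))  # bcd
```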
[jira] [Commented] (SPARK-27463) Support Dataframe Cogroup via Pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-27463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865433#comment-16865433 ] Chris Martin commented on SPARK-27463: -- sounds good to me too. > Support Dataframe Cogroup via Pandas UDFs > -- > > Key: SPARK-27463 > URL: https://issues.apache.org/jira/browse/SPARK-27463 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Chris Martin >Priority: Major > > Recent work on Pandas UDFs in Spark has allowed for improved > interoperability between Pandas and Spark. This proposal aims to extend this > by introducing a new Pandas UDF type which would allow for a cogroup > operation to be applied to two PySpark DataFrames. > Full details are in the google document linked below.
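To make the cogroup operation itself concrete, here is a toy single-machine Python sketch of its semantics, using plain dicts rather than DataFrames (everything here is illustrative, not the proposed PySpark API):

```python
from collections import defaultdict

def cogroup(left, right, key):
    """Yield (k, left_rows, right_rows) for every key appearing in either
    input, pairing up the per-key groups of the two datasets. A user
    function could then be applied to each pair of groups."""
    lg, rg = defaultdict(list), defaultdict(list)
    for row in left:
        lg[row[key]].append(row)
    for row in right:
        rg[row[key]].append(row)
    for k in sorted(set(lg) | set(rg)):
        yield k, lg[k], rg[k]

left = [{"id": 1, "v": 10}, {"id": 2, "v": 20}]
right = [{"id": 1, "w": 5}]
for k, l, r in cogroup(left, right, "id"):
    print(k, l, r)
```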
[jira] [Assigned] (SPARK-28075) Enhance TRIM function
[ https://issues.apache.org/jira/browse/SPARK-28075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28075: Assignee: Apache Spark > Enhance TRIM function > - > > Key: SPARK-28075 > URL: https://issues.apache.org/jira/browse/SPARK-28075 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > Add support for the {{TRIM(BOTH/LEADING/TRAILING FROM str)}} format.
[jira] [Assigned] (SPARK-28075) Enhance TRIM function
[ https://issues.apache.org/jira/browse/SPARK-28075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28075: Assignee: (was: Apache Spark) > Enhance TRIM function > - > > Key: SPARK-28075 > URL: https://issues.apache.org/jira/browse/SPARK-28075 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Add support for the {{TRIM(BOTH/LEADING/TRAILING FROM str)}} format.
[jira] [Created] (SPARK-28075) Enhance TRIM function
Yuming Wang created SPARK-28075: --- Summary: Enhance TRIM function Key: SPARK-28075 URL: https://issues.apache.org/jira/browse/SPARK-28075 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang Add support for the {{TRIM(BOTH/LEADING/TRAILING FROM str)}} format.
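The three trimming modes map directly onto familiar string operations; here is an illustrative Python sketch (not Spark code; note Python's strip family treats its argument as a set of characters, which matches PostgreSQL's TRIM behavior):

```python
def sql_trim(s, where="BOTH", chars=" "):
    """Sketch of TRIM(BOTH/LEADING/TRAILING [chars] FROM s);
    chars defaults to a space, as in SQL."""
    if where == "LEADING":
        return s.lstrip(chars)
    if where == "TRAILING":
        return s.rstrip(chars)
    return s.strip(chars)

print(sql_trim("  spark  "))               # spark
print(sql_trim("xxsparkxx", "BOTH", "x"))  # spark
```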
[jira] [Assigned] (SPARK-23128) A new approach to do adaptive execution in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-23128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-23128: --- Assignee: Carson Wang (was: Maryann Xue) > A new approach to do adaptive execution in Spark SQL > > > Key: SPARK-23128 > URL: https://issues.apache.org/jira/browse/SPARK-23128 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Carson Wang >Assignee: Carson Wang >Priority: Major > Fix For: 3.0.0 > > Attachments: AdaptiveExecutioninBaidu.pdf > > > SPARK-9850 proposed the basic idea of adaptive execution in Spark. In > DAGScheduler, a new API is added to support submitting a single map stage. > The current implementation of adaptive execution in Spark SQL supports > changing the reducer number at runtime. An Exchange coordinator is used to > determine the number of post-shuffle partitions for a stage that needs to > fetch shuffle data from one or multiple stages. The current implementation > adds ExchangeCoordinator while we are adding Exchanges. However there are > some limitations. First, it may cause additional shuffles that may decrease > the performance. We can see this from EnsureRequirements rule when it adds > ExchangeCoordinator. Secondly, it is not a good idea to add > ExchangeCoordinators while we are adding Exchanges because we don’t have a > global picture of all shuffle dependencies of a post-shuffle stage. I.e. for > 3 tables’ join in a single stage, the same ExchangeCoordinator should be used > in three Exchanges but currently two separated ExchangeCoordinator will be > added. Thirdly, with the current framework it is not easy to implement other > features in adaptive execution flexibly like changing the execution plan and > handling skewed join at runtime. > We'd like to introduce a new way to do adaptive execution in Spark SQL and > address the limitations. 
The idea is described at > [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing]
[jira] [Commented] (SPARK-23128) A new approach to do adaptive execution in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-23128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865396#comment-16865396 ] Wenchen Fan commented on SPARK-23128: - To split credits, I'm re-assigning this ticket to [~carsonwang] . > A new approach to do adaptive execution in Spark SQL > > > Key: SPARK-23128 > URL: https://issues.apache.org/jira/browse/SPARK-23128 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Carson Wang >Assignee: Maryann Xue >Priority: Major > Fix For: 3.0.0 > > Attachments: AdaptiveExecutioninBaidu.pdf > > > SPARK-9850 proposed the basic idea of adaptive execution in Spark. In > DAGScheduler, a new API is added to support submitting a single map stage. > The current implementation of adaptive execution in Spark SQL supports > changing the reducer number at runtime. An Exchange coordinator is used to > determine the number of post-shuffle partitions for a stage that needs to > fetch shuffle data from one or multiple stages. The current implementation > adds ExchangeCoordinator while we are adding Exchanges. However there are > some limitations. First, it may cause additional shuffles that may decrease > the performance. We can see this from EnsureRequirements rule when it adds > ExchangeCoordinator. Secondly, it is not a good idea to add > ExchangeCoordinators while we are adding Exchanges because we don’t have a > global picture of all shuffle dependencies of a post-shuffle stage. I.e. for > 3 tables’ join in a single stage, the same ExchangeCoordinator should be used > in three Exchanges but currently two separated ExchangeCoordinator will be > added. Thirdly, with the current framework it is not easy to implement other > features in adaptive execution flexibly like changing the execution plan and > handling skewed join at runtime. > We'd like to introduce a new way to do adaptive execution in Spark SQL and > address the limitations. 
The idea is described at > [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing]
[jira] [Assigned] (SPARK-28074) [SS] Document caveats on using multiple stateful operations in single query
[ https://issues.apache.org/jira/browse/SPARK-28074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28074: Assignee: Apache Spark > [SS] Document caveats on using multiple stateful operations in single query > --- > > Key: SPARK-28074 > URL: https://issues.apache.org/jira/browse/SPARK-28074 > Project: Spark > Issue Type: Documentation > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Apache Spark >Priority: Major > > Please refer to > [https://lists.apache.org/thread.html/cc6489a19316e7382661d305fabd8c21915e5faf6a928b4869ac2b4a@%3Cdev.spark.apache.org%3E] > for the rationale behind this issue. > tl;dr. Using multiple stateful operators without fully understanding how the > global watermark affects the query brings correctness issues. > While we may not want to fully block such operations, given that some users could > leverage them with an understanding of the global watermark, it is necessary to > warn end users about the possibility of broken correctness and to provide > alternative(s).
[jira] [Assigned] (SPARK-28074) [SS] Document caveats on using multiple stateful operations in single query
[ https://issues.apache.org/jira/browse/SPARK-28074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28074: Assignee: (was: Apache Spark) > [SS] Document caveats on using multiple stateful operations in single query > --- > > Key: SPARK-28074 > URL: https://issues.apache.org/jira/browse/SPARK-28074 > Project: Spark > Issue Type: Documentation > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Major > > Please refer to > [https://lists.apache.org/thread.html/cc6489a19316e7382661d305fabd8c21915e5faf6a928b4869ac2b4a@%3Cdev.spark.apache.org%3E] > for the rationale behind this issue. > tl;dr. Using multiple stateful operators without fully understanding how the > global watermark affects the query brings correctness issues. > While we may not want to fully block such operations, given that some users could > leverage them with an understanding of the global watermark, it is necessary to > warn end users about the possibility of broken correctness and to provide > alternative(s).
[jira] [Created] (SPARK-28074) [SS] Document caveats on using multiple stateful operations in single query
Jungtaek Lim created SPARK-28074: Summary: [SS] Document caveats on using multiple stateful operations in single query Key: SPARK-28074 URL: https://issues.apache.org/jira/browse/SPARK-28074 Project: Spark Issue Type: Documentation Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Jungtaek Lim Please refer to [https://lists.apache.org/thread.html/cc6489a19316e7382661d305fabd8c21915e5faf6a928b4869ac2b4a@%3Cdev.spark.apache.org%3E] for the rationale behind this issue. tl;dr. Using multiple stateful operators without fully understanding how the global watermark affects the query brings correctness issues. While we may not want to fully block such operations, given that some users could leverage them with an understanding of the global watermark, it is necessary to warn end users about the possibility of broken correctness and to provide alternative(s).