[jira] [Commented] (SPARK-46778) get_json_object flattens wildcard queries that match a single value
[ https://issues.apache.org/jira/browse/SPARK-46778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17823371#comment-17823371 ]

Pablo Langa Blanco commented on SPARK-46778:

I was looking at it and I found a comment in the code that explains this behavior ([https://github.com/apache/spark/blob/35bced42474e3221cf61d13a142c3c5470df1f22/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L377]). There are tests around that code that exercise it, and I reproduced it in Hive 3.1.3, which still shows the same behavior, so I don't know if we can change it.

> get_json_object flattens wildcard queries that match a single value
> -------------------------------------------------------------------
>
>                 Key: SPARK-46778
>                 URL: https://issues.apache.org/jira/browse/SPARK-46778
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.1
>            Reporter: Robert Joseph Evans
>            Priority: Major
>
> I think this impacts all versions of {{get_json_object}}, but I am not 100% sure.
> The unit test for [$.store.book[*].reader|https://github.com/apache/spark/blob/39f8e1a5953b5897f893151d24dc585a80c0c8a0/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala#L142-L146] verifies that the output of this query is a single-level JSON array, but when I put the same JSON and JSON path into [http://jsonpath.com/] I get a result with multiple levels of nesting. It looks like Apache Spark tries to flatten lists for {{[*]}} matches when there is only a single element that matches.
> {code:java}
> scala> Seq("""[{"a":"A"},{"b":"B"}]""","""[{"a":"A"},{"a":"B"}]""").toDF("jsonStr").selectExpr("""get_json_object(jsonStr,"$[*].a")""").show(false)
> +--------------------------------+
> |get_json_object(jsonStr, $[*].a)|
> +--------------------------------+
> |"A"                             |
> |["A","B"]                       |
> +--------------------------------+
> {code}
> But this has problems in that I no longer have a consistent schema returned, even if the input schema is known to be consistent.
> For example, if I wanted to know how many elements matched this query, I could wrap it in a {{json_array_length}}, but that does not work in the generic case.
> {code:java}
> scala> Seq("""[{"a":"A"},{"b":"B"}]""","""[{"a":"A"},{"a":"B"}]""").toDF("jsonStr").selectExpr("""json_array_length(get_json_object(jsonStr,"$[*].a"))""").show(false)
> +----------------------------------------------------+
> |json_array_length(get_json_object(jsonStr, $[*].a)) |
> +----------------------------------------------------+
> |null                                                 |
> |2                                                    |
> +----------------------------------------------------+
> {code}
> If the value returned might itself be a JSON array, then I would get back a number, but it is wrong.
> {code:java}
> scala> Seq("""[{"a":[1,2,3,4,5]},{"b":"B"}]""","""[{"a":[1,2,3,4,5]},{"a":[1,2,3,4,5]}]""").toDF("jsonStr").selectExpr("""json_array_length(get_json_object(jsonStr,"$[*].a"))""").show(false)
> +----------------------------------------------------+
> |json_array_length(get_json_object(jsonStr, $[*].a)) |
> +----------------------------------------------------+
> |5                                                    |
> |2                                                    |
> +----------------------------------------------------+
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
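The flattening rule the report describes can be modeled in isolation. The sketch below is a hypothetical standalone illustration (the real logic lives in jsonExpressions.scala): a wildcard query that matches exactly one value is emitted bare, while multiple matches are wrapped in a JSON array, which is what makes the output schema inconsistent.

```java
import java.util.List;

public class WildcardFlatten {
    // Hypothetical model of the Spark/Hive rule for [*] matches:
    // one match is emitted bare, several are wrapped in an array.
    static String emit(List<String> matches) {
        if (matches.size() == 1) {
            return matches.get(0);            // flattened: no enclosing array
        }
        return "[" + String.join(",", matches) + "]";
    }

    public static void main(String[] args) {
        // One match per row vs. two matches per row, as in the report:
        System.out.println(emit(List.of("\"A\"")));          // "A"
        System.out.println(emit(List.of("\"A\"", "\"B\""))); // ["A","B"]
    }
}
```

A consumer of this output cannot tell a flattened single match from a scalar result, which is exactly why `json_array_length` misbehaves above.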
[jira] [Commented] (SPARK-47063) CAST long to timestamp has different behavior for codegen vs interpreted
[ https://issues.apache.org/jira/browse/SPARK-47063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17821273#comment-17821273 ]

Pablo Langa Blanco commented on SPARK-47063:

OK, personally I would prefer the truncation too. I'll make a PR with it and we can discuss it there.

> CAST long to timestamp has different behavior for codegen vs interpreted
> ------------------------------------------------------------------------
>
>                 Key: SPARK-47063
>                 URL: https://issues.apache.org/jira/browse/SPARK-47063
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.2
>            Reporter: Robert Joseph Evans
>            Priority: Major
>
> It probably impacts a lot more versions of the code than this, but I verified it on 3.4.2. This also appears to be related to https://issues.apache.org/jira/browse/SPARK-39209
> {code:java}
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").selectExpr("*", "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-----------------------------+--------------------+
> |v                   |ts                           |unix_micros(ts)     |
> +--------------------+-----------------------------+--------------------+
> |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
> |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
> |0                   |1970-01-01 00:00:00          |0                   |
> |1990                |1970-01-01 00:33:10          |1990000000          |
> +--------------------+-----------------------------+--------------------+
>
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-------------------+---------------+
> |v                   |ts                 |unix_micros(ts)|
> +--------------------+-------------------+---------------+
> |9223372036854775807 |1969-12-31 23:59:59|-1000000       |
> |-9223372036854775808|1970-01-01 00:00:00|0              |
> |0                   |1970-01-01 00:00:00|0              |
> |1990                |1970-01-01 00:33:10|1990000000     |
> +--------------------+-------------------+---------------+
> {code}
> It looks like InMemoryTableScanExec is not doing code generation for the expressions, but the ProjectExec after the repartition is.
> If I disable code gen I get the same answer in both cases.
> {code:java}
> scala> spark.conf.set("spark.sql.codegen.wholeStage", false)
> scala> spark.conf.set("spark.sql.codegen.factoryMode", "NO_CODEGEN")
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").selectExpr("*", "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-----------------------------+--------------------+
> |v                   |ts                           |unix_micros(ts)     |
> +--------------------+-----------------------------+--------------------+
> |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
> |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
> |0                   |1970-01-01 00:00:00          |0                   |
> |1990                |1970-01-01 00:33:10          |1990000000          |
> +--------------------+-----------------------------+--------------------+
>
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-----------------------------+--------------------+
> |v                   |ts                           |unix_micros(ts)     |
> +--------------------+-----------------------------+--------------------+
> |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
> |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
> |0                   |1970-01-01 00:00:00          |0                   |
> |1990                |1970-01-01 00:33:10          |1990000000          |
> +--------------------+-----------------------------+--------------------+
> {code}
> [https://github.com/apache/spark/blob/e2cd71a4cd54bbdf5af76d3edfbb2fc8c1b067b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L1627] is the code used in codegen, but [https://github.com/apache/spark/blob/e2cd71a4cd54bbdf5af76d3edfbb2fc8c1b067b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L687] is what is used outside of codegen.
> Apparently `SECONDS.toMicros` truncates the value on an overflow, but the codegen does not.
> {code:java}
> scala> Long.MaxValue
> res11: Long = 9223372036854775807
> scala>
[jira] [Commented] (SPARK-47063) CAST long to timestamp has different behavior for codegen vs interpreted
[ https://issues.apache.org/jira/browse/SPARK-47063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17820853#comment-17820853 ]

Pablo Langa Blanco commented on SPARK-47063:

[~revans2] are you working on the fix?

> CAST long to timestamp has different behavior for codegen vs interpreted
> ------------------------------------------------------------------------
>
>                 Key: SPARK-47063
>                 URL: https://issues.apache.org/jira/browse/SPARK-47063
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.2
>            Reporter: Robert Joseph Evans
>            Priority: Major
>
> It probably impacts a lot more versions of the code than this, but I verified it on 3.4.2. This also appears to be related to https://issues.apache.org/jira/browse/SPARK-39209
> {code:java}
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").selectExpr("*", "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-----------------------------+--------------------+
> |v                   |ts                           |unix_micros(ts)     |
> +--------------------+-----------------------------+--------------------+
> |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
> |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
> |0                   |1970-01-01 00:00:00          |0                   |
> |1990                |1970-01-01 00:33:10          |1990000000          |
> +--------------------+-----------------------------+--------------------+
>
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-------------------+---------------+
> |v                   |ts                 |unix_micros(ts)|
> +--------------------+-------------------+---------------+
> |9223372036854775807 |1969-12-31 23:59:59|-1000000       |
> |-9223372036854775808|1970-01-01 00:00:00|0              |
> |0                   |1970-01-01 00:00:00|0              |
> |1990                |1970-01-01 00:33:10|1990000000     |
> +--------------------+-------------------+---------------+
> {code}
> It looks like InMemoryTableScanExec is not doing code generation for the expressions, but the ProjectExec after the repartition is.
> If I disable code gen I get the same answer in both cases.
> {code:java}
> scala> spark.conf.set("spark.sql.codegen.wholeStage", false)
> scala> spark.conf.set("spark.sql.codegen.factoryMode", "NO_CODEGEN")
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").selectExpr("*", "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-----------------------------+--------------------+
> |v                   |ts                           |unix_micros(ts)     |
> +--------------------+-----------------------------+--------------------+
> |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
> |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
> |0                   |1970-01-01 00:00:00          |0                   |
> |1990                |1970-01-01 00:33:10          |1990000000          |
> +--------------------+-----------------------------+--------------------+
>
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-----------------------------+--------------------+
> |v                   |ts                           |unix_micros(ts)     |
> +--------------------+-----------------------------+--------------------+
> |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
> |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
> |0                   |1970-01-01 00:00:00          |0                   |
> |1990                |1970-01-01 00:33:10          |1990000000          |
> +--------------------+-----------------------------+--------------------+
> {code}
> [https://github.com/apache/spark/blob/e2cd71a4cd54bbdf5af76d3edfbb2fc8c1b067b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L1627] is the code used in codegen, but [https://github.com/apache/spark/blob/e2cd71a4cd54bbdf5af76d3edfbb2fc8c1b067b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L687] is what is used outside of codegen.
> Apparently `SECONDS.toMicros` truncates the value on an overflow, but the codegen does not.
> {code:java}
> scala> Long.MaxValue
> res11: Long = 9223372036854775807
> scala> java.util.concurrent.TimeUnit.SECONDS.toMicros(Long.MaxValue)
> res12: Long = 9223372036854775807
> scala> Long.MaxValue
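The divergence between the two paths can be reproduced in plain Java, outside Spark: `TimeUnit.SECONDS.toMicros` saturates at `Long.MAX_VALUE`/`Long.MIN_VALUE` on overflow, while a raw multiplication by 1,000,000 (assumed here to stand in for what the generated code effectively computes) wraps around:

```java
import java.util.concurrent.TimeUnit;

public class CastOverflowDemo {
    public static void main(String[] args) {
        long seconds = Long.MAX_VALUE;

        // Interpreted path: TimeUnit clamps (saturates) on overflow.
        long interpreted = TimeUnit.SECONDS.toMicros(seconds);
        System.out.println(interpreted);   // 9223372036854775807

        // Codegen-style path: a plain multiply wraps around on overflow.
        long codegen = seconds * 1_000_000L;
        System.out.println(codegen);       // -1000000
    }
}
```

The wrapped value, -1,000,000 microseconds, is exactly the `1969-12-31 23:59:59` row seen in the codegen output above.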
[jira] [Created] (SPARK-47123) JDBCRDD does not correctly handle errors in getQueryOutputSchema
Pablo Langa Blanco created SPARK-47123:
------------------------------------------

             Summary: JDBCRDD does not correctly handle errors in getQueryOutputSchema
                 Key: SPARK-47123
                 URL: https://issues.apache.org/jira/browse/SPARK-47123
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.5.0, 4.0.0
            Reporter: Pablo Langa Blanco

If there is an error executing statement.executeQuery(), it's possible that another error in one of the finally blocks makes us not see the main error.

{code:java}
  def getQueryOutputSchema(
      query: String, options: JDBCOptions, dialect: JdbcDialect): StructType = {
    val conn: Connection = dialect.createConnectionFactory(options)(-1)
    try {
      val statement = conn.prepareStatement(query)
      try {
        statement.setQueryTimeout(options.queryTimeout)
        val rs = statement.executeQuery()
        try {
          JdbcUtils.getSchema(rs, dialect, alwaysNullable = true,
            isTimestampNTZ = options.preferTimestampNTZ)
        } finally {
          rs.close()
        }
      } finally {
        statement.close()
      }
    } finally {
      conn.close()
    }
  }
{code}
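The masking the report describes is a generic JVM pitfall, not specific to JDBCRDD: an exception thrown from a `finally` block replaces the in-flight exception. A minimal illustration (hypothetical messages, not Spark code), including the try-with-resources pattern that keeps the primary error and attaches the close failure as a suppressed exception:

```java
public class MaskedErrorDemo {
    // An exception thrown in finally replaces the original one.
    static void closeInFinally() {
        try {
            throw new RuntimeException("executeQuery failed"); // the real error
        } finally {
            throw new IllegalStateException("close failed");   // masks it
        }
    }

    // try-with-resources keeps the primary error and records the close
    // failure via Throwable.addSuppressed.
    static void closeWithResources() throws Exception {
        AutoCloseable rs = () -> { throw new IllegalStateException("close failed"); };
        try (AutoCloseable r = rs) {
            throw new RuntimeException("executeQuery failed");
        }
    }

    public static void main(String[] args) {
        try { closeInFinally(); } catch (Exception e) {
            System.out.println(e.getMessage());                   // close failed
        }
        try { closeWithResources(); } catch (Exception e) {
            System.out.println(e.getMessage());                   // executeQuery failed
            System.out.println(e.getSuppressed()[0].getMessage()); // close failed
        }
    }
}
```

In the nested `finally` chain of `getQueryOutputSchema`, any of `rs.close()`, `statement.close()`, or `conn.close()` throwing would hide the original `executeQuery` error in the same way.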
[jira] [Commented] (SPARK-47123) JDBCRDD does not correctly handle errors in getQueryOutputSchema
[ https://issues.apache.org/jira/browse/SPARK-47123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819413#comment-17819413 ]

Pablo Langa Blanco commented on SPARK-47123:

I'm working on it

> JDBCRDD does not correctly handle errors in getQueryOutputSchema
> ----------------------------------------------------------------
>
>                 Key: SPARK-47123
>                 URL: https://issues.apache.org/jira/browse/SPARK-47123
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.5.0, 4.0.0
>            Reporter: Pablo Langa Blanco
>            Priority: Minor
>
> If there is an error executing statement.executeQuery(), it's possible that another error in one of the finally blocks makes us not see the main error.
> {code:java}
>   def getQueryOutputSchema(
>       query: String, options: JDBCOptions, dialect: JdbcDialect): StructType = {
>     val conn: Connection = dialect.createConnectionFactory(options)(-1)
>     try {
>       val statement = conn.prepareStatement(query)
>       try {
>         statement.setQueryTimeout(options.queryTimeout)
>         val rs = statement.executeQuery()
>         try {
>           JdbcUtils.getSchema(rs, dialect, alwaysNullable = true,
>             isTimestampNTZ = options.preferTimestampNTZ)
>         } finally {
>           rs.close()
>         }
>       } finally {
>         statement.close()
>       }
>     } finally {
>       conn.close()
>     }
>   } {code}
[jira] [Commented] (SPARK-45039) Include full identifier in Storage tab
[ https://issues.apache.org/jira/browse/SPARK-45039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17761295#comment-17761295 ]

Pablo Langa Blanco commented on SPARK-45039:

https://github.com/apache/spark/pull/42759

> Include full identifier in Storage tab
> --------------------------------------
>
>                 Key: SPARK-45039
>                 URL: https://issues.apache.org/jira/browse/SPARK-45039
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL, Web UI
>    Affects Versions: 4.0.0
>            Reporter: Pablo Langa Blanco
>            Priority: Minor
>         Attachments: image-2023-08-31-18-38-37-856.png
>
> If you have 2 databases with tables with the same name, like db1.table and db2.table:
> {code:java}
> scala> spark.sql("use db1")
> scala> spark.sql("cache table table")
>
> scala> spark.sql("use db2")
> scala> spark.sql("cache table table"){code}
> You cannot differentiate them in the Storage tab
> !image-2023-08-31-18-38-37-856.png!
[jira] [Commented] (SPARK-45039) Include full identifier in Storage tab
[ https://issues.apache.org/jira/browse/SPARK-45039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17761108#comment-17761108 ]

Pablo Langa Blanco commented on SPARK-45039:

I'm working on it

> Include full identifier in Storage tab
> --------------------------------------
>
>                 Key: SPARK-45039
>                 URL: https://issues.apache.org/jira/browse/SPARK-45039
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL, Web UI
>    Affects Versions: 4.0.0
>            Reporter: Pablo Langa Blanco
>            Priority: Minor
>         Attachments: image-2023-08-31-18-38-37-856.png
>
> If you have 2 databases with tables with the same name, like db1.table and db2.table:
> {code:java}
> scala> spark.sql("use db1")
> scala> spark.sql("cache table table")
>
> scala> spark.sql("use db2")
> scala> spark.sql("cache table table"){code}
> You cannot differentiate them in the Storage tab
> !image-2023-08-31-18-38-37-856.png!
[jira] [Updated] (SPARK-45039) Include full identifier in Storage tab
[ https://issues.apache.org/jira/browse/SPARK-45039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pablo Langa Blanco updated SPARK-45039:
---------------------------------------
    Description:
If you have 2 databases with tables with the same name, like db1.table and db2.table:
{code:java}
scala> spark.sql("use db1")
scala> spark.sql("cache table table")

scala> spark.sql("use db2")
scala> spark.sql("cache table table"){code}
You cannot differentiate them in the Storage tab
!image-2023-08-31-18-38-37-856.png!

  was:
If you have 2 databases with tables with the same name like: db1.table, db2.table
{code:java}
scala> spark.sql("use db1")
scala> spark.sql("cache table table")

scala> spark.sql("use db2")
scala> spark.sql("cache table table"){code}
You cannot differentiate it in the Storage tab
!image-2023-08-31-18-37-48-330.png!

> Include full identifier in Storage tab
> --------------------------------------
>
>                 Key: SPARK-45039
>                 URL: https://issues.apache.org/jira/browse/SPARK-45039
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL, Web UI
>    Affects Versions: 4.0.0
>            Reporter: Pablo Langa Blanco
>            Priority: Minor
>         Attachments: image-2023-08-31-18-38-37-856.png
>
> If you have 2 databases with tables with the same name, like db1.table and db2.table:
> {code:java}
> scala> spark.sql("use db1")
> scala> spark.sql("cache table table")
>
> scala> spark.sql("use db2")
> scala> spark.sql("cache table table"){code}
> You cannot differentiate them in the Storage tab
> !image-2023-08-31-18-38-37-856.png!
[jira] [Updated] (SPARK-45039) Include full identifier in Storage tab
[ https://issues.apache.org/jira/browse/SPARK-45039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pablo Langa Blanco updated SPARK-45039:
---------------------------------------
    Attachment: image-2023-08-31-18-38-37-856.png

> Include full identifier in Storage tab
> --------------------------------------
>
>                 Key: SPARK-45039
>                 URL: https://issues.apache.org/jira/browse/SPARK-45039
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL, Web UI
>    Affects Versions: 4.0.0
>            Reporter: Pablo Langa Blanco
>            Priority: Minor
>         Attachments: image-2023-08-31-18-38-37-856.png
>
> If you have 2 databases with tables with the same name, like db1.table and db2.table:
> {code:java}
> scala> spark.sql("use db1")
> scala> spark.sql("cache table table")
>
> scala> spark.sql("use db2")
> scala> spark.sql("cache table table"){code}
> You cannot differentiate them in the Storage tab
> !image-2023-08-31-18-37-48-330.png!
[jira] [Created] (SPARK-45039) Include full identifier in Storage tab
Pablo Langa Blanco created SPARK-45039:
------------------------------------------

             Summary: Include full identifier in Storage tab
                 Key: SPARK-45039
                 URL: https://issues.apache.org/jira/browse/SPARK-45039
             Project: Spark
          Issue Type: Improvement
          Components: SQL, Web UI
    Affects Versions: 4.0.0
            Reporter: Pablo Langa Blanco
         Attachments: image-2023-08-31-18-38-37-856.png

If you have 2 databases with tables with the same name, like db1.table and db2.table:
{code:java}
scala> spark.sql("use db1")
scala> spark.sql("cache table table")

scala> spark.sql("use db2")
scala> spark.sql("cache table table"){code}
You cannot differentiate them in the Storage tab
!image-2023-08-31-18-37-48-330.png!
[jira] [Commented] (SPARK-44500) parse_url treats key as regular expression
[ https://issues.apache.org/jira/browse/SPARK-44500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746705#comment-17746705 ]

Pablo Langa Blanco commented on SPARK-44500:

[~jan.chou...@gmail.com] What do you think?

> parse_url treats key as regular expression
> ------------------------------------------
>
>                 Key: SPARK-44500
>                 URL: https://issues.apache.org/jira/browse/SPARK-44500
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.0, 3.3.0, 3.4.0, 3.4.1
>            Reporter: Robert Joseph Evans
>            Priority: Major
>
> To be clear I am not 100% sure that this is a bug. It might be a feature, but I don't see anywhere that it is used as a feature. If it is a feature it really should be documented, because there are pitfalls. If it is a bug it should be fixed, because it is really confusing and it is simple to shoot yourself in the foot.
> ```scala
> > val urls = Seq("http://foo/bar?abc=BAD&a.c=GOOD", "http://foo/bar?a.c=GOOD&abc=BAD").toDF
> > urls.selectExpr("parse_url(value, 'QUERY', 'a.c')").show(false)
> +----------------------------+
> |parse_url(value, QUERY, a.c)|
> +----------------------------+
> |BAD                         |
> |GOOD                        |
> +----------------------------+
> > urls.selectExpr("parse_url(value, 'QUERY', 'a[c')").show(false)
> java.util.regex.PatternSyntaxException: Unclosed character class near index 15
> (&|^)a[c=([^&]*)
>                ^
>   at java.util.regex.Pattern.error(Pattern.java:1969)
>   at java.util.regex.Pattern.clazz(Pattern.java:2562)
>   at java.util.regex.Pattern.sequence(Pattern.java:2077)
>   at java.util.regex.Pattern.expr(Pattern.java:2010)
>   at java.util.regex.Pattern.compile(Pattern.java:1702)
>   at java.util.regex.Pattern.(Pattern.java:1352)
>   at java.util.regex.Pattern.compile(Pattern.java:1028)
> ```
> The simple fix is to quote the key when making the pattern.
> ```scala
> private def getPattern(key: UTF8String): Pattern = {
>   Pattern.compile(REGEXPREFIX + Pattern.quote(key.toString) + REGEXSUBFIX)
> }
> ```
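The pitfall and the proposed fix can be shown in isolation. In the sketch below the `PREFIX`/`SUFFIX` constants are assumptions reconstructed from the error message in the report (`(&|^)key=([^&]*)`), not the actual Spark constants:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParseUrlKeyDemo {
    // Assumed from the stack trace above: the key is concatenated
    // between these two fragments and compiled as a regex.
    static final String PREFIX = "(&|^)";
    static final String SUFFIX = "=([^&]*)";

    // Current behavior: the key is treated as a regex, so "." matches any char.
    static String lookupRaw(String query, String key) {
        Matcher m = Pattern.compile(PREFIX + key + SUFFIX).matcher(query);
        return m.find() ? m.group(2) : null;
    }

    // Proposed fix: Pattern.quote makes the key match only literally.
    static String lookupQuoted(String query, String key) {
        Matcher m = Pattern.compile(PREFIX + Pattern.quote(key) + SUFFIX).matcher(query);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String query = "abc=BAD&a.c=GOOD";
        System.out.println(lookupRaw(query, "a.c"));       // BAD  ("a.c" matched "abc")
        System.out.println(lookupQuoted(query, "a.c"));    // GOOD
        // A key like "a[c" no longer throws PatternSyntaxException once quoted:
        System.out.println(lookupQuoted("a[c=OK", "a[c")); // OK
    }
}
```

Quoting also fixes the crash case: without `Pattern.quote`, compiling `(&|^)a[c=([^&]*)` throws the `PatternSyntaxException` shown in the report.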
[jira] [Commented] (SPARK-40679) add named_struct to spark scala api for parity to spark sql
[ https://issues.apache.org/jira/browse/SPARK-40679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17694757#comment-17694757 ]

Pablo Langa Blanco commented on SPARK-40679:

I think the original development supports the creation of NamedStructs in the DataFrame interface. [https://github.com/apache/spark/pull/6874/files]
{code:java}
/**
 * Creates a new struct column.
 * If the input column is a column in a [[DataFrame]], or a derived column expression
 * that is named (i.e. aliased), its name would be remained as the StructField's name,
 * otherwise, the newly generated StructField's name would be auto generated as col${index + 1},
 * i.e. col1, col2, col3, ...
 */
def struct(cols: Column*): Column = {
...{code}
{code:java}
test("struct with column expression to be automatically named") {
  val df = Seq((1, "str")).toDF("a", "b")
  val result = df.select(struct((col("a") * 2), col("b")))
  val expectedType = StructType(Seq(
    StructField("col1", IntegerType, nullable = false),
    StructField("b", StringType)
  ))
  assert(result.first.schema(0).dataType === expectedType)
  checkAnswer(result, Row(Row(2, "str")))
} {code}

> add named_struct to spark scala api for parity to spark sql
> -----------------------------------------------------------
>
>                 Key: SPARK-40679
>                 URL: https://issues.apache.org/jira/browse/SPARK-40679
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, SQL
>    Affects Versions: 3.3.0
>            Reporter: Douglas Moore
>            Priority: Major
>
> To facilitate migration from cast(struct( to `named_struct` where comparison of structures is needed.
> To maintain parity between APIs. Understand that using `expr` would work, but it wouldn't be typed.
[jira] [Commented] (SPARK-42407) `with as` executed again
[ https://issues.apache.org/jira/browse/SPARK-42407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693723#comment-17693723 ]

Pablo Langa Blanco commented on SPARK-42407:

In my opinion, the "WITH AS" syntax is intended to simplify SQL queries, not to act at the execution level. To get what you want, you can use "CACHE TABLE" combined with "WITH AS".

> `with as` executed again
> ------------------------
>
>                 Key: SPARK-42407
>                 URL: https://issues.apache.org/jira/browse/SPARK-42407
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.3
>            Reporter: yiku123
>            Priority: Major
>
> When 'with as' is used multiple times, it is executed again each time; the results of 'with as' are not saved, resulting in low efficiency.
> Will you consider improving the behavior of 'with as'?
[jira] [Comment Edited] (SPARK-40525) Floating-point value with an INT/BYTE/SHORT/LONG type errors out in DataFrame but evaluates to a rounded value in SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-40525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693721#comment-17693721 ]

Pablo Langa Blanco edited comment on SPARK-40525 at 2/26/23 10:57 PM:
----------------------------------------------------------------------

Hi [~x/sys],

When you are working with the Spark SQL interface you can configure this behavior; there are 3 policies for type coercion rules ([https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html)]

If you set spark.sql.storeAssignmentPolicy to "strict", you get the behavior you expect, but that is not the default policy.

I hope this helps.

was (Author: planga82):
Hi [~x/sys] , When you are working with Spark Sql interface you can configure the behavior and you have 3 policies for type coercion rules. ([https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html)] If you set "strict" in spark.sql.storeAssignmentPolicy it's going to happen what you expected, but it's not the policy by default. I hope it help you.

> Floating-point value with an INT/BYTE/SHORT/LONG type errors out in DataFrame but evaluates to a rounded value in SparkSQL
> --------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-40525
>                 URL: https://issues.apache.org/jira/browse/SPARK-40525
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.1
>            Reporter: xsys
>            Priority: Major
>
> h3. Describe the bug
> Storing an invalid INT value {{1.1}} using DataFrames via {{spark-shell}} errors out as expected. However, it is evaluated to a rounded value {{1}} if the value is inserted into the table via {{spark-sql}}.
> h3. Steps to reproduce:
> On Spark 3.2.1 (commit {{4f25b3f712}}), using {{spark-sql}}:
> {code:java}
> $SPARK_HOME/bin/spark-sql {code}
> Execute the following:
> {code:java}
> spark-sql> create table int_floating_point_vals(c1 INT) stored as ORC;
> 22/09/19 16:49:11 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
> Time taken: 0.216 seconds
> spark-sql> insert into int_floating_point_vals select 1.1;
> Time taken: 1.747 seconds
> spark-sql> select * from int_floating_point_vals;
> 1
> Time taken: 0.518 seconds, Fetched 1 row(s){code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{spark-sql}} & {{spark-shell}}) to behave consistently for the same data type & input combination ({{INT}} and {{1.1}}).
> h4. Here is a simplified example in {{spark-shell}}, where insertion of the aforementioned value correctly raises an exception:
> On Spark 3.2.1 (commit {{4f25b3f712}}), using {{spark-shell}}:
> {code:java}
> $SPARK_HOME/bin/spark-shell{code}
> Execute the following:
> {code:java}
> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.types._
> val rdd = sc.parallelize(Seq(Row(1.1)))
> val schema = new StructType().add(StructField("c1", IntegerType, true))
> val df = spark.createDataFrame(rdd, schema)
> df.write.mode("overwrite").format("orc").saveAsTable("int_floating_point_vals")
> {code}
> The following exception is raised:
> {code:java}
> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.lang.Double is not a valid external type for schema of int{code}
[jira] [Commented] (SPARK-40525) Floating-point value with an INT/BYTE/SHORT/LONG type errors out in DataFrame but evaluates to a rounded value in SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-40525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693721#comment-17693721 ]

Pablo Langa Blanco commented on SPARK-40525:

Hi [~x/sys],

When you are working with the Spark SQL interface you can configure this behavior; there are 3 policies for type coercion rules ([https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html)]

If you set spark.sql.storeAssignmentPolicy to "strict", you get the behavior you expect, but that is not the default policy.

I hope this helps.

> Floating-point value with an INT/BYTE/SHORT/LONG type errors out in DataFrame but evaluates to a rounded value in SparkSQL
> --------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-40525
>                 URL: https://issues.apache.org/jira/browse/SPARK-40525
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.1
>            Reporter: xsys
>            Priority: Major
>
> h3. Describe the bug
> Storing an invalid INT value {{1.1}} using DataFrames via {{spark-shell}} errors out as expected. However, it is evaluated to a rounded value {{1}} if the value is inserted into the table via {{spark-sql}}.
> h3. Steps to reproduce:
> On Spark 3.2.1 (commit {{4f25b3f712}}), using {{spark-sql}}:
> {code:java}
> $SPARK_HOME/bin/spark-sql {code}
> Execute the following:
> {code:java}
> spark-sql> create table int_floating_point_vals(c1 INT) stored as ORC;
> 22/09/19 16:49:11 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
> Time taken: 0.216 seconds
> spark-sql> insert into int_floating_point_vals select 1.1;
> Time taken: 1.747 seconds
> spark-sql> select * from int_floating_point_vals;
> 1
> Time taken: 0.518 seconds, Fetched 1 row(s){code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{spark-sql}} & {{spark-shell}}) to behave consistently for the same data type & input combination ({{INT}} and {{1.1}}).
> h4. Here is a simplified example in {{spark-shell}}, where insertion of the aforementioned value correctly raises an exception:
> On Spark 3.2.1 (commit {{4f25b3f712}}), using {{spark-shell}}:
> {code:java}
> $SPARK_HOME/bin/spark-shell{code}
> Execute the following:
> {code:java}
> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.types._
> val rdd = sc.parallelize(Seq(Row(1.1)))
> val schema = new StructType().add(StructField("c1", IntegerType, true))
> val df = spark.createDataFrame(rdd, schema)
> df.write.mode("overwrite").format("orc").saveAsTable("int_floating_point_vals")
> {code}
> The following exception is raised:
> {code:java}
> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.lang.Double is not a valid external type for schema of int{code}
[jira] [Commented] (SPARK-42262) Table schema changes via V2SessionCatalog with HiveExternalCatalog
[ https://issues.apache.org/jira/browse/SPARK-42262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17682727#comment-17682727 ]

Pablo Langa Blanco commented on SPARK-42262:

I'm working on this

> Table schema changes via V2SessionCatalog with HiveExternalCatalog
> ------------------------------------------------------------------
>
>                 Key: SPARK-42262
>                 URL: https://issues.apache.org/jira/browse/SPARK-42262
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: Pablo Langa Blanco
>            Priority: Major
>
> When configuring HiveExternalCatalog there are certain schema change operations in V2SessionCatalog that are not performed due to the use of the alterTable method instead of alterTableDataSchema.
[jira] [Created] (SPARK-42262) Table schema changes via V2SessionCatalog with HiveExternalCatalog
Pablo Langa Blanco created SPARK-42262:
------------------------------------------

             Summary: Table schema changes via V2SessionCatalog with HiveExternalCatalog
                 Key: SPARK-42262
                 URL: https://issues.apache.org/jira/browse/SPARK-42262
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.5.0
            Reporter: Pablo Langa Blanco

When configuring HiveExternalCatalog there are certain schema change operations in V2SessionCatalog that are not performed due to the use of the alterTable method instead of alterTableDataSchema.
[jira] [Commented] (SPARK-41344) Reading V2 datasource masks underlying error
[ https://issues.apache.org/jira/browse/SPARK-41344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644061#comment-17644061 ] Pablo Langa Blanco commented on SPARK-41344: In this case the provider has been detected as DataSourceV2 and also implements SupportsCatalogOptions, so if it fails at that point, it does not make sense to fall back to DataSource V1. The CatalogV2Util.loadTable function catches NoSuchTableException, NoSuchDatabaseException and NoSuchNamespaceException to return an Option, which makes sense in the other places where it is used, but not at this point. Maybe the best solution is to add another function for this call site that does not catch those exceptions and does not return an Option. > Reading V2 datasource masks underlying error > > > Key: SPARK-41344 > URL: https://issues.apache.org/jira/browse/SPARK-41344 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.3.1, 3.4.0 >Reporter: Kevin Cheung >Priority: Critical > Attachments: image-2022-12-03-09-24-43-285.png > > > In Spark 3.3, > # DataSourceV2Utils, the loadV2Source calls: > (CatalogV2Util.loadTable(catalog, ident, timeTravel).get, > Some(catalog), Some(ident)). > # CatalogV2Util.scala, when it tries to *loadTable(x,x,x)* and it fails with > any of these exceptions NoSuchTableException, NoSuchDatabaseException, > NoSuchNamespaceException, it returns None. > # Coming back to DataSourceV2Utils, None was previously returned, and calling > None.get results in a cryptic error that is technically "correct", but the *original > exceptions NoSuchTableException, NoSuchDatabaseException, > NoSuchNamespaceException are thrown away.* > > *Ask:* > Retain the original error and propagate it to the user. Prior to Spark 3.3, > the *original error* was shown; this seems like a design flaw.
> > *Sample user facing error:* > None.get > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:529) > at scala.None$.get(Option.scala:527) > at > org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.loadV2Source(DataSourceV2Utils.scala:129) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:209) > at scala.Option.flatMap(Option.scala:271) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:207) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:171) > > *DataSourceV2Utils.scala - CatalogV2Util.loadTable(x,x,x).get* > [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Utils.scala#L137] > *CatalogV2Util.scala - Option(catalog.asTableCatalog.loadTable(ident))* > [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala#L341] > *CatalogV2Util.scala - catching the exceptions and returning None* > [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala#L344]
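The masking mechanism described above can be sketched in a few lines of illustrative Python (not Spark's actual API; all names here are invented): a loader that catches the specific exception and returns None forces the caller's unconditional access to fail with an unrelated, cryptic error, while a non-catching variant lets the original exception propagate.

```python
# Illustrative analog of the bug: swallowing a specific exception and
# returning None trades the real error for an unrelated one at the call site.
class NoSuchTableException(Exception):
    pass

CATALOG = {"sales": "sales_table"}

def load_table(name):
    if name not in CATALOG:
        raise NoSuchTableException(f"Table or view '{name}' not found")
    return CATALOG[name]

def load_table_optional(name):
    # like CatalogV2Util.loadTable: catch the specific error, return None
    try:
        return load_table(name)
    except NoSuchTableException:
        return None

def load_source_masked(name):
    # like loadV2Source calling .get on the Option: for a missing table this
    # fails with an error that says nothing about the table being missing
    return load_table_optional(name).upper()

def load_source_fixed(name):
    # the proposed variant: no catch, the original exception propagates
    return load_table(name).upper()
```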
[jira] [Resolved] (SPARK-39623) partitioning by datestamp leads to wrong query on backend?
[ https://issues.apache.org/jira/browse/SPARK-39623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pablo Langa Blanco resolved SPARK-39623. Resolution: Not A Problem > partitioning by datestamp leads to wrong query on backend? > - > > Key: SPARK-39623 > URL: https://issues.apache.org/jira/browse/SPARK-39623 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Dmitry >Priority: Major > > Hello, > I am new to Apache Spark, so please bear with me. I would like to report what > seems to me a bug, but maybe I am just not understanding something. > My goal is to run data analysis on a spark cluster. Data is stored in > PostgreSQL DB. Tables contain timestamped entries (timestamp with time > zone). > The code looks like: > {code:python} > from pyspark.sql import SparkSession > spark = SparkSession \ > .builder \ > .appName("foo") \ > .config("spark.jars", "/opt/postgresql-42.4.0.jar") \ > .getOrCreate() > df = spark.read \ > .format("jdbc") \ > .option("url", "jdbc:postgresql://example.org:5432/postgres") \ > .option("dbtable", "billing") \ > .option("user", "user") \ > .option("driver", "org.postgresql.Driver") \ > .option("numPartitions", "4") \ > .option("partitionColumn", "datestamp") \ > .option("lowerBound", "2022-01-01 00:00:00") \ > .option("upperBound", "2022-06-26 23:59:59") \ > .option("fetchsize", 100) \ > .load() > t0 = time.time() > print("Number of entries is => ", df.count(), " Time to execute ", > time.time()-t0) > ... > {code} > datestamp is timestamp with time zone. > I see this query on the DB backend: > {code:java} > SELECT 1 FROM billinginfo WHERE "datestamp" < '2022-01-02 11:59:59.9375' or > "datestamp" is null > {code} > The table is huge and entries go way back before > 2022-01-02 11:59:59. So what ends up happening is that all workers but one > complete, and the remaining one continues to process that query, which, to me, > looks like it wants to get all the data before 2022-01-02 11:59:59. 
Which is > not what I intended. > I remedied this by changing to: > {code:python} > .option("dbtable", "(select * from billinginfo where datestamp > '2022-01-01 00:00:00') as foo") \ > {code} > That seems to have solved the issue, but it feels kludgy. Am I doing > something wrong, or is there a bug in the way partitioning queries are > generated?
[jira] [Commented] (SPARK-39915) Dataset.repartition(N) may not create N partitions
[ https://issues.apache.org/jira/browse/SPARK-39915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17576282#comment-17576282 ] Pablo Langa Blanco commented on SPARK-39915: Hi [~zsxwing] , I can't reproduce it, do you have a typo in range? {code:java} scala> spark.range(0, 10).repartition(5).rdd.getNumPartitions res53: Int = 5{code} > Dataset.repartition(N) may not create N partitions > -- > > Key: SPARK-39915 > URL: https://issues.apache.org/jira/browse/SPARK-39915 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Shixiong Zhu >Priority: Major > > Looks like there is a behavior change in Dataset.repartition in 3.3.0. For > example, `spark.range(10, 0).repartition(5).rdd.getNumPartitions` returns 5 > in Spark 3.2.0, but 0 in Spark 3.3.0.
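For reference, {{spark.range(start, end)}} counts from start up to (but excluding) end with a default step of 1, much like Python's built-in range, so the quoted {{spark.range(10, 0)}} describes an empty dataset. A quick plain-Python illustration of why a typo seems the likely explanation:

```python
# spark.range(start, end) is half-open with a default step of 1, like
# Python's range; start=10, end=0 therefore yields no rows at all.
assert list(range(10, 0)) == []        # empty: start > end with step 1
assert len(list(range(0, 10))) == 10   # the presumably intended call
```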
[jira] [Comment Edited] (SPARK-39796) Add a regexp_extract variant which returns an array of all the matched capture groups
[ https://issues.apache.org/jira/browse/SPARK-39796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17576281#comment-17576281 ] Pablo Langa Blanco edited comment on SPARK-39796 at 8/6/22 9:17 PM: Hi [~augustine_theodore] , Is this what you are looking for? {code:java} scala> val regex = "([A-Za-z]+), [A-Za-z]+, (\\d+)" regex: String = ([A-Za-z]+), [A-Za-z]+, (\d+) scala> val df = Seq("Hello, World, 1234", "Good, bye, friend").toDF("a") df: org.apache.spark.sql.DataFrame = [a: string] scala> df.withColumn("g1", regexp_extract('a, regex, 1)).withColumn("g2", regexp_extract('a, regex, 2)).show +------------------+-----+----+ | a| g1| g2| +------------------+-----+----+ |Hello, World, 1234|Hello|1234| | Good, bye, friend| | | +------------------+-----+----+{code} was (Author: planga82): Hi [~augustine_theodore] , Is this what you are looking for? {code:java} scala> val regex = "([A-Za-z]+), [A-Za-z]+, (\\d+)" regex: String = ([A-Za-z]+), [A-Za-z]+, (\d+) scala> val df = Seq("Hello, World, 1234", "Good, bye, friend").toDF("a") df: org.apache.spark.sql.DataFrame = [a: string] scala> df.withColumn("g1", regexp_extract('a, "([A-Za-z]+), [A-Za-z]+, (\\d+)", 1)).withColumn("g2", regexp_extract('a, regex, 2)).show +------------------+-----+----+ | a| g1| g2| +------------------+-----+----+ |Hello, World, 1234|Hello|1234| | Good, bye, friend| | | +------------------+-----+----+{code} > Add a regexp_extract variant which returns an array of all the matched > capture groups > - > > Key: SPARK-39796 > URL: https://issues.apache.org/jira/browse/SPARK-39796 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.2 >Reporter: Augustine Theodore Prince >Priority: Minor > Labels: regexp_extract, regexp_extract_all, regexp_replace > > > regexp_extract only returns a single matched group. In a lot of cases we need > to parse the entire string and get all the groups, and for that we'll need to > call it as many times as there are groups. The regexp_extract_all function > doesn't solve this problem, as it only works if all the groups have the same > regex pattern. 
> > _Example:_ > I will provide an example and the current workaround that I use to solve it. > If I have the following dataframe and I would like to match the column 'a' > with this pattern > {code:java} > "([A-Za-z]+), [A-Za-z]+, (\\d+)"{code} > |a| > |Hello, World, 1234| > |Good, bye, friend| > > My expected output is as follows: > |a|extracted_a| > |Hello, World, 1234|[Hello, 1234]| > |Good, bye, friend|[]| > > However, to achieve this I have to take the following approach, which seems > very hackish. > 1. Use regexp_replace to create a temporary string built using the extracted > groups: > {code:java} > df.withColumn("extracted_a" , F.regexp_replace("a", "([A-Za-z]+), [A-Za-z]+, > (\\d+)", "$1_$2")){code} > A side effect of regexp_replace is that if the regex fails to match, the > entire string is returned. > > |a|extracted_a| > |Hello, World, 1234|Hello_1234| > |Good, bye, friend|Good, bye, friend| > 2. So, to achieve the desired result, a check has to be done to prune the > rows that did not match the pattern: > {code:java} > df = df.withColumn("extracted_a" , F.when(F.col("extracted_a")==F.col("a") , > None).otherwise(F.col("extracted_a"))){code} > > to get the following intermediate dataframe, > |a|extracted_a| > |Hello, World, 1234|Hello_1234| > |Good, bye, friend|null| > > 3. Before finally splitting the column 'extracted_a' based on underscores > {code:java} > df = df.withColumn("extracted_a" , F.split("extracted_a" , "[_]")){code} > which yields the desired result: > > > |a|extracted_a| > |Hello, World, 1234|[Hello, 1234]| > |Good, bye, friend|null| >
[jira] [Commented] (SPARK-39796) Add a regexp_extract variant which returns an array of all the matched capture groups
[ https://issues.apache.org/jira/browse/SPARK-39796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17576281#comment-17576281 ] Pablo Langa Blanco commented on SPARK-39796: Hi [~augustine_theodore] , Is this what you are looking for? {code:java} scala> val regex = "([A-Za-z]+), [A-Za-z]+, (\\d+)" regex: String = ([A-Za-z]+), [A-Za-z]+, (\d+) scala> val df = Seq("Hello, World, 1234", "Good, bye, friend").toDF("a") df: org.apache.spark.sql.DataFrame = [a: string] scala> df.withColumn("g1", regexp_extract('a, "([A-Za-z]+), [A-Za-z]+, (\\d+)", 1)).withColumn("g2", regexp_extract('a, regex, 2)).show +------------------+-----+----+ | a| g1| g2| +------------------+-----+----+ |Hello, World, 1234|Hello|1234| | Good, bye, friend| | | +------------------+-----+----+{code} > Add a regexp_extract variant which returns an array of all the matched > capture groups > - > > Key: SPARK-39796 > URL: https://issues.apache.org/jira/browse/SPARK-39796 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.2 >Reporter: Augustine Theodore Prince >Priority: Minor > Labels: regexp_extract, regexp_extract_all, regexp_replace > > > regexp_extract only returns a single matched group. In a lot of cases we need > to parse the entire string and get all the groups, and for that we'll need to > call it as many times as there are groups. The regexp_extract_all function > doesn't solve this problem, as it only works if all the groups have the same > regex pattern. > > _Example:_ > I will provide an example and the current workaround that I use to solve it. > If I have the following dataframe and I would like to match the column 'a' > with this pattern > {code:java} > "([A-Za-z]+), [A-Za-z]+, (\\d+)"{code} > |a| > |Hello, World, 1234| > |Good, bye, friend| > > My expected output is as follows: > |a|extracted_a| > |Hello, World, 1234|[Hello, 1234]| > |Good, bye, friend|[]| > > However, to achieve this I have to take the following approach, which seems > very hackish. > 1. 
Use regexp_replace to create a temporary string built using the extracted > groups: > {code:java} > df.withColumn("extracted_a" , F.regexp_replace("a", "([A-Za-z]+), [A-Za-z]+, > (\\d+)", "$1_$2")){code} > A side effect of regexp_replace is that if the regex fails to match, the > entire string is returned. > > |a|extracted_a| > |Hello, World, 1234|Hello_1234| > |Good, bye, friend|Good, bye, friend| > 2. So, to achieve the desired result, a check has to be done to prune the > rows that did not match the pattern: > {code:java} > df = df.withColumn("extracted_a" , F.when(F.col("extracted_a")==F.col("a") , > None).otherwise(F.col("extracted_a"))){code} > > to get the following intermediate dataframe, > |a|extracted_a| > |Hello, World, 1234|Hello_1234| > |Good, bye, friend|null| > > 3. Before finally splitting the column 'extracted_a' based on underscores > {code:java} > df = df.withColumn("extracted_a" , F.split("extracted_a" , "[_]")){code} > which yields the desired result: > > > |a|extracted_a| > |Hello, World, 1234|[Hello, 1234]| > |Good, bye, friend|null| >
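The behavior the reporter is asking for can be sketched in plain Python with the {{re}} module (illustrative only; {{regexp_extract_all_groups}} is an invented name, not a Spark builtin): a single pass that returns every capture group, or an empty list when the pattern does not match.

```python
import re

# Sketch of the requested variant: return all capture groups in one call,
# with a consistent array-typed result even when nothing matches.
def regexp_extract_all_groups(s, pattern):
    m = re.search(pattern, s)
    return list(m.groups()) if m else []

pat = r"([A-Za-z]+), [A-Za-z]+, (\d+)"
print(regexp_extract_all_groups("Hello, World, 1234", pat))  # ['Hello', '1234']
print(regexp_extract_all_groups("Good, bye, friend", pat))   # []
```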
[jira] [Commented] (SPARK-39426) Subquery star select creates broken plan in case of self join
[ https://issues.apache.org/jira/browse/SPARK-39426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17564735#comment-17564735 ] Pablo Langa Blanco commented on SPARK-39426: I tested it on master and 3.3.0 and it seems to be fixed. > Subquery star select creates broken plan in case of self join > - > > Key: SPARK-39426 > URL: https://issues.apache.org/jira/browse/SPARK-39426 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: Denis >Priority: Major > > Subquery star select creates broken plan in case of self join > How to reproduce: > {code:java} > import spark.implicits._ > spark.sparkContext.setCheckpointDir(Files.createTempDirectory("some-prefix").toFile.toString) > val frame = Seq(1).toDF("id").checkpoint() > val joined = frame > .join(frame, Seq("id"), "left") > .select("id") > joined > .join(joined, Seq("id"), "left") > .as("a") > .select("a.*"){code} > This query throws exception: > {code:java} > Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved > attribute(s) id#7 missing from id#10,id#11 in operator !Project [id#7, > id#10]. Attribute(s) with the same name appear in the operation: id. 
Please > check if the right attribute(s) are used.; > Project [id#10, id#4] > +- SubqueryAlias a > +- Project [id#10, id#4] > +- Join LeftOuter, (id#4 = id#10) > :- Project [id#4] > : +- Project [id#7, id#4] > : +- Join LeftOuter, (id#4 = id#7) > : :- LogicalRDD [id#4], false > : +- LogicalRDD [id#7], false > +- Project [id#10] > +- !Project [id#7, id#10] > +- Join LeftOuter, (id#10 = id#11) > :- LogicalRDD [id#10], false > +- LogicalRDD [id#11], false at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:51) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:50) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:182) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:471) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:94) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:263) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:262) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:262) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:262) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:262) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:262) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at 
scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:262) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:262) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:262) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:262) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:262) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:262) > at scala.collection.Iterator.foreach(Iterator.scala:943)
[jira] [Commented] (SPARK-39623) partitioning by datestamp leads to wrong query on backend?
[ https://issues.apache.org/jira/browse/SPARK-39623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17564730#comment-17564730 ] Pablo Langa Blanco commented on SPARK-39623: I think the problem here is a misunderstanding of how lowerBound and upperBound work. These options only affect how Spark generates the partitions; all the data is still returned. They don't act as a filter. For example, with this configuration {code:java} .option("numPartitions", "4") .option("partitionColumn", "t") .option("lowerBound", "2022-07-10 00:00:00") .option("upperBound", "2022-07-10 23:59:00") {code} the expected filters of the queries are {code:sql} WHERE "t" < '2022-07-10 05:59:45' or "t" is null WHERE "t" >= '2022-07-10 05:59:45' AND "t" < '2022-07-10 11:59:30' WHERE "t" >= '2022-07-10 11:59:30' AND "t" < '2022-07-10 17:59:15' WHERE "t" >= '2022-07-10 17:59:15' {code} If you want to filter the data, you can do it as you have shown, or by doing something like this {code:java} df.where(col("t") >= lit("2022-07-10 00:00:00")) {code} and then Spark pushes down this filter and generates these partitions: {code:sql} WHERE (("t" IS NOT NULL) AND ("t" >= '2022-07-10 00:00:00.0')) AND ("t" < '2022-07-10 05:59:45' or "t" is null) WHERE (("t" IS NOT NULL) AND ("t" >= '2022-07-10 00:00:00.0')) AND ("t" >= '2022-07-10 05:59:45' AND "t" < '2022-07-10 11:59:30') WHERE (("t" IS NOT NULL) AND ("t" >= '2022-07-10 00:00:00.0')) AND ("t" >= '2022-07-10 11:59:30' AND "t" < '2022-07-10 17:59:15') WHERE (("t" IS NOT NULL) AND ("t" >= '2022-07-10 00:00:00.0')) AND ("t" >= '2022-07-10 17:59:15') {code} > partitioning by datestamp leads to wrong query on backend? > - > > Key: SPARK-39623 > URL: https://issues.apache.org/jira/browse/SPARK-39623 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Dmitry >Priority: Major > > Hello, > I am new to Apache Spark, so please bear with me. 
I would like to report what > seems to me a bug, but maybe I am just not understanding something. > My goal is to run data analysis on a spark cluster. Data is stored in > PostgreSQL DB. Tables contain timestamped entries (timestamp with time > zone). > The code looks like: > {code:python} > from pyspark.sql import SparkSession > spark = SparkSession \ > .builder \ > .appName("foo") \ > .config("spark.jars", "/opt/postgresql-42.4.0.jar") \ > .getOrCreate() > df = spark.read \ > .format("jdbc") \ > .option("url", "jdbc:postgresql://example.org:5432/postgres") \ > .option("dbtable", "billing") \ > .option("user", "user") \ > .option("driver", "org.postgresql.Driver") \ > .option("numPartitions", "4") \ > .option("partitionColumn", "datestamp") \ > .option("lowerBound", "2022-01-01 00:00:00") \ > .option("upperBound", "2022-06-26 23:59:59") \ > .option("fetchsize", 100) \ > .load() > t0 = time.time() > print("Number of entries is => ", df.count(), " Time to execute ", > time.time()-t0) > ... > {code} > datestamp is timestamp with time zone. > I see this query on the DB backend: > {code:java} > SELECT 1 FROM billinginfo WHERE "datestamp" < '2022-01-02 11:59:59.9375' or > "datestamp" is null > {code} > The table is huge and entries go way back before > 2022-01-02 11:59:59. So what ends up happening is that all workers but one > complete, and the remaining one continues to process that query, which, to me, > looks like it wants to get all the data before 2022-01-02 11:59:59. Which is > not what I intended. > I remedied this by changing to: > {code:python} > .option("dbtable", "(select * from billinginfo where datestamp > '2022-01-01 00:00:00') as foo") \ > {code} > That seems to have solved the issue, but it feels kludgy. Am I doing > something wrong, or is there a bug in the way partitioning queries are > generated? 
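The partition predicates quoted in the comment above are consistent with dividing the [lowerBound, upperBound) interval into numPartitions equal strides, with the first partition open below (plus the IS NULL catch-all) and the last open above. A simplified, illustrative Python model (an assumption about the general shape, not Spark's actual JDBC code) reproduces those WHERE clauses:

```python
from datetime import datetime

# Simplified model of JDBC stride partitioning: split [lower, upper) into
# num_partitions equal strides and emit one WHERE clause per partition.
# Assumes num_partitions >= 2 and whole-second strides, for illustration.
def jdbc_partition_predicates(col, lower, upper, num_partitions):
    fmt = "%Y-%m-%d %H:%M:%S"
    lo, hi = datetime.strptime(lower, fmt), datetime.strptime(upper, fmt)
    stride = (hi - lo) / num_partitions
    bounds = [lo + stride * i for i in range(1, num_partitions)]
    preds = [f'"{col}" < \'{bounds[0]}\' or "{col}" is null']
    for a, b in zip(bounds, bounds[1:]):
        preds.append(f'"{col}" >= \'{a}\' AND "{col}" < \'{b}\'')
    preds.append(f'"{col}" >= \'{bounds[-1]}\'')
    return preds

# matches the four WHERE clauses quoted in the comment above
for p in jdbc_partition_predicates("t", "2022-07-10 00:00:00",
                                   "2022-07-10 23:59:00", 4):
    print(p)
```

Note that every row still lands in exactly one partition: the bounds only shape the strides, which is why they do not filter anything.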
[jira] [Resolved] (SPARK-38714) Interval multiplication error
[ https://issues.apache.org/jira/browse/SPARK-38714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pablo Langa Blanco resolved SPARK-38714. Resolution: Resolved > Interval multiplication error > - > > Key: SPARK-38714 > URL: https://issues.apache.org/jira/browse/SPARK-38714 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 > Environment: branch-3.3, Java 8 > >Reporter: chong >Priority: Major > > Code gen has an error when multiplying an interval by a decimal. > > $SPARK_HOME/bin/spark-shell > > import org.apache.spark.sql.Row > import java.time.Duration > import java.time.Period > import org.apache.spark.sql.types._ > val data = Seq(Row(new java.math.BigDecimal("123456789.11"))) > val schema = StructType(Seq( > StructField("c1", DecimalType(9, 2)), > )) > val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema) > df.selectExpr("interval '100' second * c1").show(false) > The errors are: > *{color:#FF}java.lang.AssertionError: assertion failed:{color}* > Decimal$DecimalIsFractional > while compiling: > during phase: globalPhase=terminal, enteringPhase=jvm > library version: version 2.12.15 > compiler version: version 2.12.15 > reconstructed args: -classpath -Yrepl-class-based -Yrepl-outdir > /tmp/spark-83a0cda4-dd0b-472e-ad8b-a4b33b85f613/repl-06489815-5366-4aa0-9419-f01abda8d041 > last tree to typer: TypeTree(class Byte) > tree position: line 6 of > tree tpe: Byte > symbol: (final abstract) class Byte in package scala > symbol definition: final abstract class Byte extends (a ClassSymbol) > symbol package: scala > symbol owners: class Byte > call site: constructor $eval in object $eval in package $line21 > == Source file context for tree position == > 3 > 4 object $eval { > 5 lazy val $result = > $line21.$read.INSTANCE.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.res0 > 6 lazy val $print: {_}root{_}.java.lang.String = { > 7 $line21.$read.INSTANCE.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw > 8 > 9 "" > at > 
scala.reflect.internal.SymbolTable.throwAssertionError(SymbolTable.scala:185) > at scala.reflect.internal.Symbols$Symbol.completeInfo(Symbols.scala:1525) > at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1514) > at scala.reflect.internal.Symbols$Symbol.flatOwnerInfo(Symbols.scala:2353) > at > scala.reflect.internal.Symbols$ClassSymbol.companionModule0(Symbols.scala:3346) > at > scala.reflect.internal.Symbols$ClassSymbol.companionModule(Symbols.scala:3348) > at > scala.reflect.internal.Symbols$ModuleClassSymbol.sourceModule(Symbols.scala:3487) > at > scala.reflect.internal.Symbols.$anonfun$forEachRelevantSymbols$1$adapted(Symbols.scala:3802) > at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) > at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38) > at scala.reflect.internal.Symbols.markFlagsCompleted(Symbols.scala:3799) > at scala.reflect.internal.Symbols.markFlagsCompleted$(Symbols.scala:3805) > at scala.reflect.internal.SymbolTable.markFlagsCompleted(SymbolTable.scala:28) > at > scala.reflect.internal.pickling.UnPickler$Scan.finishSym$1(UnPickler.scala:324) > at > scala.reflect.internal.pickling.UnPickler$Scan.readSymbol(UnPickler.scala:342) > at > scala.reflect.internal.pickling.UnPickler$Scan.readSymbolRef(UnPickler.scala:645) > at > scala.reflect.internal.pickling.UnPickler$Scan.readType(UnPickler.scala:413) > at > scala.reflect.internal.pickling.UnPickler$Scan.$anonfun$readSymbol$10(UnPickler.scala:357) > at scala.reflect.internal.pickling.UnPickler$Scan.at(UnPickler.scala:188) > at > scala.reflect.internal.pickling.UnPickler$Scan.readSymbol(UnPickler.scala:357) > at > scala.reflect.internal.pickling.UnPickler$Scan.$anonfun$run$1(UnPickler.scala:96) > at scala.reflect.internal.pickling.UnPickler$Scan.run(UnPickler.scala:88) > at scala.reflect.internal.pickling.UnPickler.unpickle(UnPickler.scala:47) > at > 
scala.tools.nsc.symtab.classfile.ClassfileParser.unpickleOrParseInnerClasses(ClassfileParser.scala:1186) > at > scala.tools.nsc.symtab.classfile.ClassfileParser.parseClass(ClassfileParser.scala:468) > at > scala.tools.nsc.symtab.classfile.ClassfileParser.$anonfun$parse$2(ClassfileParser.scala:161) > at > scala.tools.nsc.symtab.classfile.ClassfileParser.$anonfun$parse$1(ClassfileParser.scala:147) > at > scala.tools.nsc.symtab.classfile.ClassfileParser.parse(ClassfileParser.scala:130) > at > scala.tools.nsc.symtab.SymbolLoaders$ClassfileLoader.doComplete(SymbolLoaders.scala:343) > at > scala.tools.nsc.symtab.SymbolLoaders$SymbolLoader.complete(SymbolLoaders.scala:250) > at > scala.tools.nsc.symtab.SymbolLoaders$SymbolLoader.load(SymbolLoaders.scala:269) > at
[jira] [Commented] (SPARK-38714) Interval multiplication error
[ https://issues.apache.org/jira/browse/SPARK-38714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17564619#comment-17564619 ] Pablo Langa Blanco commented on SPARK-38714: I have tested it in master and branch 3.3 and it's solved. > Interval multiplication error > - > > Key: SPARK-38714 > URL: https://issues.apache.org/jira/browse/SPARK-38714 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 > Environment: branch-3.3, Java 8 > >Reporter: chong >Priority: Major > > Code gen has an error when multiplying an interval by a decimal. > > $SPARK_HOME/bin/spark-shell > > import org.apache.spark.sql.Row > import java.time.Duration > import java.time.Period > import org.apache.spark.sql.types._ > val data = Seq(Row(new java.math.BigDecimal("123456789.11"))) > val schema = StructType(Seq( > StructField("c1", DecimalType(9, 2)), > )) > val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema) > df.selectExpr("interval '100' second * c1").show(false) > The errors are: > *{color:#FF}java.lang.AssertionError: assertion failed:{color}* > Decimal$DecimalIsFractional > while compiling: > during phase: globalPhase=terminal, enteringPhase=jvm > library version: version 2.12.15 > compiler version: version 2.12.15 > reconstructed args: -classpath -Yrepl-class-based -Yrepl-outdir > /tmp/spark-83a0cda4-dd0b-472e-ad8b-a4b33b85f613/repl-06489815-5366-4aa0-9419-f01abda8d041 > last tree to typer: TypeTree(class Byte) > tree position: line 6 of > tree tpe: Byte > symbol: (final abstract) class Byte in package scala > symbol definition: final abstract class Byte extends (a ClassSymbol) > symbol package: scala > symbol owners: class Byte > call site: constructor $eval in object $eval in package $line21 > == Source file context for tree position == > 3 > 4 object $eval { > 5 lazy val $result = > $line21.$read.INSTANCE.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.res0 > 6 lazy val $print: {_}root{_}.java.lang.String = { > 7 
$line21.$read.INSTANCE.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw > 8 > 9 "" > at > scala.reflect.internal.SymbolTable.throwAssertionError(SymbolTable.scala:185) > at scala.reflect.internal.Symbols$Symbol.completeInfo(Symbols.scala:1525) > at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1514) > at scala.reflect.internal.Symbols$Symbol.flatOwnerInfo(Symbols.scala:2353) > at > scala.reflect.internal.Symbols$ClassSymbol.companionModule0(Symbols.scala:3346) > at > scala.reflect.internal.Symbols$ClassSymbol.companionModule(Symbols.scala:3348) > at > scala.reflect.internal.Symbols$ModuleClassSymbol.sourceModule(Symbols.scala:3487) > at > scala.reflect.internal.Symbols.$anonfun$forEachRelevantSymbols$1$adapted(Symbols.scala:3802) > at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) > at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38) > at scala.reflect.internal.Symbols.markFlagsCompleted(Symbols.scala:3799) > at scala.reflect.internal.Symbols.markFlagsCompleted$(Symbols.scala:3805) > at scala.reflect.internal.SymbolTable.markFlagsCompleted(SymbolTable.scala:28) > at > scala.reflect.internal.pickling.UnPickler$Scan.finishSym$1(UnPickler.scala:324) > at > scala.reflect.internal.pickling.UnPickler$Scan.readSymbol(UnPickler.scala:342) > at > scala.reflect.internal.pickling.UnPickler$Scan.readSymbolRef(UnPickler.scala:645) > at > scala.reflect.internal.pickling.UnPickler$Scan.readType(UnPickler.scala:413) > at > scala.reflect.internal.pickling.UnPickler$Scan.$anonfun$readSymbol$10(UnPickler.scala:357) > at scala.reflect.internal.pickling.UnPickler$Scan.at(UnPickler.scala:188) > at > scala.reflect.internal.pickling.UnPickler$Scan.readSymbol(UnPickler.scala:357) > at > scala.reflect.internal.pickling.UnPickler$Scan.$anonfun$run$1(UnPickler.scala:96) > at scala.reflect.internal.pickling.UnPickler$Scan.run(UnPickler.scala:88) > at 
scala.reflect.internal.pickling.UnPickler.unpickle(UnPickler.scala:47) > at > scala.tools.nsc.symtab.classfile.ClassfileParser.unpickleOrParseInnerClasses(ClassfileParser.scala:1186) > at > scala.tools.nsc.symtab.classfile.ClassfileParser.parseClass(ClassfileParser.scala:468) > at > scala.tools.nsc.symtab.classfile.ClassfileParser.$anonfun$parse$2(ClassfileParser.scala:161) > at > scala.tools.nsc.symtab.classfile.ClassfileParser.$anonfun$parse$1(ClassfileParser.scala:147) > at > scala.tools.nsc.symtab.classfile.ClassfileParser.parse(ClassfileParser.scala:130) > at > scala.tools.nsc.symtab.SymbolLoaders$ClassfileLoader.doComplete(SymbolLoaders.scala:343) > at > scala.tools.nsc.symtab.SymbolLoaders$SymbolLoader.complete(SymbolLoaders.scala:250) > at >
[jira] [Commented] (SPARK-37895) Error while joining two tables with non-english field names
[ https://issues.apache.org/jira/browse/SPARK-37895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500419#comment-17500419 ] Pablo Langa Blanco commented on SPARK-37895: I'm working on a fix > Error while joining two tables with non-english field names > --- > > Key: SPARK-37895 > URL: https://issues.apache.org/jira/browse/SPARK-37895 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.3.0 >Reporter: Marina Krasilnikova >Priority: Minor > > While trying to join two tables with non-English field names in PostgreSQL > with a query like > "select view1.`Имя1`, view1.`Имя2`, view2.`Имя3` from view1 left join view2 > on view1.`Имя2`=view2.`Имя4`" > we get an error which says that there is no field "`Имя4`" (the field name is > surrounded by backticks). > It appears that, to get the data from the second table, Spark constructs a query like > SELECT "Имя3","Имя4" FROM "public"."tab2" WHERE ("`Имя4`" IS NOT NULL) > and these backticks are redundant in the WHERE clause. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
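The bug above boils down to translating Spark's backtick-quoted identifiers into the target database's quoting style before building the pushdown query. A minimal sketch of that idea in Python (function names are hypothetical illustrations, not Spark's actual JdbcDialect API):

```python
def strip_backticks(name: str) -> str:
    """Remove Spark-style backtick quoting from an identifier, if present."""
    if name.startswith("`") and name.endswith("`") and len(name) >= 2:
        return name[1:-1]
    return name

def quote_for_postgres(name: str) -> str:
    """Quote an identifier the PostgreSQL way: double quotes, with embedded
    double quotes doubled. The Spark backticks must be stripped first;
    otherwise they end up inside the quoted name, producing the nonexistent
    column "`Имя4`" described in the report."""
    bare = strip_backticks(name)
    return '"' + bare.replace('"', '""') + '"'

print(quote_for_postgres("`Имя4`"))  # "Имя4"
```

The key point is that backticks are Spark's quoting convention, not part of the identifier, so they must never survive into the SQL sent to the external database.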
[jira] [Updated] (SPARK-36698) Allow expand 'quotedRegexColumnNames' in all expressions when it’s expanded to one named expression
[ https://issues.apache.org/jira/browse/SPARK-36698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pablo Langa Blanco updated SPARK-36698: --- Description: With {code:java} spark.sql.parser.quotedRegexColumnNames=true{code} regular expressions are not allowed in expressions like {code:java} SELECT `col_.*`/exp FROM (SELECT 3 AS col_a, 1 as exp){code} I propose to improve this behavior and allow this regular expression when it expands to only one named expression. It’s the same behavior in Hive. *Example 1:* {code:java} SELECT `col_.*`/exp FROM (SELECT 3 AS col_a, 1 as exp){code} col_.* expands to col_a: OK *Example 2:* {code:java} SELECT `col_.*`/col_b FROM (SELECT 3 AS col_a, 1 as col_b){code} col_.* expands to (col_a, col_b): Fail like now Related with SPARK-36488 was: With spark.sql.parser.quotedRegexColumnNames=true regular expressions are not allowed in expressions like {code:java} SELECT `col_.*`/exp FROM (SELECT 3 AS col_a, 1 as exp){code} I propose to improve this behavior and allow this regular expression when it expands to only one named expression. It’s the same behavior in Hive. 
*Example 1:* {code:java} SELECT `col_.*`/exp FROM (SELECT 3 AS col_a, 1 as exp){code} col_.* expands to col_a: OK *Example 2:* {code:java} SELECT `col_.*`/col_b FROM (SELECT 3 AS col_a, 1 as col_b){code} col_.* expands to (col_a, col_b): Fail like now Related with SPARK-36488 > Allow expand 'quotedRegexColumnNames' in all expressions when it’s expanded > to one named expression > --- > > Key: SPARK-36698 > URL: https://issues.apache.org/jira/browse/SPARK-36698 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Pablo Langa Blanco >Priority: Minor > > With > {code:java} > spark.sql.parser.quotedRegexColumnNames=true{code} > regular expressions are not allowed in expressions like > {code:java} > SELECT `col_.*`/exp FROM (SELECT 3 AS col_a, 1 as exp){code} > I propose to improve this behavior and allow this regular expression when it > expands to only one named expression. It’s the same behavior in Hive. > *Example 1:* > {code:java} > SELECT `col_.*`/exp FROM (SELECT 3 AS col_a, 1 as exp){code} > col_.* expands to col_a: OK > *Example 2:* > {code:java} > SELECT `col_.*`/col_b FROM (SELECT 3 AS col_a, 1 as col_b){code} > col_.* expands to (col_a, col_b): Fail like now > > Related with SPARK-36488 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
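The proposed rule can be illustrated outside Spark: expand the quoted regex against the available column names and accept it inside an arbitrary expression only when exactly one column matches. A sketch of that resolution logic (illustrative only, not Spark's analyzer code):

```python
import re

def expand_regex_column(pattern: str, columns: list[str]) -> str:
    """Expand a quoted-regex column reference; per the proposal it is
    allowed inside an expression only if it resolves to exactly one
    named column."""
    matches = [c for c in columns if re.fullmatch(pattern, c)]
    if len(matches) != 1:
        raise ValueError(f"{pattern!r} expands to {matches}; "
                         "expressions require exactly one match")
    return matches[0]

# Example 1: col_.* matches only col_a, so `col_.*`/exp would be allowed.
print(expand_regex_column("col_.*", ["col_a", "exp"]))  # col_a
```

Example 2 from the description fails under the same function, because `col_.*` matches both col_a and col_b.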
[jira] [Commented] (SPARK-36698) Allow expand 'quotedRegexColumnNames' in all expressions when it’s expanded to one named expression
[ https://issues.apache.org/jira/browse/SPARK-36698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412220#comment-17412220 ] Pablo Langa Blanco commented on SPARK-36698: I'm working on it > Allow expand 'quotedRegexColumnNames' in all expressions when it’s expanded > to one named expression > --- > > Key: SPARK-36698 > URL: https://issues.apache.org/jira/browse/SPARK-36698 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Pablo Langa Blanco >Priority: Minor > > With spark.sql.parser.quotedRegexColumnNames=true regular expressions are not > allowed in expressions like > {code:java} > SELECT `col_.*`/exp FROM (SELECT 3 AS col_a, 1 as exp){code} > I propose to improve this behavior and allow this regular expression when it > expands to only one named expression. It’s the same behavior in Hive. > *Example 1:* > {code:java} > SELECT `col_.*`/exp FROM (SELECT 3 AS col_a, 1 as exp){code} > col_.* expands to col_a: OK > *Example 2:* > {code:java} > SELECT `col_.*`/col_b FROM (SELECT 3 AS col_a, 1 as col_b){code} > col_.* expands to (col_a, col_b): Fail like now > > Related with SPARK-36488 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36698) Allow expand 'quotedRegexColumnNames' in all expressions when it’s expanded to one named expression
Pablo Langa Blanco created SPARK-36698: -- Summary: Allow expand 'quotedRegexColumnNames' in all expressions when it’s expanded to one named expression Key: SPARK-36698 URL: https://issues.apache.org/jira/browse/SPARK-36698 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Pablo Langa Blanco With spark.sql.parser.quotedRegexColumnNames=true regular expressions are not allowed in expressions like {code:java} SELECT `col_.*`/exp FROM (SELECT 3 AS col_a, 1 as exp){code} I propose to improve this behavior and allow this regular expression when it expands to only one named expression. It’s the same behavior in Hive. *Example 1:* {code:java} SELECT `col_.*`/exp FROM (SELECT 3 AS col_a, 1 as exp){code} col_.* expands to col_a: OK *Example 2:* {code:java} SELECT `col_.*`/col_b FROM (SELECT 3 AS col_a, 1 as col_b){code} col_.* expands to (col_a, col_b): Fail like now Related with SPARK-36488 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36488) "Invalid usage of '*' in expression" error due to the feature of 'quotedRegexColumnNames' in some scenarios.
[ https://issues.apache.org/jira/browse/SPARK-36488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403329#comment-17403329 ] Pablo Langa Blanco commented on SPARK-36488: Yes, I agree that it's not very intuitive and it has room for improvement. I think it's worth taking a closer look at supporting more expressions and getting closer to the Hive feature. Thanks! > "Invalid usage of '*' in expression" error due to the feature of > 'quotedRegexColumnNames' in some scenarios. > > > Key: SPARK-36488 > URL: https://issues.apache.org/jira/browse/SPARK-36488 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.4.8, 3.1.2 >Reporter: merrily01 >Priority: Major > > In some cases, the error happens when the following property is set. > {code:java} > spark.sql("set spark.sql.parser.quotedRegexColumnNames=true") > {code} > *case 1:* > {code:java} > spark-sql> create table tb_test as select 1 as col_a, 2 as col_b; > spark-sql> select `tb_test`.`col_a` from tb_test; > 1 > spark-sql> set spark.sql.parser.quotedRegexColumnNames=true; > spark-sql> select `tb_test`.`col_a` from tb_test; > Error in query: Invalid usage of '*' in expression 'unresolvedextractvalue' > {code} > > *case 2:* > {code:java} > > select `col_a`/`col_b` as `col_c` from (select 3 as `col_a` , > 3.14 as `col_b`); > 0.955414 > spark-sql> set spark.sql.parser.quotedRegexColumnNames=true; > spark-sql> select `col_a`/`col_b` as `col_c` from (select 3 as `col_a` , > 3.14 as `col_b`); > Error in query: Invalid usage of '*' in expression 'divide' > {code} > > This problem exists in 3.X, 2.4.X and master versions. 
> > Related issue : > https://issues.apache.org/jira/browse/SPARK-12139 > (As can be seen in the latest comments, some people have encountered the same > problem) > > Similar problems: > https://issues.apache.org/jira/browse/SPARK-28897 > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35933) PartitionFilters and pushFilters not applied to window functions
[ https://issues.apache.org/jira/browse/SPARK-35933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17401895#comment-17401895 ] Pablo Langa Blanco commented on SPARK-35933: can we close this ticket? > PartitionFilters and pushFilters not applied to window functions > > > Key: SPARK-35933 > URL: https://issues.apache.org/jira/browse/SPARK-35933 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.8, 3.1.2 >Reporter: Shreyas Kothavade >Priority: Major > > Spark does not apply partition and pushed filters when the partition by > column and window function partition columns are not the same. For example, > in the code below, the data frame is created with a partition on "id". And I > use the partitioned data frame to calculate lag which is partitioned by > "name". In this case, the query plan shows the partitionFilters and pushed > Filters as empty. > {code:java} > spark > .createDataFrame( > Seq( > Person( > 1, > "Andy", > new Timestamp(1499955986039L), > new Timestamp(1499955982342L) > ), > Person( > 2, > "Jeff", > new Timestamp(1499955986339L), > new Timestamp(149995598L) > ) > ) > ) > .write > .partitionBy("id") > .mode(SaveMode.Append) > .parquet("spark-warehouse/people") > val dfPeople = > spark.read.parquet("spark-warehouse/people") > dfPeople > .select( > $"id", > $"name", > lag(col("ts2"), 1).over(Window.partitionBy("name").orderBy("ts")) > ) > .filter($"id" === 1) > .explain() > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
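The behavior in this report follows from how predicate pushdown interacts with window operators: a filter can only move below a window when every column it references is among the window's partitionBy columns, since otherwise pushing it could change which rows share a partition and alter lag/lead results. Here the filter is on "id" while the window partitions by "name", so nothing is pushed. The condition can be sketched as (a simplification, not Spark's actual optimizer code):

```python
def can_push_through_window(filter_cols: set[str],
                            window_partition_cols: set[str]) -> bool:
    """A filter may move below a window operator only if every column it
    references is among the window's partitionBy columns; otherwise the
    filter would remove rows that the window function still needs."""
    return filter_cols <= window_partition_cols

# The example in the report: filter on id, window partitioned by name.
print(can_push_through_window({"id"}, {"name"}))  # False
# Had the window been partitioned by id, the filter could be pushed:
print(can_push_through_window({"id"}, {"id"}))    # True
```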
[jira] [Commented] (SPARK-36488) "Invalid usage of '*' in expression" error due to the feature of 'quotedRegexColumnNames' in some scenarios.
[ https://issues.apache.org/jira/browse/SPARK-36488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17401401#comment-17401401 ] Pablo Langa Blanco commented on SPARK-36488: Hi [~merrily01], I think these are not bugs. As you can see in the documentation ([https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select.html]), when you set the parameter spark.sql.parser.quotedRegexColumnNames=true, “quoted identifiers (using backticks) in SELECT statement are interpreted as regular expressions and SELECT statement can take regex-based column specification”. In case 1, this means that in the expression `tb_test`.`col_a`, tb_test is treated as a regular expression that represents a column. The “column.field” syntax is used to access StructType columns, and regular expressions are not allowed in that position. In case 2, a regex can resolve to more than one column; for example, `col_*` resolves to col_a and col_b, and it does not make sense for the operands of a division to be lists of columns, so this is not allowed. I'm going to open a PR to improve the error message and avoid confusion. > "Invalid usage of '*' in expression" error due to the feature of > 'quotedRegexColumnNames' in some scenarios. > > > Key: SPARK-36488 > URL: https://issues.apache.org/jira/browse/SPARK-36488 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.4.8, 3.1.2 >Reporter: merrily01 >Priority: Major > > In some cases, the error happens when the following property is set. 
> {code:java} > spark.sql("set spark.sql.parser.quotedRegexColumnNames=true") > {code} > *case 1:* > {code:java} > spark-sql> create table tb_test as select 1 as col_a, 2 as col_b; > spark-sql> select `tb_test`.`col_a` from tb_test; > 1 > spark-sql> set spark.sql.parser.quotedRegexColumnNames=true; > spark-sql> select `tb_test`.`col_a` from tb_test; > Error in query: Invalid usage of '*' in expression 'unresolvedextractvalue' > {code} > > *case 2:* > {code:java} > > select `col_a`/`col_b` as `col_c` from (select 3 as `col_a` , > 3.14 as `col_b`); > 0.955414 > spark-sql> set spark.sql.parser.quotedRegexColumnNames=true; > spark-sql> select `col_a`/`col_b` as `col_c` from (select 3 as `col_a` , > 3.14 as `col_b`); > Error in query: Invalid usage of '*' in expression 'divide' > {code} > > This problem exists in 3.X, 2.4.X and master versions. > > Related issue : > https://issues.apache.org/jira/browse/SPARK-12139 > (As can be seen in the latest comments, some people have encountered the same > problem) > > Similar problems: > https://issues.apache.org/jira/browse/SPARK-28897 > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36453) Improve consistency processing floating point special literals
[ https://issues.apache.org/jira/browse/SPARK-36453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395658#comment-17395658 ] Pablo Langa Blanco commented on SPARK-36453: I'm working on it > Improve consistency processing floating point special literals > -- > > Key: SPARK-36453 > URL: https://issues.apache.org/jira/browse/SPARK-36453 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Pablo Langa Blanco >Priority: Minor > > Special literals in floating point are not handled consistently between cast and JSON > expressions > > {code:java} > scala> spark.sql("SELECT CAST('+Inf' as Double)").show > ++ > |CAST(+Inf AS DOUBLE)| > ++ > | Infinity| > ++ > {code} > > {code:java} > scala> val schema = StructType(StructField("a", DoubleType) :: Nil) > scala> Seq("""{"a" : > "+Inf"}""").toDF("col1").select(from_json(col("col1"),schema)).show > +---+ > |from_json(col1)| > +---+ > | {null}| > +---+ > scala> Seq("""{"a" : "+Inf"}""").toDF("col").withColumn("col", > from_json(col("col"), StructType.fromDDL("a > DOUBLE"))).write.json("/tmp/jsontests12345") > scala> > spark.read.schema(StructType(Seq(StructField("col", schema)))).json("/tmp/jsontests12345").show > +--+ > | col| > +--+ > |{null}| > +--+ > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36453) Improve consistency processing floating point special literals
Pablo Langa Blanco created SPARK-36453: -- Summary: Improve consistency processing floating point special literals Key: SPARK-36453 URL: https://issues.apache.org/jira/browse/SPARK-36453 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Pablo Langa Blanco Special literals in floating point are not handled consistently between cast and JSON expressions {code:java} scala> spark.sql("SELECT CAST('+Inf' as Double)").show ++ |CAST(+Inf AS DOUBLE)| ++ | Infinity| ++ {code} {code:java} scala> val schema = StructType(StructField("a", DoubleType) :: Nil) scala> Seq("""{"a" : "+Inf"}""").toDF("col1").select(from_json(col("col1"),schema)).show +---+ |from_json(col1)| +---+ | {null}| +---+ scala> Seq("""{"a" : "+Inf"}""").toDF("col").withColumn("col", from_json(col("col"), StructType.fromDDL("a DOUBLE"))).write.json("/tmp/jsontests12345") scala> spark.read.schema(StructType(Seq(StructField("col", schema)))).json("/tmp/jsontests12345").show +--+ | col| +--+ |{null}| +--+ {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
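The inconsistency reported here is that Spark's cast path accepts the spelling "+Inf" while its JSON path does not. A small Python sketch of a lenient parser for the special spellings involved, illustrating the normalization the ticket asks for (this is not Spark's code):

```python
import math

# Special double literals that a lenient parser could accept uniformly.
_SPECIALS = {
    "inf": math.inf, "+inf": math.inf, "infinity": math.inf,
    "+infinity": math.inf, "-inf": -math.inf, "-infinity": -math.inf,
    "nan": math.nan,
}

def parse_double(s: str) -> float:
    """Accept the same special spellings for a double regardless of which
    code path (cast vs. JSON) is doing the parsing."""
    key = s.strip().lower()
    if key in _SPECIALS:
        return _SPECIALS[key]
    return float(s)

print(parse_double("+Inf"))  # inf
```

The point is that consistency comes from routing both code paths through one table of accepted spellings, rather than each path keeping its own.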
[jira] [Commented] (SPARK-35320) from_json cannot parse maps with timestamp as key
[ https://issues.apache.org/jira/browse/SPARK-35320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17345782#comment-17345782 ] Pablo Langa Blanco commented on SPARK-35320: I'm taking a look at this > from_json cannot parse maps with timestamp as key > - > > Key: SPARK-35320 > URL: https://issues.apache.org/jira/browse/SPARK-35320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1, 3.1.1 > Environment: * Java 11 > * Spark 3.0.1/3.1.1 > * Scala 2.12 >Reporter: Vincenzo Cerminara >Priority: Minor > > I have a json that contains a {{map}} like the following > {code:json} > { > "map": { > "2021-05-05T20:05:08": "sampleValue" > } > } > {code} > The key of the map is a string containing a formatted timestamp and I want to > parse it as a Java Map<Instant, String> using the from_json > Spark SQL function (see the {{Sample}} class in the code below). > {code:java} > import org.apache.spark.sql.Dataset; > import org.apache.spark.sql.Encoders; > import org.apache.spark.sql.Row; > import org.apache.spark.sql.SparkSession; > import java.io.Serializable; > import java.time.Instant; > import java.util.List; > import java.util.Map; > import static org.apache.spark.sql.functions.*; > public class TimestampAsJsonMapKey { > public static class Sample implements Serializable { > private Map<Instant, String> map; > > public Map<Instant, String> getMap() { > return map; > } > > public void setMap(Map<Instant, String> map) { > this.map = map; > } > } > public static class InvertedSample implements Serializable { > private Map<String, Instant> map; > > public Map<String, Instant> getMap() { > return map; > } > > public void setMap(Map<String, Instant> map) { > this.map = map; > } > } > public static void main(String[] args) { > final SparkSession spark = SparkSession > .builder() > .appName("Timestamp As Json Map Key Test") > .master("local[1]") > .getOrCreate(); > workingTest(spark); > notWorkingTest(spark); > } > private static void workingTest(SparkSession spark) { > //language=JSON > final String invertedSampleJson = "{ \"map\": { \"sampleValue\": 
\"2021-05-05T20:05:08\" } }"; > final Dataset samplesDf = > spark.createDataset(List.of(invertedSampleJson), Encoders.STRING()); > final Dataset parsedDf = > samplesDf.select(from_json(col("value"), > Encoders.bean(InvertedSample.class).schema())); > parsedDf.show(false); > } > private static void notWorkingTest(SparkSession spark) { > //language=JSON > final String sampleJson = "{ \"map\": { \"2021-05-05T20:05:08\": > \"sampleValue\" } }"; > final Dataset samplesDf = > spark.createDataset(List.of(sampleJson), Encoders.STRING()); > final Dataset parsedDf = > samplesDf.select(from_json(col("value"), > Encoders.bean(Sample.class).schema())); > parsedDf.show(false); > } > } > {code} > When I run the {{notWorkingTest}} method it fails with the following > exception: > {noformat} > Exception in thread "main" java.lang.ClassCastException: class > org.apache.spark.unsafe.types.UTF8String cannot be cast to class > java.lang.Long (org.apache.spark.unsafe.types.UTF8String is in unnamed module > of loader 'app'; java.lang.Long is in module java.base of loader 'bootstrap') > at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:107) > at > org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$8$adapted(Cast.scala:297) > at > org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:285) > at > org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$7(Cast.scala:297) > at > org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$12(Cast.scala:329) > at > org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:285) > at > org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$11(Cast.scala:321) > at > org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$14(Cast.scala:359) > at > org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:285) > at > org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$13(Cast.scala:352) > 
at > org.apache.spark.sql.catalyst.expressions.CastBase.nullSafeEval(Cast.scala:815) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461) > at > org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:156) > at >
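The failure above comes from the fact that JSON object keys are always strings, so a map with timestamp-typed keys requires an explicit string-to-timestamp conversion of each key after parsing, which is the step that breaks inside from_json for this schema. In plain Python terms, the conversion the reporter expects looks like this (illustrative only, not Spark's code):

```python
import json
from datetime import datetime

def parse_map_with_timestamp_keys(payload: str) -> dict:
    """Parse a JSON document of the shape {"map": {"<iso timestamp>": value}}
    and cast each string key to a real timestamp. JSON itself has no
    timestamp type, so the cast must happen after json.loads."""
    raw = json.loads(payload)["map"]
    return {datetime.fromisoformat(k): v for k, v in raw.items()}

m = parse_map_with_timestamp_keys('{ "map": { "2021-05-05T20:05:08": "sampleValue" } }')
print(m)  # {datetime.datetime(2021, 5, 5, 20, 5, 8): 'sampleValue'}
```

The inverted case (timestamps as values) works in Spark because casting a value is an ordinary column cast, while casting a map key exercises a separate, less-tested code path.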
[jira] [Commented] (SPARK-35207) hash() and other hash builtins do not normalize negative zero
[ https://issues.apache.org/jira/browse/SPARK-35207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338698#comment-17338698 ] Pablo Langa Blanco commented on SPARK-35207: Hi [~tarmstrong], I have read about signed zero and it's a little bit tricky. From what I have read about IEEE 754 and its implementation in Java, I agree with you that it's not consistent to have different hash values for 0 and -0. This applies to both double and float. Here is an extract from IEEE 754: "The two zeros are distinguishable arithmetically only by either division-by-zero (producing appropriately signed infinities) or else by the CopySign function recommended by IEEE 754/854. Infinities, SNaNs, NaNs and Subnormal numbers necessitate four more special cases" I will raise a PR with this. Regards > hash() and other hash builtins do not normalize negative zero > - > > Key: SPARK-35207 > URL: https://issues.apache.org/jira/browse/SPARK-35207 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1 >Reporter: Tim Armstrong >Priority: Major > Labels: correctness > > I would generally expect that {{x = y => hash( x ) = hash( y )}}. However +-0 > hash to different values for floating point types. > {noformat} > scala> spark.sql("select hash(cast('0.0' as double)), hash(cast('-0.0' as > double))").show > +-+--+ > |hash(CAST(0.0 AS DOUBLE))|hash(CAST(-0.0 AS DOUBLE))| > +-+--+ > | -1670924195|-853646085| > +-+--+ > scala> spark.sql("select cast('0.0' as double) == cast('-0.0' as > double)").show > ++ > |(CAST(0.0 AS DOUBLE) = CAST(-0.0 AS DOUBLE))| > ++ > |true| > ++ > {noformat} > I'm not sure how likely this is to cause issues in practice, since only a > limited number of calculations can produce -0 and joining or aggregating with > floating point keys is a bad practice as a general rule, but I think it would > be safer if we normalised -0.0 to +0.0. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
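The fix direction discussed in this thread is to normalize -0.0 to +0.0 before hashing, so that values that compare equal also hash equally. The normalization itself is tiny; a Python sketch on the raw bit patterns (Python's own hash() already normalizes the two zeros, so the example inspects the bits directly):

```python
import math
import struct

def normalize_zero(x: float) -> float:
    """Map -0.0 to +0.0 so that equal values hash equally. Under IEEE 754
    round-to-nearest, (-0.0) + 0.0 == +0.0, and -0.0 == 0.0 is true, so
    the branch catches both zeros and leaves every other value alone."""
    return x + 0.0 if x == 0.0 else x

def double_bits(x: float) -> bytes:
    """The 8 raw bytes a bit-based hash (like a Murmur3 over the IEEE 754
    encoding) would consume."""
    return struct.pack("<d", x)

# Without normalization the two zeros have different bit patterns,
# so any hash over the raw bytes differs:
print(double_bits(0.0) == double_bits(-0.0))                    # False
# After normalization the bit patterns (and thus the hashes) agree:
print(double_bits(normalize_zero(-0.0)) == double_bits(0.0))    # True
```

This mirrors the quoted IEEE 754 note: the two zeros are arithmetically indistinguishable except via division or CopySign, which is exactly why adding +0.0 erases the sign.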
[jira] [Commented] (SPARK-35249) to_timestamp can't parse 6 digit microsecond SSSSSS
[ https://issues.apache.org/jira/browse/SPARK-35249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338139#comment-17338139 ] Pablo Langa Blanco commented on SPARK-35249: Hi [~toopt4], I think there are two different issues here. First, the correct pattern for this string is as follows. {code:java} select x, to_timestamp(x,"MMM dd yyyy hh:mm:ss.SSSSSSa") from (select 'Apr 13 2021 12:00:00.001000AM' x); {code} The other problem is that in Spark 2.4.6 this pattern (SSSSSS) with more than 5 characters apparently doesn't work. But I have tested it in version 3.1.1 and it works fine. Regards > to_timestamp can't parse 6 digit microsecond SSSSSS > --- > > Key: SPARK-35249 > URL: https://issues.apache.org/jira/browse/SPARK-35249 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 2.4.6 >Reporter: t oo >Priority: Major > > spark-sql> select x, to_timestamp(x,"MMM dd yyyy hh:mm:ss.SSSSSS") from > (select 'Apr 13 2021 12:00:00.001000AM' x); > Apr 13 2021 12:00:00.001000AM NULL > > Why doesn't the to_timestamp work? > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
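The pattern correction (the input carries a trailing AM/PM marker, so the pattern needs an am-pm field after the fraction) can be checked with Python's strptime, where %f plays the role of the six-digit fraction and %I/%p handle the 12-hour clock:

```python
from datetime import datetime

# %f consumes the six-digit microsecond field; %I with %p interprets
# "12...AM" as midnight (hour 0), matching the to_timestamp fix above.
dt = datetime.strptime("Apr 13 2021 12:00:00.001000AM",
                       "%b %d %Y %I:%M:%S.%f%p")
print(dt)  # 2021-04-13 00:00:00.001000
```

Without the %p at the end, the parse fails for the same reason the reported to_timestamp call returned NULL: the trailing "AM" is left unconsumed.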
[jira] [Commented] (SPARK-34464) `first` function is sorting the dataset while sometimes it is used to get "any value"
[ https://issues.apache.org/jira/browse/SPARK-34464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17315782#comment-17315782 ] Pablo Langa Blanco commented on SPARK-34464: Hi [~lfruleux] , Thank you for reporting it! I think it's a good point to try to improve. Yeah, Last and other functions are in the same situation, so if we go forward with these changes we could apply the same to those functions. All help is welcome! :D > `first` function is sorting the dataset while sometimes it is used to get > "any value" > - > > Key: SPARK-34464 > URL: https://issues.apache.org/jira/browse/SPARK-34464 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Louis Fruleux >Priority: Minor > > When one wants to groupBy and take any value (not necessarily the first), one > usually uses > [first|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L485] > aggregation function. > Unfortunately, this method uses a `SortAggregate` for some data types, which > is not always necessary and might impact performances. Is this the desired > behavior? > > > {code:java} > Current behavior: > val df = Seq((0, "value")).toDF("key", "value") > df.groupBy("key").agg(first("value")).explain() > /* > == Physical Plan == > SortAggregate(key=key#342, functions=first(value#343, false)) > +- *(2) Sort key#342 ASC NULLS FIRST, false, 0 > +- Exchange hashpartitioning(key#342, 200) > +- SortAggregate(key=key#342, functions=partial_first(value#343, > false)) > +- *(1) Sort key#342 ASC NULLS FIRST, false, 0 > +- LocalTableScan key#342, value#343 > */ > {code} > > My understanding of the source code does not allow me to fully understand why > this is the current behavior. > The solution might be to implement a new aggregate function. But the code > would be highly similar to the first one. 
And I don't fully understand why > this > [createAggregate|https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala#L45] > method falls back to SortAggregate. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34464) `first` function is sorting the dataset while sometimes it is used to get "any value"
[ https://issues.apache.org/jira/browse/SPARK-34464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17315152#comment-17315152 ] Pablo Langa Blanco commented on SPARK-34464: I've opened SPARK-34961 because I think it's an improvement, not a bug, so I will raise a PR there. > `first` function is sorting the dataset while sometimes it is used to get > "any value" > - > > Key: SPARK-34464 > URL: https://issues.apache.org/jira/browse/SPARK-34464 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Louis Fruleux >Priority: Minor > > When one wants to groupBy and take any value (not necessarily the first), one > usually uses > [first|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L485] > aggregation function. > Unfortunately, this method uses a `SortAggregate` for some data types, which > is not always necessary and might impact performances. Is this the desired > behavior? > > > {code:java} > Current behavior: > val df = Seq((0, "value")).toDF("key", "value") > df.groupBy("key").agg(first("value")).explain() > /* > == Physical Plan == > SortAggregate(key=key#342, functions=first(value#343, false)) > +- *(2) Sort key#342 ASC NULLS FIRST, false, 0 > +- Exchange hashpartitioning(key#342, 200) > +- SortAggregate(key=key#342, functions=partial_first(value#343, > false)) > +- *(1) Sort key#342 ASC NULLS FIRST, false, 0 > +- LocalTableScan key#342, value#343 > */ > {code} > > My understanding of the source code does not allow me to fully understand why > this is the current behavior. > The solution might be to implement a new aggregate function. But the code > would be highly similar to the first one. And if I don't fully understand why > this > [createAggregate|https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala#L45] > method falls back to SortAggregate. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34464) `first` function is sorting the dataset while sometimes it is used to get "any value"
[ https://issues.apache.org/jira/browse/SPARK-34464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pablo Langa Blanco resolved SPARK-34464. Resolution: Not A Bug > `first` function is sorting the dataset while sometimes it is used to get > "any value" > - > > Key: SPARK-34464 > URL: https://issues.apache.org/jira/browse/SPARK-34464 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Louis Fruleux >Priority: Minor > > When one wants to groupBy and take any value (not necessarily the first), one > usually uses > [first|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L485] > aggregation function. > Unfortunately, this method uses a `SortAggregate` for some data types, which > is not always necessary and might impact performances. Is this the desired > behavior? > > > {code:java} > Current behavior: > val df = Seq((0, "value")).toDF("key", "value") > df.groupBy("key").agg(first("value")).explain() > /* > == Physical Plan == > SortAggregate(key=key#342, functions=first(value#343, false)) > +- *(2) Sort key#342 ASC NULLS FIRST, false, 0 > +- Exchange hashpartitioning(key#342, 200) > +- SortAggregate(key=key#342, functions=partial_first(value#343, > false)) > +- *(1) Sort key#342 ASC NULLS FIRST, false, 0 > +- LocalTableScan key#342, value#343 > */ > {code} > > My understanding of the source code does not allow me to fully understand why > this is the current behavior. > The solution might be to implement a new aggregate function. But the code > would be highly similar to the first one. And if I don't fully understand why > this > [createAggregate|https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala#L45] > method falls back to SortAggregate. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34961) Migrate First function from DeclarativeAggregate to TypedImperativeAggregate to improve performance
Pablo Langa Blanco created SPARK-34961: -- Summary: Migrate First function from DeclarativeAggregate to TypedImperativeAggregate to improve performance Key: SPARK-34961 URL: https://issues.apache.org/jira/browse/SPARK-34961 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Pablo Langa Blanco The main objective of this change is to improve performance in some cases. We have three possibilities when we plan an aggregation. In the first case, with mutable primitive types, HashAggregate is used. When we are not using these types we have two options. If the function implements TypedImperativeAggregate, we use ObjectHashAggregate. Otherwise, we use SortAggregate, which is less efficient. In this PR I propose to migrate the First function to implement TypedImperativeAggregate to take advantage of this feature (ObjectHashAggregateExec). This Jira is related to SPARK-34464 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34464) `first` function is sorting the dataset while sometimes it is used to get "any value"
[ https://issues.apache.org/jira/browse/SPARK-34464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311845#comment-17311845 ] Pablo Langa Blanco commented on SPARK-34464: Hi [~lfruleux], Here is a link that explains well when each type of aggregation is applied: [https://www.waitingforcode.com/apache-spark-sql/aggregations-execution-apache-spark-sql/read] In the case you describe, two things make the aggregation fall back to SortAggregate. First, the aggregation types are not primitive mutable types (which HashAggregate requires). The next fallback would be ObjectHashAggregate, but the first function is not supported there because it does not implement TypedImperativeAggregate, so it falls back to SortAggregate. I don't know whether there is a reason for this; I'm going to look into whether First can implement TypedImperativeAggregate so it can use ObjectHashAggregate. Thanks! > `first` function is sorting the dataset while sometimes it is used to get > "any value" > - > > Key: SPARK-34464 > URL: https://issues.apache.org/jira/browse/SPARK-34464 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Louis Fruleux >Priority: Minor > > When one wants to groupBy and take any value (not necessarily the first), one > usually uses > [first|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L485] > aggregation function. > Unfortunately, this method uses a `SortAggregate` for some data types, which > is not always necessary and might impact performances. Is this the desired > behavior? 
> > > {code:java} > Current behavior: > val df = Seq((0, "value")).toDF("key", "value") > df.groupBy("key").agg(first("value")).explain() > /* > == Physical Plan == > SortAggregate(key=key#342, functions=first(value#343, false)) > +- *(2) Sort key#342 ASC NULLS FIRST, false, 0 > +- Exchange hashpartitioning(key#342, 200) > +- SortAggregate(key=key#342, functions=partial_first(value#343, > false)) > +- *(1) Sort key#342 ASC NULLS FIRST, false, 0 > +- LocalTableScan key#342, value#343 > */ > {code} > > My understanding of the source code does not allow me to fully understand why > this is the current behavior. > The solution might be to implement a new aggregate function. But the code > would be highly similar to the first one. And if I don't fully understand why > this > [createAggregate|https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala#L45] > method falls back to SortAggregate. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
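To see the planner choice described above in practice, one can compare first (not a TypedImperativeAggregate) with collect_list (which does implement it) on a string column. A sketch for spark-shell; the exact plan text varies by Spark version, so the plan comments below are abbreviated and indicative only:

{code:java}
// spark-shell session (spark.implicits._ and functions._ are pre-imported there)
val df = Seq(("a", "x")).toDF("key", "value")

// first() on a string column: not a TypedImperativeAggregate, so the plan
// falls back to SortAggregate.
df.groupBy("key").agg(first("value")).explain()
// == Physical Plan ==
// SortAggregate(...) ...

// collect_list() implements TypedImperativeAggregate, so the same grouping
// can use ObjectHashAggregate (with the default
// spark.sql.execution.useObjectHashAggregateExec=true).
df.groupBy("key").agg(collect_list("value")).explain()
// == Physical Plan ==
// ObjectHashAggregate(...) ...
{code}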
[jira] [Commented] (SPARK-33744) Canonicalization error in SortAggregate
[ https://issues.apache.org/jira/browse/SPARK-33744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248574#comment-17248574 ] Pablo Langa Blanco commented on SPARK-33744: I have seen the problem, but I don't have enough knowledge to give a solution. Sorry > Canonicalization error in SortAggregate > --- > > Key: SPARK-33744 > URL: https://issues.apache.org/jira/browse/SPARK-33744 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Andy Grove >Priority: Minor > > The canonicalization plan for a simple aggregate query is different each time > for SortAggregate but not for HashAggregate. > The issue can be demonstrated by adding the following unit tests to > SQLQuerySuite. The HashAggregate test passes and the SortAggregate test fails. > The first test has numeric input and the second test is operating on strings, > which forces the use of SortAggregate rather than HashAggregate. > {code:java} > test("HashAggregate canonicalization") { > val data = Seq((1, 1)).toDF("c0", "c1") > val df1 = data.groupBy(col("c0")).agg(first("c1")) > val df2 = data.groupBy(col("c0")).agg(first("c1")) > assert(df1.queryExecution.executedPlan.canonicalized == > df2.queryExecution.executedPlan.canonicalized) > } > test("SortAggregate canonicalization") { > val data = Seq(("a", "a")).toDF("c0", "c1") > val df1 = data.groupBy(col("c0")).agg(first("c1")) > val df2 = data.groupBy(col("c0")).agg(first("c1")) > assert(df1.queryExecution.executedPlan.canonicalized == > df2.queryExecution.executedPlan.canonicalized) > } {code} > The SortAggregate test fails with the following output . 
> {code:java} > SortAggregate(key=[none#0], functions=[first(none#0, false)], output=[none#0, > #1]) > +- *(2) Sort [none#0 ASC NULLS FIRST], false, 0 >+- Exchange hashpartitioning(none#0, 5), ENSURE_REQUIREMENTS, [id=#105] > +- SortAggregate(key=[none#0], functions=[partial_first(none#1, > false)], output=[none#0, none#2, none#3]) > +- *(1) Sort [none#0 ASC NULLS FIRST], false, 0 > +- *(1) Project [none#0 AS #0, none#1 AS #1] >+- *(1) LocalTableScan [none#0, none#1] > did not equal > SortAggregate(key=[none#0], functions=[first(none#0, false)], output=[none#0, > #1]) > +- *(2) Sort [none#0 ASC NULLS FIRST], false, 0 >+- Exchange hashpartitioning(none#0, 5), ENSURE_REQUIREMENTS, [id=#148] > +- SortAggregate(key=[none#0], functions=[partial_first(none#1, > false)], output=[none#0, none#2, none#3]) > +- *(1) Sort [none#0 ASC NULLS FIRST], false, 0 > +- *(1) Project [none#0 AS #0, none#1 AS #1] >+- *(1) LocalTableScan [none#0, none#1] {code} > The error is caused by the resultExpression for the aggregate function being > assigned a new ExprId in the final aggregate. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33744) Canonicalization error in SortAggregate
[ https://issues.apache.org/jira/browse/SPARK-33744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248417#comment-17248417 ] Pablo Langa Blanco commented on SPARK-33744: I'm taking a look at it, thanks for reporting. > Canonicalization error in SortAggregate > --- > > Key: SPARK-33744 > URL: https://issues.apache.org/jira/browse/SPARK-33744 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Andy Grove >Priority: Minor > > The canonicalization plan for a simple aggregate query is different each time > for SortAggregate but not for HashAggregate. > The issue can be demonstrated by adding the following unit tests to > SQLQuerySuite. The HashAggregate test passes and the SortAggregate test fails. > The first test has numeric input and the second test is operating on strings, > which forces the use of SortAggregate rather than HashAggregate. > {code:java} > test("HashAggregate canonicalization") { > val data = Seq((1, 1)).toDF("c0", "c1") > val df1 = data.groupBy(col("c0")).agg(first("c1")) > val df2 = data.groupBy(col("c0")).agg(first("c1")) > assert(df1.queryExecution.executedPlan.canonicalized == > df2.queryExecution.executedPlan.canonicalized) > } > test("SortAggregate canonicalization") { > val data = Seq(("a", "a")).toDF("c0", "c1") > val df1 = data.groupBy(col("c0")).agg(first("c1")) > val df2 = data.groupBy(col("c0")).agg(first("c1")) > assert(df1.queryExecution.executedPlan.canonicalized == > df2.queryExecution.executedPlan.canonicalized) > } {code} > The SortAggregate test fails with the following output . 
> {code:java} > SortAggregate(key=[none#0], functions=[first(none#0, false)], output=[none#0, > #1]) > +- *(2) Sort [none#0 ASC NULLS FIRST], false, 0 >+- Exchange hashpartitioning(none#0, 5), ENSURE_REQUIREMENTS, [id=#105] > +- SortAggregate(key=[none#0], functions=[partial_first(none#1, > false)], output=[none#0, none#2, none#3]) > +- *(1) Sort [none#0 ASC NULLS FIRST], false, 0 > +- *(1) Project [none#0 AS #0, none#1 AS #1] >+- *(1) LocalTableScan [none#0, none#1] > did not equal > SortAggregate(key=[none#0], functions=[first(none#0, false)], output=[none#0, > #1]) > +- *(2) Sort [none#0 ASC NULLS FIRST], false, 0 >+- Exchange hashpartitioning(none#0, 5), ENSURE_REQUIREMENTS, [id=#148] > +- SortAggregate(key=[none#0], functions=[partial_first(none#1, > false)], output=[none#0, none#2, none#3]) > +- *(1) Sort [none#0 ASC NULLS FIRST], false, 0 > +- *(1) Project [none#0 AS #0, none#1 AS #1] >+- *(1) LocalTableScan [none#0, none#1] {code} > The error is caused by the resultExpression for the aggregate function being > assigned a new ExprId in the final aggregate. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33118) CREATE TEMPORARY TABLE fails with location
[ https://issues.apache.org/jira/browse/SPARK-33118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17212242#comment-17212242 ] Pablo Langa Blanco commented on SPARK-33118: I'm working on it > CREATE TEMPORARY TABLE fails with location > -- > > Key: SPARK-33118 > URL: https://issues.apache.org/jira/browse/SPARK-33118 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Pablo Langa Blanco >Priority: Major > > The problem is produced when you use CREATE TEMPORARY TABLE with LOCATION > > {code:java} > spark.range(3).write.parquet("/data/tmp/testspark1") > spark.sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path > '/data/tmp/testspark1')") > spark.sql("CREATE TEMPORARY TABLE t USING parquet LOCATION > '/data/tmp/testspark1'") > {code} > The error message in both cases is > {code:java} > org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. > It must be specified manually.; > at > org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408) > at > org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:94) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229) > at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618) > at > 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616) > at org.apache.spark.sql.Dataset.(Dataset.scala:229) > at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97) > at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
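Until this is resolved, a possible workaround (untested here, and assuming the temporary-view code path is unaffected by this bug) is to create a temporary view over the same files, which goes through the same CreateTempViewUsing command but with the documented syntax:

{code:java}
spark.range(3).write.parquet("/data/tmp/testspark1")

// Temporary view over the same parquet files; schema inference from the
// path option is expected to work here.
spark.sql("CREATE TEMPORARY VIEW t USING parquet OPTIONS (path '/data/tmp/testspark1')")
spark.sql("SELECT * FROM t").show()
{code}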
[jira] [Updated] (SPARK-33118) CREATE TEMPORARY TABLE fails with location
[ https://issues.apache.org/jira/browse/SPARK-33118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pablo Langa Blanco updated SPARK-33118: --- Description: The problem is produced when you use CREATE TEMPORARY TABLE with LOCATION {code:java} spark.range(3).write.parquet("/data/tmp/testspark1") spark.sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path '/data/tmp/testspark1')") spark.sql("CREATE TEMPORARY TABLE t USING parquet LOCATION '/data/tmp/testspark1'") {code} The error message in both cases is {code:java} org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.; at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408) at org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:94) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616) at org.apache.spark.sql.Dataset.(Dataset.scala:229) at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602) {code} was: The problem is produced when you use CREATE TEMPORARY TABLE with LOCATION {code:java} spark.range(3).write.parquet("/data/tmp/testspark1") spark.sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path '/data/tmp/testspark1')") spark.sql("CREATE TEMPORARY TABLE t USING parquet LOCATION '/data/tmp/testspark1'") {code} The error message in both cases is {code:java} org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. 
It must be specified manually.; at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408) at org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:94) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616) at org.apache.spark.sql.Dataset.(Dataset.scala:229) at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607) at
[jira] [Created] (SPARK-33118) CREATE TEMPORARY TABLE fails with location
Pablo Langa Blanco created SPARK-33118: -- Summary: CREATE TEMPORARY TABLE fails with location Key: SPARK-33118 URL: https://issues.apache.org/jira/browse/SPARK-33118 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.1, 3.0.0 Reporter: Pablo Langa Blanco The problem is produced when you use CREATE TEMPORARY TABLE with LOCATION {code:java} spark.range(3).write.parquet("/data/tmp/testspark1") spark.sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path '/data/tmp/testspark1')") spark.sql("CREATE TEMPORARY TABLE t USING parquet LOCATION '/data/tmp/testspark1'") {code} The error message in both cases is {code:java} org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.; at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408) at org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:94) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) at 
org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616) at org.apache.spark.sql.Dataset.(Dataset.scala:229) at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls
[ https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203733#comment-17203733 ] Pablo Langa Blanco commented on SPARK-32046: When you cache a DataFrame, the name of the DataFrame is not saved; what is saved is a simplified (canonicalized) plan. Then another, different DataFrame with the same canonicalized plan will reuse the cached one. In our example: {code:java} scala> val df1 = spark.range(1).select(current_timestamp as "datetime") scala> df1.queryExecution.analyzed.canonicalized res0: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Project [current_timestamp() AS #0] +- Range (0, 1, step=1, splits=Some(4)) scala> val df2 = spark.range(1).select(current_timestamp as "datetime") scala> df2.queryExecution.analyzed.canonicalized Project [current_timestamp() AS #0] +- Range (0, 1, step=1, splits=Some(4)) scala> df1.queryExecution.analyzed.canonicalized == df2.queryExecution.analyzed.canonicalized res2: Boolean = true scala> df2.explain(true) == Optimized Logical Plan == Project [160136303378 AS datetime#6] +- Range (0, 1, step=1, splits=Some(4)) scala> df1.cache scala> df2.explain(true) == Optimized Logical Plan == InMemoryRelation [datetime#6], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas) +- *(1) Project [1601363046569000 AS datetime#2] +- *(1) Range (0, 1, step=1, splits=4) {code} What do you mean by the Java implementation? And what is your actual workaround? 
Just to check whether I am missing something. > current_timestamp called in a cache dataframe freezes the time for all future > calls > --- > > Key: SPARK-32046 > URL: https://issues.apache.org/jira/browse/SPARK-32046 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.4, 3.0.0 >Reporter: Dustin Smith >Priority: Minor > Labels: caching, sql, time > > If I call current_timestamp 3 times while caching the dataframe variable in > order to freeze that dataframe's time, the 3rd dataframe time and beyond > (4th, 5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe > and the 2nd will differ in time but will become static on the 3rd usage and > beyond (when running on Zeppelin or Jupyter). > Additionally, caching only caused 2 dataframes to cache skipping the 3rd. > However, > {code:java} > val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache > df.count > // this can be run 3 times no issue. > // then later cast to TimestampType{code} > doesn't have this problem and all 3 dataframes cache with correct times > displaying. > Running the code in shell and Jupyter or Zeppelin (ZP) also produces > different results. In the shell, you only get 1 unique time no matter how > many times you run it, current_timestamp. However, in ZP or Jupyter I have > always received 2 unique times before it froze. > > {code:java} > val df1 = spark.range(1).select(current_timestamp as "datetime").cache > df1.count > df1.show(false) > Thread.sleep(9500) > val df2 = spark.range(1).select(current_timestamp as "datetime").cache > df2.count > df2.show(false) > Thread.sleep(9500) > val df3 = spark.range(1).select(current_timestamp as "datetime").cache > df3.count > df3.show(false){code}
[jira] [Commented] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls
[ https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203345#comment-17203345 ] Pablo Langa Blanco commented on SPARK-32046: [~smilegator] why are we computing current time in the optimizer? I have seen this comment in the Optimizer code: {code:java} // Technically some of the rules in Finish Analysis are not optimizer rules and belong more // in the analyzer, because they are needed for correctness (e.g. ComputeCurrentTime). // However, because we also use the analyzer to canonicalized queries (for view definition), // we do not eliminate subqueries or compute current time in the analyzer. {code} For the specific case of caching: shouldn't two plans that each contain a current_timestamp be considered different plans, with different canonicalized plans? I assume there is a strong reason for the current behavior, but in this case it results in strange behavior. Thanks! > current_timestamp called in a cache dataframe freezes the time for all future > calls > --- > > Key: SPARK-32046 > URL: https://issues.apache.org/jira/browse/SPARK-32046 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.4, 3.0.0 >Reporter: Dustin Smith >Priority: Minor > Labels: caching, sql, time > > If I call current_timestamp 3 times while caching the dataframe variable in > order to freeze that dataframe's time, the 3rd dataframe time and beyond > (4th, 5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe > and the 2nd will differ in time but will become static on the 3rd usage and > beyond (when running on Zeppelin or Jupyter). > Additionally, caching only caused 2 dataframes to cache skipping the 3rd. > However, > {code:java} > val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache > df.count > // this can be run 3 times no issue. 
> // then later cast to TimestampType{code} > doesn't have this problem and all 3 dataframes cache with correct times > displaying. > Running the code in shell and Jupyter or Zeppelin (ZP) also produces > different results. In the shell, you only get 1 unique time no matter how > many times you run it, current_timestamp. However, in ZP or Jupyter I have > always received 2 unique times before it froze. > > {code:java} > val df1 = spark.range(1).select(current_timestamp as "datetime").cache > df1.count > df1.show(false) > Thread.sleep(9500) > val df2 = spark.range(1).select(current_timestamp as "datetime").cache > df2.count > df2.show(false) > Thread.sleep(9500) > val df3 = spark.range(1).select(current_timestamp as "datetime").cache > df3.count > df3.show(false){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls
[ https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203040#comment-17203040 ] Pablo Langa Blanco edited comment on SPARK-32046 at 9/28/20, 4:20 PM: -- I have been looking at the problem and I think I have understood the problem. What I don't understand is why Jupyter and ZP behave differently. When a dataframe is cached, the key to the map that stores the cached objects is a plan. (df1.queryExecution.analyzed.canonicalized). Then, in the second execution, when you go to check if the dataframe is cached you do the following check. {code:java} df1.queryExecution.analyzed.canonicalized == df2.queryExecution.analyzed.canonicalized{code} In this case, both execution plans are the same so it considers that it has the Dataframe cached and uses it It seems a rather strange case in real life to have 2 identical Dataframes, one cached and one not, have a timestamp and do not want to reuse the cached Dataframe was (Author: planga82): I have been looking at the problem and I think I have understood the problem. What I don't understand is why Jupyter and ZP behave differently. When a dataframe is cached, the key to the map that stores the cached objects is a plan. (df1.queryExecution.analyzed.canonicalized). Then, in the second execution, when you go to check if the dataframe is cached you do the following check. 
df1.queryExecution.analyzed.canonicalized == df2.queryExecution.analyzed.canonicalized In this case, both execution plans are the same so it considers that it has the Dataframe cached and uses it It seems a rather strange case in real life to have 2 identical Dataframes, one cached and one not, have a timestamp and do not want to reuse the cached Dataframe > current_timestamp called in a cache dataframe freezes the time for all future > calls > --- > > Key: SPARK-32046 > URL: https://issues.apache.org/jira/browse/SPARK-32046 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.4, 3.0.0 >Reporter: Dustin Smith >Priority: Minor > Labels: caching, sql, time > > If I call current_timestamp 3 times while caching the dataframe variable in > order to freeze that dataframe's time, the 3rd dataframe time and beyond > (4th, 5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe > and the 2nd will differ in time but will become static on the 3rd usage and > beyond (when running on Zeppelin or Jupyter). > Additionally, caching only caused 2 dataframes to cache skipping the 3rd. > However, > {code:java} > val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache > df.count > // this can be run 3 times no issue. > // then later cast to TimestampType{code} > doesn't have this problem and all 3 dataframes cache with correct times > displaying. > Running the code in shell and Jupyter or Zeppelin (ZP) also produces > different results. In the shell, you only get 1 unique time no matter how > many times you run it, current_timestamp. However, in ZP or Jupyter I have > always received 2 unique times before it froze. 
> > {code:java} > val df1 = spark.range(1).select(current_timestamp as "datetime").cache > df1.count > df1.show(false) > Thread.sleep(9500) > val df2 = spark.range(1).select(current_timestamp as "datetime").cache > df2.count > df2.show(false) > Thread.sleep(9500) > val df3 = spark.range(1).select(current_timestamp as "datetime").cache > df3.count > df3.show(false){code}
[jira] [Commented] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls
[ https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203159#comment-17203159 ] Pablo Langa Blanco commented on SPARK-32046: Ok, I see, that makes sense. I'll think about a possible solution, but it seems difficult. A possible workaround is to cache a DataFrame with a different plan each time, something like this: {code:java} val df1 = spark.range(1).select(current_timestamp as "datetime").withColumn("name", lit("df1")) df1.cache val df2 = spark.range(1).select(current_timestamp as "datetime").withColumn("name", lit("df2")){code} > current_timestamp called in a cache dataframe freezes the time for all future > calls > --- > > Key: SPARK-32046 > URL: https://issues.apache.org/jira/browse/SPARK-32046 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.4, 3.0.0 >Reporter: Dustin Smith >Priority: Minor > Labels: caching, sql, time > > If I call current_timestamp 3 times while caching the dataframe variable in > order to freeze that dataframe's time, the 3rd dataframe time and beyond > (4th, 5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe > and the 2nd will differ in time but will become static on the 3rd usage and > beyond (when running on Zeppelin or Jupyter). > Additionally, caching only caused 2 dataframes to cache skipping the 3rd. > However, > {code:java} > val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache > df.count > // this can be run 3 times no issue. > // then later cast to TimestampType{code} > doesn't have this problem and all 3 dataframes cache with correct times > displaying. > Running the code in shell and Jupyter or Zeppelin (ZP) also produces > different results. In the shell, you only get 1 unique time no matter how > many times you run it, current_timestamp. However, in ZP or Jupyter I have > always received 2 unique times before it froze. 
> > {code:java} > val df1 = spark.range(1).select(current_timestamp as "datetime").cache > df1.count > df1.show(false) > Thread.sleep(9500) > val df2 = spark.range(1).select(current_timestamp as "datetime").cache > df2.count > df2.show(false) > Thread.sleep(9500) > val df3 = spark.range(1).select(current_timestamp as "datetime").cache > df3.count > df3.show(false){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls
[ https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203040#comment-17203040 ] Pablo Langa Blanco commented on SPARK-32046: I have been looking at this and I think I understand the problem, although I don't understand why Jupyter and ZP behave differently. When a DataFrame is cached, the key to the map that stores the cached objects is its canonicalized plan (df1.queryExecution.analyzed.canonicalized). Then, on the second execution, when Spark checks whether the DataFrame is cached, it performs the following comparison: df1.queryExecution.analyzed.canonicalized == df2.queryExecution.analyzed.canonicalized. In this case both execution plans are the same, so Spark considers the DataFrame already cached and reuses it. It seems a rather unusual case in practice to have two identical DataFrames, one cached and one not, each carrying a timestamp, and not want to reuse the cached one. > current_timestamp called in a cache dataframe freezes the time for all future > calls > --- > > Key: SPARK-32046 > URL: https://issues.apache.org/jira/browse/SPARK-32046 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.4, 3.0.0 >Reporter: Dustin Smith >Priority: Minor > Labels: caching, sql, time > > If I call current_timestamp 3 times while caching the dataframe variable in > order to freeze that dataframe's time, the 3rd dataframe time and beyond > (4th, 5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe > and the 2nd will differ in time but will become static on the 3rd usage and > beyond (when running on Zeppelin or Jupyter). > Additionally, caching only caused 2 dataframes to cache skipping the 3rd. > However, > {code:java} > val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache > df.count > // this can be run 3 times no issue. > // then later cast to TimestampType{code} > doesn't have this problem and all 3 dataframes cache with correct times > displaying. 
> Running the code in shell and Jupyter or Zeppelin (ZP) also produces > different results. In the shell, you only get 1 unique time no matter how > many times you run it, current_timestamp. However, in ZP or Jupyter I have > always received 2 unique times before it froze. > > {code:java} > val df1 = spark.range(1).select(current_timestamp as "datetime").cache > df1.count > df1.show(false) > Thread.sleep(9500) > val df2 = spark.range(1).select(current_timestamp as "datetime").cache > df2.count > df2.show(false) > Thread.sleep(9500) > val df3 = spark.range(1).select(current_timestamp as "datetime").cache > df3.count > df3.show(false){code}
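The cache-lookup behavior described in the comment can be sketched as follows. This is a minimal sketch, assuming a running SparkSession named `spark`; it only illustrates the plan comparison, not the internal cache map itself.

```scala
// Two DataFrames built from identical expressions canonicalize to the same
// plan, which is why the cache lookup treats df2 as already cached.
import org.apache.spark.sql.functions.{current_timestamp, lit}

val df1 = spark.range(1).select(current_timestamp as "datetime")
val df2 = spark.range(1).select(current_timestamp as "datetime")

// Same canonicalized plan -> same cache key, so df2 would reuse df1's
// cached data (per the analysis above, this should print true).
println(df1.queryExecution.analyzed.canonicalized ==
        df2.queryExecution.analyzed.canonicalized)

// Adding a distinguishing column changes the plan, giving a different key;
// this is the essence of the workaround in the other comment.
val df3 = spark.range(1).select(current_timestamp as "datetime")
  .withColumn("name", lit("df3"))
println(df1.queryExecution.analyzed.canonicalized ==
        df3.queryExecution.analyzed.canonicalized) // different plans
```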
[jira] [Commented] (SPARK-32463) Document Data Type inference rule in SQL reference
[ https://issues.apache.org/jira/browse/SPARK-32463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166168#comment-17166168 ] Pablo Langa Blanco commented on SPARK-32463: Thanks [~huaxingao], I'll take a look at this > Document Data Type inference rule in SQL reference > -- > > Key: SPARK-32463 > URL: https://issues.apache.org/jira/browse/SPARK-32463 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Priority: Minor > > Document Data Type inference rule in SQL reference, under Data Types section. > Please see this PR https://github.com/apache/spark/pull/28896
[jira] [Commented] (SPARK-32025) CSV schema inference with boolean & integer
[ https://issues.apache.org/jira/browse/SPARK-32025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17141213#comment-17141213 ] Pablo Langa Blanco commented on SPARK-32025: I'm looking into the problem. As a workaround, you can define the schema explicitly to avoid the bug in automatic schema inference. > CSV schema inference with boolean & integer > > > Key: SPARK-32025 > URL: https://issues.apache.org/jira/browse/SPARK-32025 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6 >Reporter: Brian Wallace >Priority: Major > > I have a dataset consisting of two small files in CSV format. > {code:bash} > $ cat /example/f0.csv > col1 > 8589934592 > $ cat /example/f1.csv > col1 > 4320 > true > {code} > > When I try and load this in (py)spark and infer schema, my expectation is > that the column is inferred to be a string. However, it is inferred as a > boolean: > {code:python} > spark.read.csv(path="file:///example/*.csv", header=True, inferSchema=True, > multiLine=True).show() > ++ > |col1| > ++ > |null| > |true| > |null| > ++ > {code} > Note that this seems to work correctly if multiLine is set to False (although > we need to set it to True as this column may indeed span multiple lines in > general).
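The suggested workaround can be sketched as follows. This is a minimal sketch assuming a running SparkSession `spark`; the file path and column name mirror the report.

```scala
// Workaround sketch: supply the schema explicitly instead of inferring it,
// so mixed values like 8589934592 and true all land in a StringType column.
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(Seq(StructField("col1", StringType, nullable = true)))

val df = spark.read
  .option("header", "true")
  .option("multiLine", "true")
  .schema(schema)            // no inferSchema: every value is kept as a string
  .csv("file:///example/*.csv")
```

With an explicit schema, no rows are coerced to null by a wrongly inferred boolean type; any later cast (e.g. to LongType) can be applied per column once the data is loaded.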
[jira] [Commented] (SPARK-31779) Redefining struct inside array incorrectly wraps child fields in array
[ https://issues.apache.org/jira/browse/SPARK-31779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119319#comment-17119319 ] Pablo Langa Blanco commented on SPARK-31779: How about using arrays_zip? {code:java} // code placeholder val newTop = struct(df("top").getField("child1").alias("child1"), arrays_zip(df("top").getField("child2").getField("child2Second"),df("top").getField("child2").getField("child2Third")).alias("child2")) {code} > Redefining struct inside array incorrectly wraps child fields in array > -- > > Key: SPARK-31779 > URL: https://issues.apache.org/jira/browse/SPARK-31779 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Jeff Evans >Priority: Major > > It seems that redefining a {{struct}} for the purpose of removing a > sub-field, when that {{struct}} is itself inside an {{array}}, results in the > remaining (non-removed) {{struct}} fields themselves being incorrectly > wrapped in an array. > For more context, see [this|https://stackoverflow.com/a/46084983/375670] > StackOverflow answer and discussion thread. I have debugged this code and > distilled it down to what I believe represents a bug in Spark itself. 
> Consider the following {{spark-shell}} session (version 2.4.5): > {code} > // use a nested JSON structure that contains a struct inside an array > val jsonData = """{ > "foo": "bar", > "top": { > "child1": 5, > "child2": [ > { > "child2First": "one", > "child2Second": 2 > } > ] > } > }""" > // read into a DataFrame > val df = spark.read.option("multiline", "true").json(Seq(jsonData).toDS()) > // create a new definition for "top", which will remove the > "top.child2.child2First" column > val newTop = struct(df("top").getField("child1").alias("child1"), > array(struct(df("top").getField("child2").getField("child2Second").alias("child2Second"))).alias("child2")) > // show the schema before and after swapping out the struct definition > df.schema.toDDL > // `foo` STRING,`top` STRUCT<`child1`: BIGINT, `child2`: > ARRAY>> > df.withColumn("top", newTop).schema.toDDL > // `foo` STRING,`top` STRUCT<`child1`: BIGINT, `child2`: > ARRAY>>> > {code} > Notice in this case that the new definition for {{top.child2.child2Second}} > is an {{ARRAY}}. This is incorrect; it should simply be {{BIGINT}}. > There is nothing in the definition of the {{newTop}} {{struct}} that should > have caused the type to become wrapped in an array like this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31779) Redefining struct inside array incorrectly wraps child fields in array
[ https://issues.apache.org/jira/browse/SPARK-31779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17114866#comment-17114866 ] Pablo Langa Blanco commented on SPARK-31779: Hi [~jeff.w.evans], I think this is not a bug; it is the correct behavior. Your input has this structure: {code:java} root |-- foo: string (nullable = true) |-- top: struct (nullable = true) | |-- child1: long (nullable = true) | |-- child2: array (nullable = true) | | |-- element: struct (containsNull = true) | | | |-- child2First: string (nullable = true) | | | |-- child2Second: long (nullable = true) {code} When you call df("top").getField("child2") you get an array whose elements are structs: {code:java} child2: array (nullable = true) |-- element: struct (containsNull = true) | |-- child2First: string (nullable = true) | |-- child2Second: long (nullable = true) {code} When you then call .getField("child2Second") on this structure you are accessing a field of the internal struct, but since that struct lives inside an array, the result is an array of the selected field: {code:java} child2: array (nullable = true) |-- child2Second: long (nullable = true) {code} I think this example makes it clearer: {code:java} val jsonData = """{ | "foo": "bar", | "top": { | "child1": 5, | "child2": [ | { | "child2First": "one", | "child2Second": 2 | }, | { | "child2First": "", | "child2Second": 3 | } | ] | } | }""" val df = spark.read.option("multiline", "true").json(Seq(jsonData).toDS()) val newTop = df("top").getField("child2").getField("child2Second") df.withColumn("top2", newTop).printSchema root |-- foo: string (nullable = true) |-- top: struct (nullable = true) | |-- child1: long (nullable = true) | |-- child2: array (nullable = true) | | |-- element: struct (containsNull = true) | | | |-- child2First: string (nullable = true) | | | |-- child2Second: long (nullable = true) |-- top2: array (nullable = true) | |-- element: long (containsNull = true) 
df.withColumn("top2", newTop).show(truncate=false) +---+--+--+ |foo|top |top2 | +---+--+--+ |bar|[5, [[one, 2], [, 3]]]|[2, 3]| +---+--+--+ {code} Looking at the code it is explicitly designed that way, so if you agree I think we should close the issue > Redefining struct inside array incorrectly wraps child fields in array > -- > > Key: SPARK-31779 > URL: https://issues.apache.org/jira/browse/SPARK-31779 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Jeff Evans >Priority: Major > > It seems that redefining a {{struct}} for the purpose of removing a > sub-field, when that {{struct}} is itself inside an {{array}}, results in the > remaining (non-removed) {{struct}} fields themselves being incorrectly > wrapped in an array. > For more context, see [this|https://stackoverflow.com/a/46084983/375670] > StackOverflow answer and discussion thread. I have debugged this code and > distilled it down to what I believe represents a bug in Spark itself. > Consider the following {{spark-shell}} session (version 2.4.5): > {code} > // use a nested JSON structure that contains a struct inside an array > val jsonData = """{ > "foo": "bar", > "top": { > "child1": 5, > "child2": [ > { > "child2First": "one", > "child2Second": 2 > } > ] > } > }""" > // read into a DataFrame > val df = spark.read.option("multiline", "true").json(Seq(jsonData).toDS()) > // create a new definition for "top", which will remove the > "top.child2.child2First" column > val newTop = struct(df("top").getField("child1").alias("child1"), > array(struct(df("top").getField("child2").getField("child2Second").alias("child2Second"))).alias("child2")) > // show the schema before and after swapping out the struct definition > df.schema.toDDL > // `foo` STRING,`top` STRUCT<`child1`: BIGINT, `child2`: > ARRAY>> > df.withColumn("top", newTop).schema.toDDL > // `foo` STRING,`top` STRUCT<`child1`: BIGINT, `child2`: > ARRAY>>> > {code} > Notice in this case that the new definition for 
{{top.child2.child2Second}} > is an {{ARRAY}}. This is incorrect; it should simply be {{BIGINT}}. > There is nothing in the definition of the {{newTop}} {{struct}} that should > have caused the type to become wrapped in an array like this.
[jira] [Issue Comment Deleted] (SPARK-31561) Add QUALIFY Clause
[ https://issues.apache.org/jira/browse/SPARK-31561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pablo Langa Blanco updated SPARK-31561: --- Comment: was deleted (was: I'm working on this) > Add QUALIFY Clause > -- > > Key: SPARK-31561 > URL: https://issues.apache.org/jira/browse/SPARK-31561 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > In a SELECT statement, the QUALIFY clause filters the results of window > functions. > QUALIFY does with window functions what HAVING does with aggregate functions > and GROUP BY clauses. > In the execution order of a query, QUALIFY is therefore evaluated after > window functions are computed. > Examples: > https://docs.snowflake.com/en/sql-reference/constructs/qualify.html#examples > More details: > https://docs.snowflake.com/en/sql-reference/constructs/qualify.html > https://docs.teradata.com/reader/2_MC9vCtAJRlKle2Rpb0mA/19NnI91neorAi7LX6SJXBw
[jira] [Commented] (SPARK-31561) Add QUALIFY Clause
[ https://issues.apache.org/jira/browse/SPARK-31561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17097360#comment-17097360 ] Pablo Langa Blanco commented on SPARK-31561: I'm working on this > Add QUALIFY Clause > -- > > Key: SPARK-31561 > URL: https://issues.apache.org/jira/browse/SPARK-31561 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > In a SELECT statement, the QUALIFY clause filters the results of window > functions. > QUALIFY does with window functions what HAVING does with aggregate functions > and GROUP BY clauses. > In the execution order of a query, QUALIFY is therefore evaluated after > window functions are computed. > Examples: > https://docs.snowflake.com/en/sql-reference/constructs/qualify.html#examples > More details: > https://docs.snowflake.com/en/sql-reference/constructs/qualify.html > https://docs.teradata.com/reader/2_MC9vCtAJRlKle2Rpb0mA/19NnI91neorAi7LX6SJXBw
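For reference, QUALIFY is effectively syntactic sugar over a window function followed by a filter. A hedged sketch of the equivalent in the DataFrame API, assuming a running SparkSession `spark`; the table and column names are illustrative only:

```scala
// Equivalent of: SELECT store, amount FROM sales
//   QUALIFY row_number() OVER (PARTITION BY store ORDER BY amount DESC) = 1
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import spark.implicits._

val sales = Seq(("a", 10), ("a", 20), ("b", 5)).toDF("store", "amount")

val w = Window.partitionBy("store").orderBy($"amount".desc)
val top = sales
  .withColumn("rn", row_number().over(w)) // compute the window function first
  .filter($"rn" === 1)                    // then filter on its result (QUALIFY)
  .drop("rn")
top.show()
```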
[jira] [Commented] (SPARK-31574) Schema evolution in spark while using the storage format as parquet
[ https://issues.apache.org/jira/browse/SPARK-31574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096805#comment-17096805 ] Pablo Langa Blanco commented on SPARK-31574: Spark has functionality to read multiple Parquet files with different *compatible* schemas; it is disabled by default. [https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#schema-merging] The problem in the example you propose is that int and string are incompatible data types, so schema merging is not going to work. > Schema evolution in spark while using the storage format as parquet > --- > > Key: SPARK-31574 > URL: https://issues.apache.org/jira/browse/SPARK-31574 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: sharad Gupta >Priority: Major > > Hi Team, > > Use case: > Suppose there is a table T1 with column C1 with datatype as int in schema > version 1. In the first on boarding table T1. I wrote couple of parquet files > with this schema version 1 with underlying file format used parquet. > Now in schema version 2 the C1 column datatype changed to string from int. > Now It will write data with schema version 2 in parquet. > So some parquet files are written with schema version 1 and some written with > schema version 2. > Problem statement : > 1. We are not able to execute the below command from spark sql > ```Alter table Table T1 change C1 C1 string``` > 2. So as a solution i goto hive and alter the table change datatype because > it supported in hive then try to read the data in spark. 
So it is giving me > error > ``` > Caused by: java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary > at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:44) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToBinary(ParquetDictionary.java:51) > at > org.apache.spark.sql.execution.vectorized.WritableColumnVector.getUTF8String(WritableColumnVector.java:372) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at 
java.lang.Thread.run(Thread.java:745)``` > > 3. Suspecting that the underlying parquet file is written with integer type > and we are reading from a table whose column is changed to string type. So > that is why it is happening. > How you can reproduce this: > spark sql > 1. Create a table from spark sql with one column with datatype as int with > stored as parquet. > 2. Now put some data into table. > 3. Now you can see the data if you select from table. > Hive > 1. change datatype from int to string by alter command > 2. Now try to read data, You will be able to read the data here even after > changing the datatype. > spark sql > 1. Try to read data from here now you will see the error. > Now the question is how to solve schema evolution in spark while using the > storage format as parquet.
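The schema-merging feature mentioned in the comment can be sketched as follows (adapted from the example in the Spark SQL Parquet documentation; assumes a running SparkSession `spark` and a writable data/test_table directory):

```scala
// Two Parquet files whose schemas are *compatible*: same key column, plus
// one extra column each.
import spark.implicits._

(1 to 5).map(i => (i, i * i)).toDF("value", "square")
  .write.parquet("data/test_table/key=1")
(6 to 10).map(i => (i, i * i * i)).toDF("value", "cube")
  .write.parquet("data/test_table/key=2")

// mergeSchema is off by default; enable it per read, or globally via
// spark.sql.parquet.mergeSchema=true.
val merged = spark.read.option("mergeSchema", "true").parquet("data/test_table")
merged.printSchema()
// The merged schema contains value, square, cube (plus the partition column).
// Had one file declared `value` as INT and the other as STRING, the merge
// would fail: those types are incompatible, which is the situation reported
// in this ticket.
```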
[jira] [Commented] (SPARK-31566) Add SQL Rest API Documentation
[ https://issues.apache.org/jira/browse/SPARK-31566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096188#comment-17096188 ] Pablo Langa Blanco commented on SPARK-31566: Ok, perfect, thank you! > Add SQL Rest API Documentation > -- > > Key: SPARK-31566 > URL: https://issues.apache.org/jira/browse/SPARK-31566 > Project: Spark > Issue Type: Documentation > Components: Documentation, SQL >Affects Versions: 3.1.0 >Reporter: Eren Avsarogullari >Priority: Major > > SQL Rest API exposes query execution metrics as Public API. Its documentation > will be useful for end-users. > {code:java} > /applications/[app-id]/sql > 1- A list of all queries for a given application. > 2- ?details=[true|false (default)] lists metric details in addition to > queries details. > 3- ?offset=[offset]=[len] lists queries in the given range.{code} > {code:java} > /applications/[app-id]/sql/[execution-id] > 1- Details for the given query. > 2- ?details=[true|false (default)] lists metric details in addition to given > query details.{code}
[jira] [Commented] (SPARK-31566) Add SQL Rest API Documentation
[ https://issues.apache.org/jira/browse/SPARK-31566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17095881#comment-17095881 ] Pablo Langa Blanco commented on SPARK-31566: I'm taking a look at this. I think it is useful to add this documentation > Add SQL Rest API Documentation > -- > > Key: SPARK-31566 > URL: https://issues.apache.org/jira/browse/SPARK-31566 > Project: Spark > Issue Type: Documentation > Components: Documentation, SQL >Affects Versions: 3.1.0 >Reporter: Eren Avsarogullari >Priority: Major > > SQL Rest API exposes query execution metrics as Public API. Its documentation > will be useful for end-users. > {code:java} > /applications/[app-id]/sql > 1- A list of all queries for a given application. > 2- ?details=[true|false (default)] lists metric details in addition to > queries details. > 3- ?offset=[offset]=[len] lists queries in the given range.{code} > {code:java} > /applications/[app-id]/sql/[execution-id] > 1- Details for the given query. > 2- ?details=[true|false (default)] lists metric details in addition to given > query details.{code}
[jira] [Commented] (SPARK-31500) collect_set() of BinaryType returns duplicate elements
[ https://issues.apache.org/jira/browse/SPARK-31500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17092818#comment-17092818 ] Pablo Langa Blanco commented on SPARK-31500: [https://github.com/apache/spark/pull/28351] > collect_set() of BinaryType returns duplicate elements > -- > > Key: SPARK-31500 > URL: https://issues.apache.org/jira/browse/SPARK-31500 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 2.4.5 >Reporter: Eric Wasserman >Priority: Major > > The collect_set() aggregate function should produce a set of distinct > elements. When the column argument's type is BinaryType this is not the case. > > Example: > {{import org.apache.spark.sql.functions._}} > {{import org.apache.spark.sql.expressions.Window}} > {{case class R(id: String, value: String, bytes: Array[Byte])}} > {{def makeR(id: String, value: String) = R(id, value, value.getBytes)}} > {{val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), > makeR("b", "fish")).toDF()}} > > {{// In the example below "bytesSet" erroneously has duplicates but > "stringSet" does not (as expected).}} > {{df.agg(collect_set('value) as "stringSet", collect_set('bytes) as > "byteSet").show(truncate=false)}} > > {{// The same problem is displayed when using window functions.}} > {{val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, > Window.unboundedFollowing)}} > {{val result = df.select(}} > collect_set('value).over(win) as "stringSet", > collect_set('bytes).over(win) as "bytesSet" > {{)}} > {{.select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", > size('bytesSet) as "bytesSetSize")}} > {{.show()}}
[jira] [Commented] (SPARK-31500) collect_set() of BinaryType returns duplicate elements
[ https://issues.apache.org/jira/browse/SPARK-31500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091833#comment-17091833 ] Pablo Langa Blanco commented on SPARK-31500: Hi [~ewasserman], this is a Scala base-library problem: equality between arrays does not behave as expected. [https://blog.bruchez.name/2013/05/scala-array-comparison-without-phd.html] I'm going to work on finding a solution, but here is a workaround: change the definition of the case class to use Seq instead of Array, and it will work as expected. {code:java} case class R(id: String, value: String, bytes: Seq[Byte]){code} > collect_set() of BinaryType returns duplicate elements > -- > > Key: SPARK-31500 > URL: https://issues.apache.org/jira/browse/SPARK-31500 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 2.4.5 >Reporter: Eric Wasserman >Priority: Major > > The collect_set() aggregate function should produce a set of distinct > elements. When the column argument's type is BinaryType this is not the case. 
> > Example: > {{import org.apache.spark.sql.functions._}} > {{import org.apache.spark.sql.expressions.Window}} > {{case class R(id: String, value: String, bytes: Array[Byte])}} > {{def makeR(id: String, value: String) = R(id, value, value.getBytes)}} > {{val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), > makeR("b", "fish")).toDF()}} > > {{// In the example below "bytesSet" erroneously has duplicates but > "stringSet" does not (as expected).}} > {{df.agg(collect_set('value) as "stringSet", collect_set('bytes) as > "byteSet").show(truncate=false)}} > > {{// The same problem is displayed when using window functions.}} > {{val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, > Window.unboundedFollowing)}} > {{val result = df.select(}} > collect_set('value).over(win) as "stringSet", > collect_set('bytes).over(win) as "bytesSet" > {{)}} > {{.select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", > size('bytesSet) as "bytesSetSize")}} > {{.show()}}
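The underlying Scala behavior referenced in the comment can be sketched in plain Scala, independent of Spark:

```scala
// Array uses Java array semantics: == is reference equality, so two arrays
// with identical contents never compare equal. collect_set deduplicates with
// equals/hashCode, which is why Array[Byte] values are never seen as
// duplicates. Seq, by contrast, compares element-wise.
val a1 = Array[Byte](1, 2, 3)
val a2 = Array[Byte](1, 2, 3)
println(a1 == a2)            // false: different references
println(a1.sameElements(a2)) // true: contents match

val s1 = Seq[Byte](1, 2, 3)
val s2 = Seq[Byte](1, 2, 3)
println(s1 == s2)            // true: structural equality, so dedup works
```

This is why swapping `Array[Byte]` for `Seq[Byte]` in the case class makes collect_set behave as expected.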
[jira] [Created] (SPARK-31435) Add SPARK_JARS_DIR environment variable (new) to Spark configuration documentation
Pablo Langa Blanco created SPARK-31435: -- Summary: Add SPARK_JARS_DIR environment variable (new) to Spark configuration documentation Key: SPARK-31435 URL: https://issues.apache.org/jira/browse/SPARK-31435 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 3.1.0 Reporter: Pablo Langa Blanco Related to SPARK-31432. That issue introduces a new environment variable, which is documented in this issue.
[jira] [Commented] (SPARK-31435) Add SPARK_JARS_DIR environment variable (new) to Spark configuration documentation
[ https://issues.apache.org/jira/browse/SPARK-31435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17082233#comment-17082233 ] Pablo Langa Blanco commented on SPARK-31435: I'm working on this > Add SPARK_JARS_DIR environment variable (new) to Spark configuration > documentation > - > > Key: SPARK-31435 > URL: https://issues.apache.org/jira/browse/SPARK-31435 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.1.0 >Reporter: Pablo Langa Blanco >Priority: Minor > > Related to SPARK-31432 > That issue introduces a new environment variable that is documented in this > issue
[jira] [Commented] (SPARK-30077) create TEMPORARY VIEW USING should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-30077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16995945#comment-16995945 ] Pablo Langa Blanco commented on SPARK-30077: [~huaxingao] Can we close this ticket? Reading the comments, I understand we are not going to make this change. Thanks > create TEMPORARY VIEW USING should look up catalog/table like v2 commands > - > > Key: SPARK-30077 > URL: https://issues.apache.org/jira/browse/SPARK-30077 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Priority: Major > > create TEMPORARY VIEW USING should look up catalog/table like v2 commands
[jira] [Commented] (SPARK-29563) CREATE TABLE LIKE should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16995942#comment-16995942 ] Pablo Langa Blanco commented on SPARK-29563: [~dkbiswal] Are you still working on this? If not, I can continue. Thanks > CREATE TABLE LIKE should look up catalog/table like v2 commands > --- > > Key: SPARK-29563 > URL: https://issues.apache.org/jira/browse/SPARK-29563 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dilip Biswal >Assignee: Dilip Biswal >Priority: Major >
[jira] [Commented] (SPARK-30040) DROP FUNCTION should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-30040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16982215#comment-16982215 ] Pablo Langa Blanco commented on SPARK-30040: I'm working on this > DROP FUNCTION should look up catalog/table like v2 commands > --- > > Key: SPARK-30040 > URL: https://issues.apache.org/jira/browse/SPARK-30040 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Pablo Langa Blanco >Priority: Major > > DROP FUNCTION should look up catalog/table like v2 commands
[jira] [Created] (SPARK-30040) DROP FUNCTION should look up catalog/table like v2 commands
Pablo Langa Blanco created SPARK-30040: -- Summary: DROP FUNCTION should look up catalog/table like v2 commands Key: SPARK-30040 URL: https://issues.apache.org/jira/browse/SPARK-30040 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Pablo Langa Blanco DROP FUNCTION should look up catalog/table like v2 commands
[jira] [Commented] (SPARK-30039) CREATE FUNCTION should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-30039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16982214#comment-16982214 ] Pablo Langa Blanco commented on SPARK-30039: I am working on this > CREATE FUNCTION should look up catalog/table like v2 commands > -- > > Key: SPARK-30039 > URL: https://issues.apache.org/jira/browse/SPARK-30039 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Pablo Langa Blanco >Priority: Major > > CREATE FUNCTION should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30039) CREATE FUNCTION should look up catalog/table like v2 commands
Pablo Langa Blanco created SPARK-30039: -- Summary: CREATE FUNCTION should look up catalog/table like v2 commands Key: SPARK-30039 URL: https://issues.apache.org/jira/browse/SPARK-30039 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Pablo Langa Blanco CREATE FUNCTION should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30038) DESCRIBE FUNCTION should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-30038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16982213#comment-16982213 ] Pablo Langa Blanco commented on SPARK-30038: I am working on this > DESCRIBE FUNCTION should look up catalog/table like v2 commands > --- > > Key: SPARK-30038 > URL: https://issues.apache.org/jira/browse/SPARK-30038 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Pablo Langa Blanco >Priority: Major > > DESCRIBE FUNCTION should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30038) DESCRIBE FUNCTION should look up catalog/table like v2 commands
Pablo Langa Blanco created SPARK-30038: -- Summary: DESCRIBE FUNCTION should look up catalog/table like v2 commands Key: SPARK-30038 URL: https://issues.apache.org/jira/browse/SPARK-30038 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Pablo Langa Blanco DESCRIBE FUNCTION should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29922) SHOW FUNCTIONS should look up catalog/table like v2 commands
Pablo Langa Blanco created SPARK-29922: -- Summary: SHOW FUNCTIONS should look up catalog/table like v2 commands Key: SPARK-29922 URL: https://issues.apache.org/jira/browse/SPARK-29922 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Pablo Langa Blanco SHOW FUNCTIONS should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29922) SHOW FUNCTIONS should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975272#comment-16975272 ] Pablo Langa Blanco commented on SPARK-29922: I'm working on this > SHOW FUNCTIONS should look up catalog/table like v2 commands > > > Key: SPARK-29922 > URL: https://issues.apache.org/jira/browse/SPARK-29922 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Pablo Langa Blanco >Priority: Major > > SHOW FUNCTIONS should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29829) SHOW TABLE EXTENDED should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16971191#comment-16971191 ] Pablo Langa Blanco commented on SPARK-29829: I am working on this > SHOW TABLE EXTENDED should look up catalog/table like v2 commands > - > > Key: SPARK-29829 > URL: https://issues.apache.org/jira/browse/SPARK-29829 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Pablo Langa Blanco >Priority: Major > > SHOW TABLE EXTENDED should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29829) SHOW TABLE EXTENDED should look up catalog/table like v2 commands
Pablo Langa Blanco created SPARK-29829: -- Summary: SHOW TABLE EXTENDED should look up catalog/table like v2 commands Key: SPARK-29829 URL: https://issues.apache.org/jira/browse/SPARK-29829 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Pablo Langa Blanco SHOW TABLE EXTENDED should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29523) SHOW COLUMNS should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16955422#comment-16955422 ] Pablo Langa Blanco commented on SPARK-29523: I am working on this > SHOW COLUMNS should look up catalog/table like v2 commands > -- > > Key: SPARK-29523 > URL: https://issues.apache.org/jira/browse/SPARK-29523 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Pablo Langa Blanco >Priority: Major > > SHOW COLUMNS should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29523) SHOW COLUMNS should look up catalog/table like v2 commands
Pablo Langa Blanco created SPARK-29523: -- Summary: SHOW COLUMNS should look up catalog/table like v2 commands Key: SPARK-29523 URL: https://issues.apache.org/jira/browse/SPARK-29523 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Pablo Langa Blanco SHOW COLUMNS should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29519) SHOW TBLPROPERTIES should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16955207#comment-16955207 ] Pablo Langa Blanco commented on SPARK-29519: I am working on this > SHOW TBLPROPERTIES should look up catalog/table like v2 commands > > > Key: SPARK-29519 > URL: https://issues.apache.org/jira/browse/SPARK-29519 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Pablo Langa Blanco >Priority: Major > > SHOW TBLPROPERTIES should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29519) SHOW TBLPROPERTIES should look up catalog/table like v2 commands
Pablo Langa Blanco created SPARK-29519: -- Summary: SHOW TBLPROPERTIES should look up catalog/table like v2 commands Key: SPARK-29519 URL: https://issues.apache.org/jira/browse/SPARK-29519 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Pablo Langa Blanco SHOW TBLPROPERTIES should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29433) Web UI Stages table tooltip correction
Pablo Langa Blanco created SPARK-29433: -- Summary: Web UI Stages table tooltip correction Key: SPARK-29433 URL: https://issues.apache.org/jira/browse/SPARK-29433 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 3.0.0 Reporter: Pablo Langa Blanco In the Web UI Stages table, the tooltips of the Input and Output columns are not correct. Current tooltip messages: * Bytes and records read from Hadoop or from Spark storage. * Bytes and records written to Hadoop. In these columns we are only showing bytes, not records. More information in the pull request.
[jira] [Created] (SPARK-29431) Improve Web UI / Sql tab visualization with cached dataframes.
Pablo Langa Blanco created SPARK-29431: -- Summary: Improve Web UI / Sql tab visualization with cached dataframes. Key: SPARK-29431 URL: https://issues.apache.org/jira/browse/SPARK-29431 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 3.0.0 Reporter: Pablo Langa Blanco When the Spark plan contains a cached dataframe, the full plan of the cached dataframe is not shown in the SQL tree. More info in the pull request.
[jira] [Created] (SPARK-29019) Improve tooltip information in JDBC/ODBC Server tab
Pablo Langa Blanco created SPARK-29019: -- Summary: Improve tooltip information in JDBC/ODBC Server tab Key: SPARK-29019 URL: https://issues.apache.org/jira/browse/SPARK-29019 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 3.0.0 Reporter: Pablo Langa Blanco Some of the columns of the JDBC/ODBC Server tab in the Web UI are hard to understand. We documented them in SPARK-28373, but I think it is better to have tooltips in the SQL statistics table to explain the columns. More information in the pull request.
[jira] [Commented] (SPARK-28372) Document Spark WEB UI
[ https://issues.apache.org/jira/browse/SPARK-28372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921026#comment-16921026 ] Pablo Langa Blanco commented on SPARK-28372: [~smilegator] We have a streaming tab section in the documentation, but it contains very little content. Do you think we should open a new issue to complete it? > Document Spark WEB UI > - > > Key: SPARK-28372 > URL: https://issues.apache.org/jira/browse/SPARK-28372 > Project: Spark > Issue Type: Umbrella > Components: Documentation, Web UI >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > Spark web UIs are used to monitor the status and resource consumption > of your Spark applications and clusters. However, we do not have the > corresponding documentation, so it is hard for end users to use and understand them.
[jira] [Commented] (SPARK-28373) Document JDBC/ODBC Server page
[ https://issues.apache.org/jira/browse/SPARK-28373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921019#comment-16921019 ] Pablo Langa Blanco commented on SPARK-28373: [~podongfeng] Ok, I can take care of it. > Document JDBC/ODBC Server page > -- > > Key: SPARK-28373 > URL: https://issues.apache.org/jira/browse/SPARK-28373 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Web UI >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > !https://user-images.githubusercontent.com/5399861/60809590-9dcf2500-a1bd-11e9-826e-33729bb97daf.png|width=1720,height=503! > > [https://github.com/apache/spark/pull/25062] added two new columns, CLOSE TIME > and EXECUTION TIME. It is hard to understand the difference. We need to > document them; otherwise, it is hard for end users to understand them.
[jira] [Commented] (SPARK-28542) Document Stages page
[ https://issues.apache.org/jira/browse/SPARK-28542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909985#comment-16909985 ] Pablo Langa Blanco commented on SPARK-28542: I would like to contribute to this if you are not working on it [~podongfeng] > Document Stages page > > > Key: SPARK-28542 > URL: https://issues.apache.org/jira/browse/SPARK-28542 > Project: Spark > Issue Type: Sub-task > Components: SQL, Web UI >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major >
[jira] [Commented] (SPARK-28543) Document Spark Jobs page
[ https://issues.apache.org/jira/browse/SPARK-28543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905988#comment-16905988 ] Pablo Langa Blanco commented on SPARK-28543: Hi [~podongfeng], I saw it when I made the pull request yesterday. The new information is integrated into your main page. Thanks!! > Document Spark Jobs page > > > Key: SPARK-28543 > URL: https://issues.apache.org/jira/browse/SPARK-28543 > Project: Spark > Issue Type: Sub-task > Components: SQL, Web UI >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major >
[jira] [Commented] (SPARK-28543) Document Spark Jobs page
[ https://issues.apache.org/jira/browse/SPARK-28543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899482#comment-16899482 ] Pablo Langa Blanco commented on SPARK-28543: Hi! I want to contribute to this. > Document Spark Jobs page > > > Key: SPARK-28543 > URL: https://issues.apache.org/jira/browse/SPARK-28543 > Project: Spark > Issue Type: Sub-task > Components: SQL, Web UI >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major >
[jira] [Closed] (SPARK-28585) Improve WebUI DAG information: Add extra info to rdd from spark plan
[ https://issues.apache.org/jira/browse/SPARK-28585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pablo Langa Blanco closed SPARK-28585. -- > Improve WebUI DAG information: Add extra info to rdd from spark plan > > > Key: SPARK-28585 > URL: https://issues.apache.org/jira/browse/SPARK-28585 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Web UI >Affects Versions: 3.0.0 >Reporter: Pablo Langa Blanco >Priority: Minor > > The main improvement I want to achieve is to help developers explore > the DAG information in the Web UI in complex flows. > Sometimes it is very difficult to know which part of your Spark plan corresponds > to the DAG you are looking at. > > This is an initial improvement for only one simple Spark plan type > (UnionExec). > If you consider it a good idea, I want to extend it to other Spark plans to > improve the visualization iteratively. > > More info in the pull request
[jira] [Resolved] (SPARK-28585) Improve WebUI DAG information: Add extra info to rdd from spark plan
[ https://issues.apache.org/jira/browse/SPARK-28585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pablo Langa Blanco resolved SPARK-28585. Resolution: Won't Fix > Improve WebUI DAG information: Add extra info to rdd from spark plan > > > Key: SPARK-28585 > URL: https://issues.apache.org/jira/browse/SPARK-28585 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Web UI >Affects Versions: 3.0.0 >Reporter: Pablo Langa Blanco >Priority: Minor > > The main improvement I want to achieve is to help developers explore > the DAG information in the Web UI in complex flows. > Sometimes it is very difficult to know which part of your Spark plan corresponds > to the DAG you are looking at. > > This is an initial improvement for only one simple Spark plan type > (UnionExec). > If you consider it a good idea, I want to extend it to other Spark plans to > improve the visualization iteratively. > > More info in the pull request
[jira] [Created] (SPARK-28585) Improve WebUI DAG information: Add extra info to rdd from spark plan
Pablo Langa Blanco created SPARK-28585: -- Summary: Improve WebUI DAG information: Add extra info to rdd from spark plan Key: SPARK-28585 URL: https://issues.apache.org/jira/browse/SPARK-28585 Project: Spark Issue Type: Improvement Components: Spark Core, Web UI Affects Versions: 3.0.0 Reporter: Pablo Langa Blanco The main improvement I want to achieve is to help developers explore the DAG information in the Web UI in complex flows. Sometimes it is very difficult to know which part of your Spark plan corresponds to the DAG you are looking at. This is an initial improvement for only one simple Spark plan type (UnionExec). If you consider it a good idea, I want to extend it to other Spark plans to improve the visualization iteratively. More info in the pull request
[jira] [Commented] (SPARK-26648) Noop Streaming Sink using DSV2
[ https://issues.apache.org/jira/browse/SPARK-26648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16745506#comment-16745506 ] Pablo Langa Blanco commented on SPARK-26648: Duplicate of SPARK-26649? > Noop Streaming Sink using DSV2 > -- > > Key: SPARK-26648 > URL: https://issues.apache.org/jira/browse/SPARK-26648 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > Extend this noop data source to support a streaming sink that ignores all the > elements.
[jira] [Commented] (SPARK-24701) SparkMaster WebUI allow all appids to be shown in detail on port 4040 rather than different ports per app
[ https://issues.apache.org/jira/browse/SPARK-24701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736366#comment-16736366 ] Pablo Langa Blanco commented on SPARK-24701: Hi [~toopt4], I think you already have this functionality in the history server. For each application Spark starts a new server (on a new port), but the history server stays up on a single port monitoring all the applications. As you can see at [https://spark.apache.org/docs/latest/monitoring.html]: "The history server displays both completed and incomplete Spark jobs" and "Incomplete applications are only updated intermittently" (see spark.history.fs.update.interval). Could that be a solution for you? > SparkMaster WebUI allow all appids to be shown in detail on port 4040 rather > than different ports per app > - > > Key: SPARK-24701 > URL: https://issues.apache.org/jira/browse/SPARK-24701 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.1 >Reporter: t oo >Priority: Major > Labels: master, security, ui, web, web-ui > Attachments: spark_ports.png > > > Right now the details for all application ids are shown on a different port per app > id, i.e. 4040, 4041, 4042, etc.; this is problematic for environments with > tight firewall settings. Proposing to allow 4040?appid=1, 4040?appid=2, > 4040?appid=3, etc. for the master web UI, just like what the History Web UI > does.
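The history-server setup suggested in the comment above can be sketched as a configuration fragment. This is a minimal example, not a definitive setup: the property names come from the Spark monitoring documentation, but the directory paths and interval values are placeholders.

```properties
# spark-defaults.conf (example values)
spark.eventLog.enabled            true
spark.eventLog.dir                hdfs:///spark-logs
spark.history.fs.logDirectory     hdfs:///spark-logs
# How often the history server re-scans the log directory, which is
# why incomplete (running) applications are only updated intermittently:
spark.history.fs.update.interval  10s
```

With this, `sbin/start-history-server.sh` serves all applications, completed and incomplete, from a single port (18080 by default) instead of one UI per application on 4040, 4041, and so on.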
[jira] [Commented] (SPARK-26518) UI Application Info Race Condition Can Throw NoSuchElement
[ https://issues.apache.org/jira/browse/SPARK-26518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736087#comment-16736087 ] Pablo Langa Blanco commented on SPARK-26518: I was trying to find an easy solution to this but I couldn't find one. The problem is not only at that point, because when you handle this error there are other elements that are not loaded yet and they fail too. Another thing I tried was to find a way to know whether the KVStore is loaded and show a message instead of the error, but it's not easy. So, as this is an issue with trivial priority, it's not reasonable to change a lot of things. > UI Application Info Race Condition Can Throw NoSuchElement > -- > > Key: SPARK-26518 > URL: https://issues.apache.org/jira/browse/SPARK-26518 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0, 2.4.0 >Reporter: Russell Spitzer >Priority: Trivial > > There is a slight race condition in the > [AppStatusStore|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusStore.scala#L39] > Which calls `next` on the returned store even if it is empty, which it can be > for a short period of time after the UI is up but before the store is > populated. > {code} > > > Error 500 Server Error > > HTTP ERROR 500 > Problem accessing /jobs/. 
Reason: > Server ErrorCaused > by:java.util.NoSuchElementException > at java.util.Collections$EmptyIterator.next(Collections.java:4189) > at > org.apache.spark.util.kvstore.InMemoryStore$InMemoryIterator.next(InMemoryStore.java:281) > at > org.apache.spark.status.AppStatusStore.applicationInfo(AppStatusStore.scala:38) > at org.apache.spark.ui.jobs.AllJobsPage.render(AllJobsPage.scala:275) > at org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:86) > at org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:86) > at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:687) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) > at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:865) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:535) > at > org.spark_project.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317) > at > org.spark_project.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473) > at > org.spark_project.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:724) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132) > at org.spark_project.jetty.server.Server.handle(Server.java:531) > at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:352) > at > 
org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:260) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281) > at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:102) > at > org.spark_project.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
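The failure mode discussed above, calling `next` on an iterator that may still be empty while the status store is being populated, and the guard that avoids it can be sketched language-agnostically. The following is a minimal Python illustration with hypothetical names (the real code is Scala/Java inside `AppStatusStore`), not the actual Spark fix:

```python
def application_info(store):
    # Unguarded lookup, mirroring the reported bug: raises StopIteration
    # (Java's NoSuchElementException) if the store is not yet populated.
    return next(iter(store))

def application_info_safe(store, placeholder=None):
    # Guarded lookup: return a placeholder while the KVStore is still loading,
    # so the UI can render "application starting" instead of an HTTP 500.
    return next(iter(store), placeholder)

empty_store = []            # UI came up before the status store was populated
populated_store = ["app-1234"]

assert application_info_safe(empty_store) is None
assert application_info_safe(populated_store) == "app-1234"
```

As the comment notes, guarding this single call is not enough in practice: other UI elements also assume a populated store, which is why a full fix would need a store-wide "loaded" signal.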
[jira] [Commented] (SPARK-26457) Show hadoop configurations in HistoryServer environment tab
[ https://issues.apache.org/jira/browse/SPARK-26457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736024#comment-16736024 ] Pablo Langa Blanco commented on SPARK-26457: Ok, I didn't know you were working on it. I'm going to comment on your pull request. Thanks!! > Show hadoop configurations in HistoryServer environment tab > --- > > Key: SPARK-26457 > URL: https://issues.apache.org/jira/browse/SPARK-26457 > Project: Spark > Issue Type: New Feature > Components: Spark Core, Web UI >Affects Versions: 2.3.2, 2.4.0 > Environment: Maybe it is good to show some configurations in > the HistoryServer environment tab for debugging some bugs about hadoop >Reporter: deshanxiao >Priority: Minor >
[jira] [Commented] (SPARK-26518) UI Application Info Race Condition Can Throw NoSuchElement
[ https://issues.apache.org/jira/browse/SPARK-26518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735135#comment-16735135 ] Pablo Langa Blanco commented on SPARK-26518: Hi I'm looking into it. Pablo > UI Application Info Race Condition Can Throw NoSuchElement > -- > > Key: SPARK-26518 > URL: https://issues.apache.org/jira/browse/SPARK-26518 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0, 2.4.0 >Reporter: Russell Spitzer >Priority: Trivial > > There is a slight race condition in the > [AppStatusStore|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusStore.scala#L39] > Which calls `next` on the returned store even if it is empty which i can be > for a short period of time after the UI is up but before the store is > populated. > {code} > > > Error 500 Server Error > > HTTP ERROR 500 > Problem accessing /jobs/. Reason: > Server ErrorCaused > by:java.util.NoSuchElementException > at java.util.Collections$EmptyIterator.next(Collections.java:4189) > at > org.apache.spark.util.kvstore.InMemoryStore$InMemoryIterator.next(InMemoryStore.java:281) > at > org.apache.spark.status.AppStatusStore.applicationInfo(AppStatusStore.scala:38) > at org.apache.spark.ui.jobs.AllJobsPage.render(AllJobsPage.scala:275) > at org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:86) > at org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:86) > at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:687) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) > at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:865) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:535) > at > org.spark_project.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255) > at > 
org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317) > at > org.spark_project.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473) > at > org.spark_project.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:724) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132) > at org.spark_project.jetty.server.Server.handle(Server.java:531) > at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:352) > at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:260) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281) > at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:102) > at > org.spark_project.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26457) Show hadoop configurations in HistoryServer environment tab
[ https://issues.apache.org/jira/browse/SPARK-26457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734986#comment-16734986 ] Pablo Langa Blanco commented on SPARK-26457: Hi [~deshanxiao], what configurations are you thinking about? Could you explain cases where this information would be relevant? I'm thinking of the case where you are working with YARN: YARN has all the information about Hadoop that we could need about the Spark job, so you don't need it duplicated in the History Server. Thanks! > Show hadoop configurations in HistoryServer environment tab > --- > > Key: SPARK-26457 > URL: https://issues.apache.org/jira/browse/SPARK-26457 > Project: Spark > Issue Type: New Feature > Components: Spark Core, Web UI >Affects Versions: 2.3.2, 2.4.0 > Environment: Maybe it is good to show some configurations in > the HistoryServer environment tab for debugging some bugs about hadoop >Reporter: deshanxiao >Priority: Minor >
[jira] [Commented] (SPARK-25917) Spark UI's executors page loads forever when memoryMetrics in None. Fix is to JSON ignore memorymetrics when it is None.
[ https://issues.apache.org/jira/browse/SPARK-25917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734984#comment-16734984 ] Pablo Langa Blanco commented on SPARK-25917: The pull request was closed because the problem has already been solved; the issue should be closed too. > Spark UI's executors page loads forever when memoryMetrics in None. Fix is to > JSON ignore memorymetrics when it is None. > > > Key: SPARK-25917 > URL: https://issues.apache.org/jira/browse/SPARK-25917 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Web UI >Affects Versions: 2.3.2 >Reporter: Rong Tang >Priority: Major > > Spark UI's executors page loads forever when memoryMetrics is None. The fix is to > JSON-ignore memoryMetrics when it is None. > ## How was this patch tested? > Before fix: (loads forever) > ![image](https://user-images.githubusercontent.com/1785565/47875681-64dfe480-ddd4-11e8-8d15-5ed1457bc24f.png) > After fix: > ![image](https://user-images.githubusercontent.com/1785565/47875691-6b6e5c00-ddd4-11e8-9895-db8dd9730ee1.png)
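The idea behind the fix described above, omitting a field from the serialized JSON entirely when it is None rather than emitting `"memoryMetrics": null`, can be illustrated with a small sketch. This is a hypothetical Python analogue (Spark's actual implementation uses Jackson annotations on the Scala REST API classes), with made-up field values:

```python
import json

def to_json(executor_summary: dict) -> str:
    # Drop None-valued fields before serializing, so the executors page's
    # JavaScript never receives a null memoryMetrics block to trip over.
    return json.dumps({k: v for k, v in executor_summary.items() if v is not None})

summary = {"id": "driver", "memoryMetrics": None, "totalCores": 4}
print(to_json(summary))  # memoryMetrics is omitted from the output
```

The alternative, keeping the field as `null`, would require every JSON consumer to null-check it; dropping it at the serialization boundary fixes all consumers at once.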