[jira] [Commented] (SPARK-46778) get_json_object flattens wildcard queries that match a single value

2024-03-04 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17823371#comment-17823371
 ] 

Pablo Langa Blanco commented on SPARK-46778:


I was looking at it and I found a comment in the code that explains this 
behavior 
([https://github.com/apache/spark/blob/35bced42474e3221cf61d13a142c3c5470df1f22/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L377]).

There are tests around that code that exercise this behavior, and I reproduced it 
in Hive 3.1.3, which still behaves the same way, so I don't know if we can change 
it.
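
For anyone who needs a stable schema in the meantime, a from_json-based 
workaround along these lines may help (just a sketch, not a tested fix; the 
'array<struct<a:string>>' schema is assumed from the example in the description):
{code:java}
scala> val df = Seq("""[{"a":"A"},{"b":"B"}]""", """[{"a":"A"},{"a":"B"}]""").toDF("jsonStr")

// Parse with an explicit schema so every row yields array<string>, instead of
// relying on get_json_object's wildcard flattening. Missing keys become null,
// so the count is the array length rather than the number of matches.
scala> val parsed = df.selectExpr("from_json(jsonStr, 'array<struct<a:string>>') AS parsed")

scala> parsed.selectExpr("transform(parsed, x -> x.a) AS a_values", "size(transform(parsed, x -> x.a)) AS n").show(false)
{code}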

> get_json_object flattens wildcard queries that match a single value
> ---
>
> Key: SPARK-46778
> URL: https://issues.apache.org/jira/browse/SPARK-46778
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Robert Joseph Evans
>Priority: Major
>
> I think this impacts all versions of {{{}get_json_object{}}}, but I am not 
> 100% sure.
> The unit test for 
> [$.store.book[*].reader|https://github.com/apache/spark/blob/39f8e1a5953b5897f893151d24dc585a80c0c8a0/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala#L142-L146]
>  verifies that the output of this query is a single level JSON array, but 
> when I put the same JSON and JSON path into [http://jsonpath.com/] I get a 
> result with multiple levels of nesting. It looks like Apache Spark tries to 
> flatten lists for {{[*]}} matches when there is only a single element that 
> matches.
> {code:java}
> scala> 
> Seq("""[{"a":"A"},{"b":"B"}]""","""[{"a":"A"},{"a":"B"}]""").toDF("jsonStr").selectExpr("""get_json_object(jsonStr,"$[*].a")""").show(false)
> +--------------------------------+
> |get_json_object(jsonStr, $[*].a)|
> +--------------------------------+
> |"A"                             |
> |["A","B"]                       |
> +--------------------------------+ {code}
> But this means I no longer get a consistent schema back, even if the input 
> schema is known to be consistent. For example, if I wanted to know how many 
> elements matched this query, I could wrap it in {{json_array_length}}, but 
> that does not work in the general case.
> {code:java}
> scala> 
> Seq("""[{"a":"A"},{"b":"B"}]""","""[{"a":"A"},{"a":"B"}]""").toDF("jsonStr").selectExpr("""json_array_length(get_json_object(jsonStr,"$[*].a"))""").show(false)
> +---------------------------------------------------+
> |json_array_length(get_json_object(jsonStr, $[*].a))|
> +---------------------------------------------------+
> |null                                               |
> |2                                                  |
> +---------------------------------------------------+ {code}
> If the value returned happens to be a JSON array itself, then I do get a 
> number, but it is the wrong one.
> {code:java}
> scala> 
> Seq("""[{"a":[1,2,3,4,5]},{"b":"B"}]""","""[{"a":[1,2,3,4,5]},{"a":[1,2,3,4,5]}]""").toDF("jsonStr").selectExpr("""json_array_length(get_json_object(jsonStr,"$[*].a"))""").show(false)
> +---------------------------------------------------+
> |json_array_length(get_json_object(jsonStr, $[*].a))|
> +---------------------------------------------------+
> |5                                                  |
> |2                                                  |
> +---------------------------------------------------+ {code}






[jira] [Commented] (SPARK-47063) CAST long to timestamp has different behavior for codegen vs interpreted

2024-02-27 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17821273#comment-17821273
 ] 

Pablo Langa Blanco commented on SPARK-47063:


OK, personally I would prefer the truncation too. I'll open a PR with it and we 
can discuss it there.
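
For reference, the difference comes down to saturating vs. wrapping arithmetic 
(a quick spark-shell check of the two operations involved, not code taken from 
Cast.scala):
{code:java}
scala> java.util.concurrent.TimeUnit.SECONDS.toMicros(Long.MaxValue)  // interpreted path: saturates
res0: Long = 9223372036854775807

scala> Long.MaxValue * 1000000L  // what the generated code effectively does: wraps around
res1: Long = -1000000
{code}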

> CAST long to timestamp has different behavior for codegen vs interpreted
> 
>
> Key: SPARK-47063
> URL: https://issues.apache.org/jira/browse/SPARK-47063
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2
>Reporter: Robert Joseph Evans
>Priority: Major
>
> It probably impacts a lot more versions of the code than this, but I verified 
> it on 3.4.2. This also appears to be related to 
> https://issues.apache.org/jira/browse/SPARK-39209
> {code:java}
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").selectExpr("*", 
> "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-----------------------------+--------------------+
> |v                   |ts                           |unix_micros(ts)     |
> +--------------------+-----------------------------+--------------------+
> |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
> |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
> |0                   |1970-01-01 00:00:00          |0                   |
> |1990                |1970-01-01 00:33:10          |1990000000          |
> +--------------------+-----------------------------+--------------------+
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 
> 1990L).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as 
> ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-------------------+---------------+
> |v                   |ts                 |unix_micros(ts)|
> +--------------------+-------------------+---------------+
> |9223372036854775807 |1969-12-31 23:59:59|-1000000       |
> |-9223372036854775808|1970-01-01 00:00:00|0              |
> |0                   |1970-01-01 00:00:00|0              |
> |1990                |1970-01-01 00:33:10|1990000000     |
> +--------------------+-------------------+---------------+
> {code}
> It looks like InMemoryTableScanExec is not doing code generation for the 
> expressions, but the ProjectExec after the repartition is.
> If I disable code gen I get the same answer in both cases.
> {code:java}
> scala> spark.conf.set("spark.sql.codegen.wholeStage", false)
> scala> spark.conf.set("spark.sql.codegen.factoryMode", "NO_CODEGEN")
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").selectExpr("*", 
> "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-----------------------------+--------------------+
> |v                   |ts                           |unix_micros(ts)     |
> +--------------------+-----------------------------+--------------------+
> |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
> |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
> |0                   |1970-01-01 00:00:00          |0                   |
> |1990                |1970-01-01 00:33:10          |1990000000          |
> +--------------------+-----------------------------+--------------------+
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 
> 1990L).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as 
> ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-----------------------------+--------------------+
> |v                   |ts                           |unix_micros(ts)     |
> +--------------------+-----------------------------+--------------------+
> |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
> |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
> |0                   |1970-01-01 00:00:00          |0                   |
> |1990                |1970-01-01 00:33:10          |1990000000          |
> +--------------------+-----------------------------+--------------------+
> {code}
> [https://github.com/apache/spark/blob/e2cd71a4cd54bbdf5af76d3edfbb2fc8c1b067b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L1627]
> is the code used in codegen, but
> [https://github.com/apache/spark/blob/e2cd71a4cd54bbdf5af76d3edfbb2fc8c1b067b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L687]
> is what is used outside of codegen.
> Apparently `SECONDS.toMicros` truncates the value on an overflow, but the 
> codegen does not.
> {code:java}
> scala> Long.MaxValue
> res11: Long = 9223372036854775807
> scala> 

[jira] [Commented] (SPARK-47063) CAST long to timestamp has different behavior for codegen vs interpreted

2024-02-26 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17820853#comment-17820853
 ] 

Pablo Langa Blanco commented on SPARK-47063:


[~revans2] are you working on the fix?

> CAST long to timestamp has different behavior for codegen vs interpreted
> 
>
> Key: SPARK-47063
> URL: https://issues.apache.org/jira/browse/SPARK-47063
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2
>Reporter: Robert Joseph Evans
>Priority: Major
>
> It probably impacts a lot more versions of the code than this, but I verified 
> it on 3.4.2. This also appears to be related to 
> https://issues.apache.org/jira/browse/SPARK-39209
> {code:java}
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").selectExpr("*", 
> "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-----------------------------+--------------------+
> |v                   |ts                           |unix_micros(ts)     |
> +--------------------+-----------------------------+--------------------+
> |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
> |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
> |0                   |1970-01-01 00:00:00          |0                   |
> |1990                |1970-01-01 00:33:10          |1990000000          |
> +--------------------+-----------------------------+--------------------+
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 
> 1990L).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as 
> ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-------------------+---------------+
> |v                   |ts                 |unix_micros(ts)|
> +--------------------+-------------------+---------------+
> |9223372036854775807 |1969-12-31 23:59:59|-1000000       |
> |-9223372036854775808|1970-01-01 00:00:00|0              |
> |0                   |1970-01-01 00:00:00|0              |
> |1990                |1970-01-01 00:33:10|1990000000     |
> +--------------------+-------------------+---------------+
> {code}
> It looks like InMemoryTableScanExec is not doing code generation for the 
> expressions, but the ProjectExec after the repartition is.
> If I disable code gen I get the same answer in both cases.
> {code:java}
> scala> spark.conf.set("spark.sql.codegen.wholeStage", false)
> scala> spark.conf.set("spark.sql.codegen.factoryMode", "NO_CODEGEN")
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").selectExpr("*", 
> "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-----------------------------+--------------------+
> |v                   |ts                           |unix_micros(ts)     |
> +--------------------+-----------------------------+--------------------+
> |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
> |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
> |0                   |1970-01-01 00:00:00          |0                   |
> |1990                |1970-01-01 00:33:10          |1990000000          |
> +--------------------+-----------------------------+--------------------+
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 
> 1990L).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as 
> ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-----------------------------+--------------------+
> |v                   |ts                           |unix_micros(ts)     |
> +--------------------+-----------------------------+--------------------+
> |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
> |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
> |0                   |1970-01-01 00:00:00          |0                   |
> |1990                |1970-01-01 00:33:10          |1990000000          |
> +--------------------+-----------------------------+--------------------+
> {code}
> [https://github.com/apache/spark/blob/e2cd71a4cd54bbdf5af76d3edfbb2fc8c1b067b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L1627]
> is the code used in codegen, but
> [https://github.com/apache/spark/blob/e2cd71a4cd54bbdf5af76d3edfbb2fc8c1b067b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L687]
> is what is used outside of codegen.
> Apparently `SECONDS.toMicros` truncates the value on an overflow, but the 
> codegen does not.
> {code:java}
> scala> Long.MaxValue
> res11: Long = 9223372036854775807
> scala> java.util.concurrent.TimeUnit.SECONDS.toMicros(Long.MaxValue)
> res12: Long = 9223372036854775807
> scala> Long.MaxValue 

[jira] [Created] (SPARK-47123) JDBCRDD does not correctly handle errors in getQueryOutputSchema

2024-02-21 Thread Pablo Langa Blanco (Jira)
Pablo Langa Blanco created SPARK-47123:
--

 Summary: JDBCRDD does not correctly handle errors in 
getQueryOutputSchema
 Key: SPARK-47123
 URL: https://issues.apache.org/jira/browse/SPARK-47123
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.5.0, 4.0.0
Reporter: Pablo Langa Blanco


If there is an error executing statement.executeQuery(), it is possible for 
another error thrown in one of the finally blocks to hide the main error.
{code:java}
def getQueryOutputSchema(
      query: String, options: JDBCOptions, dialect: JdbcDialect): StructType = {
    val conn: Connection = dialect.createConnectionFactory(options)(-1)
    try {
      val statement = conn.prepareStatement(query)
      try {
        statement.setQueryTimeout(options.queryTimeout)
        val rs = statement.executeQuery()
        try {
          JdbcUtils.getSchema(rs, dialect, alwaysNullable = true,
            isTimestampNTZ = options.preferTimestampNTZ)
        } finally {
          rs.close()
        }
      } finally {
        statement.close()
      }
    } finally {
      conn.close()
    }
  } {code}
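
A minimal, self-contained sketch of one way to fix it (a hypothetical helper, not 
the actual patch): attach any failure from close() to the primary exception as a 
suppressed exception instead of letting the finally block replace it.
{code:java}
def withCloseable[R <: AutoCloseable, T](resource: R)(body: R => T): T = {
  var primary: Throwable = null
  try {
    body(resource)
  } catch {
    case t: Throwable =>
      primary = t
      throw t
  } finally {
    try {
      resource.close()
    } catch {
      case closeError: Throwable =>
        // keep the original error from executeQuery()/getSchema() visible
        if (primary != null) primary.addSuppressed(closeError) else throw closeError
    }
  }
} {code}
getQueryOutputSchema could then nest withCloseable(conn), withCloseable(statement) 
and withCloseable(rs) the same way the current try/finally blocks nest.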






[jira] [Commented] (SPARK-47123) JDBCRDD does not correctly handle errors in getQueryOutputSchema

2024-02-21 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819413#comment-17819413
 ] 

Pablo Langa Blanco commented on SPARK-47123:


I'm working on it

> JDBCRDD does not correctly handle errors in getQueryOutputSchema
> 
>
> Key: SPARK-47123
> URL: https://issues.apache.org/jira/browse/SPARK-47123
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Pablo Langa Blanco
>Priority: Minor
>
> If there is an error executing statement.executeQuery(), it is possible for 
> another error thrown in one of the finally blocks to hide the main error.
> {code:java}
> def getQueryOutputSchema(
>       query: String, options: JDBCOptions, dialect: JdbcDialect): StructType 
> = {
>     val conn: Connection = dialect.createConnectionFactory(options)(-1)
>     try {
>       val statement = conn.prepareStatement(query)
>       try {
>         statement.setQueryTimeout(options.queryTimeout)
>         val rs = statement.executeQuery()
>         try {
>           JdbcUtils.getSchema(rs, dialect, alwaysNullable = true,
>             isTimestampNTZ = options.preferTimestampNTZ)
>         } finally {
>           rs.close()
>         }
>       } finally {
>         statement.close()
>       }
>     } finally {
>       conn.close()
>     }
>   } {code}






[jira] [Commented] (SPARK-45039) Include full identifier in Storage tab

2023-09-01 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17761295#comment-17761295
 ] 

Pablo Langa Blanco commented on SPARK-45039:


https://github.com/apache/spark/pull/42759

> Include full identifier in Storage tab
> --
>
> Key: SPARK-45039
> URL: https://issues.apache.org/jira/browse/SPARK-45039
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 4.0.0
>Reporter: Pablo Langa Blanco
>Priority: Minor
> Attachments: image-2023-08-31-18-38-37-856.png
>
>
> If you have 2 databases with tables with the same name like:  db1.table, 
> db2.table
>  
> {code:java}
> scala> spark.sql("use db1")
> scala> spark.sql("cache table table")
>    
> scala> spark.sql("use db2")
> scala> spark.sql("cache table table"){code}
> You cannot differentiate it in the Storage tab
> !image-2023-08-31-18-38-37-856.png!
>  






[jira] [Commented] (SPARK-45039) Include full identifier in Storage tab

2023-08-31 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17761108#comment-17761108
 ] 

Pablo Langa Blanco commented on SPARK-45039:


I'm working on it

> Include full identifier in Storage tab
> --
>
> Key: SPARK-45039
> URL: https://issues.apache.org/jira/browse/SPARK-45039
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 4.0.0
>Reporter: Pablo Langa Blanco
>Priority: Minor
> Attachments: image-2023-08-31-18-38-37-856.png
>
>
> If you have 2 databases with tables with the same name like:  db1.table, 
> db2.table
>  
> {code:java}
> scala> spark.sql("use db1")
> scala> spark.sql("cache table table")
>    
> scala> spark.sql("use db2")
> scala> spark.sql("cache table table"){code}
> You cannot differentiate it in the Storage tab
> !image-2023-08-31-18-38-37-856.png!
>  






[jira] [Updated] (SPARK-45039) Include full identifier in Storage tab

2023-08-31 Thread Pablo Langa Blanco (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pablo Langa Blanco updated SPARK-45039:
---
Description: 
If you have 2 databases with tables with the same name like:  db1.table, 
db2.table

 
{code:java}
scala> spark.sql("use db1")

scala> spark.sql("cache table table")
   
scala> spark.sql("use db2")

scala> spark.sql("cache table table"){code}
You cannot differentiate it in the Storage tab

!image-2023-08-31-18-38-37-856.png!

 

  was:
If you have 2 databases with tables with the same name like:  db1.table, 
db2.table

 
{code:java}
scala> spark.sql("use db1")

scala> spark.sql("cache table table")
   
scala> spark.sql("use db2")

scala> spark.sql("cache table table"){code}
You cannot differentiate it in the Storage tab

!image-2023-08-31-18-37-48-330.png!

 


> Include full identifier in Storage tab
> --
>
> Key: SPARK-45039
> URL: https://issues.apache.org/jira/browse/SPARK-45039
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 4.0.0
>Reporter: Pablo Langa Blanco
>Priority: Minor
> Attachments: image-2023-08-31-18-38-37-856.png
>
>
> If you have 2 databases with tables with the same name like:  db1.table, 
> db2.table
>  
> {code:java}
> scala> spark.sql("use db1")
> scala> spark.sql("cache table table")
>    
> scala> spark.sql("use db2")
> scala> spark.sql("cache table table"){code}
> You cannot differentiate it in the Storage tab
> !image-2023-08-31-18-38-37-856.png!
>  






[jira] [Updated] (SPARK-45039) Include full identifier in Storage tab

2023-08-31 Thread Pablo Langa Blanco (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pablo Langa Blanco updated SPARK-45039:
---
Attachment: image-2023-08-31-18-38-37-856.png

> Include full identifier in Storage tab
> --
>
> Key: SPARK-45039
> URL: https://issues.apache.org/jira/browse/SPARK-45039
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 4.0.0
>Reporter: Pablo Langa Blanco
>Priority: Minor
> Attachments: image-2023-08-31-18-38-37-856.png
>
>
> If you have 2 databases with tables with the same name like:  db1.table, 
> db2.table
>  
> {code:java}
> scala> spark.sql("use db1")
> scala> spark.sql("cache table table")
>    
> scala> spark.sql("use db2")
> scala> spark.sql("cache table table"){code}
> You cannot differentiate it in the Storage tab
> !image-2023-08-31-18-37-48-330.png!
>  






[jira] [Created] (SPARK-45039) Include full identifier in Storage tab

2023-08-31 Thread Pablo Langa Blanco (Jira)
Pablo Langa Blanco created SPARK-45039:
--

 Summary: Include full identifier in Storage tab
 Key: SPARK-45039
 URL: https://issues.apache.org/jira/browse/SPARK-45039
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Web UI
Affects Versions: 4.0.0
Reporter: Pablo Langa Blanco
 Attachments: image-2023-08-31-18-38-37-856.png

If you have 2 databases with tables with the same name like:  db1.table, 
db2.table

 
{code:java}
scala> spark.sql("use db1")

scala> spark.sql("cache table table")
   
scala> spark.sql("use db2")

scala> spark.sql("cache table table"){code}
You cannot differentiate it in the Storage tab

!image-2023-08-31-18-37-48-330.png!

 






[jira] [Commented] (SPARK-44500) parse_url treats key as regular expression

2023-07-24 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746705#comment-17746705
 ] 

Pablo Langa Blanco commented on SPARK-44500:


[~jan.chou...@gmail.com] What do you think?
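
For what it's worth, quoting the key as suggested in the description does make 
the previously failing pattern compile (quick sanity check only, not the full fix):
{code:java}
scala> import java.util.regex.Pattern

scala> Pattern.compile("(&|^)" + Pattern.quote("a[c") + "=([^&]*)").pattern
res0: String = (&|^)\Qa[c\E=([^&]*)
{code}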

> parse_url treats key as regular expression
> --
>
> Key: SPARK-44500
> URL: https://issues.apache.org/jira/browse/SPARK-44500
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0, 3.4.0, 3.4.1
>Reporter: Robert Joseph Evans
>Priority: Major
>
> To be clear I am not 100% sure that this is a bug. It might be a feature, but 
> I don't see anywhere that it is used as a feature. If it is a feature it 
> really should be documented, because there are pitfalls. If it is a bug it 
> should be fixed because it is really confusing and it is simple to shoot 
> yourself in the foot.
> ```scala
> > val urls = Seq("http://foo/bar?abc=BAD&a.c=GOOD", 
> > "http://foo/bar?a.c=GOOD&abc=BAD").toDF
> > urls.selectExpr("parse_url(value, 'QUERY', 'a.c')").show(false)
> +----------------------------+
> |parse_url(value, QUERY, a.c)|
> +----------------------------+
> |BAD                         |
> |GOOD                        |
> +----------------------------+
> > urls.selectExpr("parse_url(value, 'QUERY', 'a[c')").show(false)
> java.util.regex.PatternSyntaxException: Unclosed character class near index 15
> (&|^)a[c=([^&]*)
>^
>   at java.util.regex.Pattern.error(Pattern.java:1969)
>   at java.util.regex.Pattern.clazz(Pattern.java:2562)
>   at java.util.regex.Pattern.sequence(Pattern.java:2077)
>   at java.util.regex.Pattern.expr(Pattern.java:2010)
>   at java.util.regex.Pattern.compile(Pattern.java:1702)
>   at java.util.regex.Pattern.<init>(Pattern.java:1352)
>   at java.util.regex.Pattern.compile(Pattern.java:1028)
> ```
> The simple fix is to quote the key when making the pattern.
> ```scala
>   private def getPattern(key: UTF8String): Pattern = {
> Pattern.compile(REGEXPREFIX + Pattern.quote(key.toString) + REGEXSUBFIX)
>   }
> ```






[jira] [Commented] (SPARK-40679) add named_struct to spark scala api for parity to spark sql

2023-02-28 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17694757#comment-17694757
 ] 

Pablo Langa Blanco commented on SPARK-40679:


I think the original implementation already supports creating named structs 
through the DataFrame interface.

[https://github.com/apache/spark/pull/6874/files]

 
{code:java}
/**
   * Creates a new struct column.
   * If the input column is a column in a [[DataFrame]], or a derived column 
expression
   * that is named (i.e. aliased), its name would be remained as the 
StructField's name,
   * otherwise, the newly generated StructField's name would be auto generated 
as col${index + 1},
   * i.e. col1, col2, col3, ...
*/
def struct(cols: Column*): Column = {
...{code}
 
{code:java}
test("struct with column expression to be automatically named") {
    val df = Seq((1, "str")).toDF("a", "b")
    val result = df.select(struct((col("a") * 2), col("b")))
    val expectedType = StructType(Seq(
      StructField("col1", IntegerType, nullable = false),
      StructField("b", StringType)
    ))
    assert(result.first.schema(0).dataType === expectedType)
    checkAnswer(result, Row(Row(2, "str")))
  } 
{code}
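
Combining that with explicit aliases gets pretty close to named_struct from the 
Scala API today (a small sketch, not part of the linked PR):
{code:java}
scala> import org.apache.spark.sql.functions._

scala> val df = Seq((1, "str")).toDF("a", "b")

// aliased inputs keep their names as the StructField names
scala> df.select(struct(col("a").as("x"), col("b").as("y")).as("s")).printSchema()
root
 |-- s: struct (nullable = false)
 |    |-- x: integer (nullable = false)
 |    |-- y: string (nullable = true)
{code}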
 

 

 

> add named_struct to spark scala api for parity to spark sql
> ---
>
> Key: SPARK-40679
> URL: https://issues.apache.org/jira/browse/SPARK-40679
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: Douglas Moore
>Priority: Major
>
> To facilitate migration from cast(struct( to `named_struct` where 
> comparison of structures is needed.
> To maintain parity between APIs. Understand that using `expr` would work, but 
> it wouldn't be typed.






[jira] [Commented] (SPARK-42407) `with as` executed again

2023-02-26 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693723#comment-17693723
 ] 

Pablo Langa Blanco commented on SPARK-42407:


In my opinion, "WITH AS" syntax is intended to simplify sql queries, but not to 
act at the execution level. To get what you want you can use "CACHE TABLE" 
combined with 'WITH AS'.
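
Something along these lines (table and column names here are just placeholders):
{code:java}
// Materialize the shared subquery once with CACHE TABLE, then reference it from the CTE.
scala> spark.sql("CACHE TABLE cached_events AS SELECT * FROM events WHERE event_date >= '2023-01-01'")

scala> spark.sql("WITH e AS (SELECT * FROM cached_events) SELECT e1.id FROM e e1 JOIN e e2 ON e1.id = e2.id")
{code}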

> `with as` executed again
> 
>
> Key: SPARK-42407
> URL: https://issues.apache.org/jira/browse/SPARK-42407
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.3
>Reporter: yiku123
>Priority: Major
>
> When 'with as' is used multiple times, it will be executed again each time 
> without saving the results of 'with as', resulting in low efficiency.
> Will you consider improving the behavior of 'with as'?
>  






[jira] [Comment Edited] (SPARK-40525) Floating-point value with an INT/BYTE/SHORT/LONG type errors out in DataFrame but evaluates to a rounded value in SparkSQL

2023-02-26 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693721#comment-17693721
 ] 

Pablo Langa Blanco edited comment on SPARK-40525 at 2/26/23 10:57 PM:
--

Hi [~x/sys],

When you are working with the Spark SQL interface you can configure this 
behavior: there are 3 policies for the type coercion rules 
([https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html]). If you set 
spark.sql.storeAssignmentPolicy to "strict", you get what you expect, but that is 
not the default policy.

I hope it helps.


was (Author: planga82):
Hi [~x/sys] ,

When you are working with Spark Sql interface you can configure the behavior 
and you have 3 policies for type coercion rules. 
([https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html)] If you 
set "strict" in spark.sql.storeAssignmentPolicy it's going to happen what you 
expected, but it's not the policy by default.

I hope it help you.

> Floating-point value with an INT/BYTE/SHORT/LONG type errors out in DataFrame 
> but evaluates to a rounded value in SparkSQL
> --
>
> Key: SPARK-40525
> URL: https://issues.apache.org/jira/browse/SPARK-40525
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> Storing an invalid INT value {{1.1}} using DataFrames via {{spark-shell}} 
> expectedly errors out. However, it is evaluated to a rounded value {{1}} if 
> the value is inserted into the table via {{{}spark-sql{}}}.
> h3. Steps to reproduce:
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-sql {code}
> Execute the following:
> {code:java}
> spark-sql> create table int_floating_point_vals(c1 INT) stored as ORC;
> 22/09/19 16:49:11 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> Time taken: 0.216 seconds
> spark-sql> insert into int_floating_point_vals select 1.1;
> Time taken: 1.747 seconds
> spark-sql> select * from int_floating_point_vals;
> 1
> Time taken: 0.518 seconds, Fetched 1 row(s){code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type & input combination 
> ({{{}INT{}}} and {{{}1.1{}}}).
> h4. Here is a simplified example in {{{}spark-shell{}}}, where insertion of 
> the aforementioned value correctly raises an exception:
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-shell{code}
> Execute the following:
> {code:java}
> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.types._
> val rdd = sc.parallelize(Seq(Row(1.1)))
> val schema = new StructType().add(StructField("c1", IntegerType, true))
> val df = spark.createDataFrame(rdd, schema)
> df.write.mode("overwrite").format("orc").saveAsTable("int_floating_point_vals")
>  {code}
> The following exception is raised:
> {code:java}
> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
> java.lang.Double is not a valid external type for schema of int{code}






[jira] [Commented] (SPARK-40525) Floating-point value with an INT/BYTE/SHORT/LONG type errors out in DataFrame but evaluates to a rounded value in SparkSQL

2023-02-26 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693721#comment-17693721
 ] 

Pablo Langa Blanco commented on SPARK-40525:


Hi [~x/sys],

When you are working with the Spark SQL interface you can configure this 
behavior: there are 3 policies for the type coercion rules 
([https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html]). If you set 
spark.sql.storeAssignmentPolicy to "strict", you get what you expect, but that is 
not the default policy.

I hope it helps.
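
For example (untested sketch, reusing the table from the description):
{code:java}
scala> spark.conf.set("spark.sql.storeAssignmentPolicy", "STRICT")

// with STRICT store assignment the implicit DECIMAL -> INT cast is rejected
// instead of silently rounding 1.1 to 1
scala> spark.sql("insert into int_floating_point_vals select 1.1")
{code}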

> Floating-point value with an INT/BYTE/SHORT/LONG type errors out in DataFrame 
> but evaluates to a rounded value in SparkSQL
> --
>
> Key: SPARK-40525
> URL: https://issues.apache.org/jira/browse/SPARK-40525
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: xsys
>Priority: Major
>
> h3. Describe the bug
> Storing an invalid INT value {{1.1}} using DataFrames via {{spark-shell}} 
> expectedly errors out. However, it is evaluated to a rounded value {{1}} if 
> the value is inserted into the table via {{{}spark-sql{}}}.
> h3. Steps to reproduce:
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-sql{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-sql {code}
> Execute the following:
> {code:java}
> spark-sql> create table int_floating_point_vals(c1 INT) stored as ORC;
> 22/09/19 16:49:11 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> Time taken: 0.216 seconds
> spark-sql> insert into int_floating_point_vals select 1.1;
> Time taken: 1.747 seconds
> spark-sql> select * from int_floating_point_vals;
> 1
> Time taken: 0.518 seconds, Fetched 1 row(s){code}
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type & input combination 
> ({{{}INT{}}} and {{{}1.1{}}}).
> h4. Here is a simplified example in {{{}spark-shell{}}}, where insertion of 
> the aforementioned value correctly raises an exception:
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-shell{code}
> Execute the following:
> {code:java}
> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.types._
> val rdd = sc.parallelize(Seq(Row(1.1)))
> val schema = new StructType().add(StructField("c1", IntegerType, true))
> val df = spark.createDataFrame(rdd, schema)
> df.write.mode("overwrite").format("orc").saveAsTable("int_floating_point_vals")
>  {code}
> The following exception is raised:
> {code:java}
> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
> java.lang.Double is not a valid external type for schema of int{code}






[jira] [Commented] (SPARK-42262) Table schema changes via V2SessionCatalog with HiveExternalCatalog

2023-01-31 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17682727#comment-17682727
 ] 

Pablo Langa Blanco commented on SPARK-42262:


I'm working on this

> Table schema changes via V2SessionCatalog with HiveExternalCatalog
> --
>
> Key: SPARK-42262
> URL: https://issues.apache.org/jira/browse/SPARK-42262
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Pablo Langa Blanco
>Priority: Major
>
> When configuring HiveExternalCatalog there are certain schema change 
> operations in V2SessionCatalog that are not performed due to the use of the 
> alterTable method instead of alterTableDataSchema.






[jira] [Created] (SPARK-42262) Table schema changes via V2SessionCatalog with HiveExternalCatalog

2023-01-31 Thread Pablo Langa Blanco (Jira)
Pablo Langa Blanco created SPARK-42262:
--

 Summary: Table schema changes via V2SessionCatalog with 
HiveExternalCatalog
 Key: SPARK-42262
 URL: https://issues.apache.org/jira/browse/SPARK-42262
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: Pablo Langa Blanco


When configuring HiveExternalCatalog there are certain schema change operations 
in V2SessionCatalog that are not performed due to the use of the alterTable 
method instead of alterTableDataSchema.






[jira] [Commented] (SPARK-41344) Reading V2 datasource masks underlying error

2022-12-06 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644061#comment-17644061
 ] 

Pablo Langa Blanco commented on SPARK-41344:


In this case the provider has been detected as DataSourceV2 and also implements 
SupportsCatalogOptions, so if it fails at that point, it does not make sense to 
try it as DataSource V1.

The CatalogV2Util.loadTable function catches NoSuchTableException, 
NoSuchDatabaseException and NoSuchNamespaceException and returns an Option, which 
makes sense in the other places where it is used, but not at this point. Maybe the 
best solution is to have another function for this case that does not catch those 
exceptions and does not return an Option.
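
Roughly this shape (hypothetical signature, not an actual patch):
{code:java}
import org.apache.spark.sql.connector.catalog.{Identifier, Table, TableCatalog}

// Unlike CatalogV2Util.loadTable, this neither swallows the lookup exceptions nor
// wraps the result in an Option, so DataSourceV2Utils would surface the original
// NoSuchTableException / NoSuchNamespaceException to the user.
def loadTableOrThrow(catalog: TableCatalog, ident: Identifier): Table =
  catalog.loadTable(ident)
{code}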

> Reading V2 datasource masks underlying error
> 
>
> Key: SPARK-41344
> URL: https://issues.apache.org/jira/browse/SPARK-41344
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.3.1, 3.4.0
>Reporter: Kevin Cheung
>Priority: Critical
> Attachments: image-2022-12-03-09-24-43-285.png
>
>
> In Spark 3.3, 
>  # DataSourceV2Utils, the loadV2Source calls: 
> {*}(CatalogV2Util.loadTable(catalog, ident, timeTravel).get{*}, 
> Some(catalog), Some(ident)).
>  # CatalogV2Util.scala, when it tries to *loadTable(x,x,x)* and it fails with 
> any of these exceptions NoSuchTableException, NoSuchDatabaseException, 
> NoSuchNamespaceException, it would return None
>  # Coming back to DataSourceV2Utils, None was previously returned and calling 
> None.get results in a cryptic error technically "correct", but the *original 
> exceptions NoSuchTableException, NoSuchDatabaseException, 
> NoSuchNamespaceException are thrown away.*
>  
> *Ask:*
> Retain the original error and propagate this to the user. Prior to Spark 3.3, 
> the *original error* was shown and this seems like a design flaw.
>  
> *Sample user facing error:*
> None.get
> java.util.NoSuchElementException: None.get
>     at scala.None$.get(Option.scala:529)
>     at scala.None$.get(Option.scala:527)
>     at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.loadV2Source(DataSourceV2Utils.scala:129)
>     at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:209)
>     at scala.Option.flatMap(Option.scala:271)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:207)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:171)
>  
> *DataSourceV2Utils.scala - CatalogV2Util.loadTable(x,x,x).get*
> [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Utils.scala#L137]
> *CatalogV2Util.scala - Option(catalog.asTableCatalog.loadTable(ident))*
> [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala#L341]
> *CatalogV2Util.scala - catching the exceptions and return None*
> [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala#L344]






[jira] [Resolved] (SPARK-39623) partitionng by datestamp leads to wrong query on backend?

2022-08-17 Thread Pablo Langa Blanco (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pablo Langa Blanco resolved SPARK-39623.

Resolution: Not A Problem

> partitionng by datestamp leads to wrong query on backend?
> -
>
> Key: SPARK-39623
> URL: https://issues.apache.org/jira/browse/SPARK-39623
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Dmitry
>Priority: Major
>
> Hello,
> I am new to Apache spark, so please bear with me. I would like to report what 
> seems to me a bug, but may be I am just not understanding something.
> My goal is to run data analysis on a spark cluster. Data is stored in 
> PostgreSQL DB. Tables contained timestamped entries (timestamp with time 
> zone).
> The code look like:
>  {code:python}
> from pyspark.sql import SparkSession
> spark = SparkSession \
> .builder \
> .appName("foo") \
> .config("spark.jars", "/opt/postgresql-42.4.0.jar") \
> .getOrCreate()
> df = spark.read \
>  .format("jdbc") \
>  .option("url", "jdbc:postgresql://example.org:5432/postgres") \
>  .option("dbtable", "billing") \
>  .option("user", "user") \
>  .option("driver", "org.postgresql.Driver") \
>  .option("numPartitions", "4") \
>  .option("partitionColumn", "datestamp") \
>  .option("lowerBound", "2022-01-01 00:00:00") \
>  .option("upperBound", "2022-06-26 23:59:59") \
>  .option("fetchsize", 100) \
>  .load()
> t0 = time.time()
> print("Number of entries is => ", df.count(), " Time to execute ", 
> time.time()-t0)
> ...
> {code}
> datestamp is timestamp with time zone. 
> I see this query on DB backend:
> {code:java}
> SELECT 1 FROM billinginfo  WHERE "datestamp" < '2022-01-02 11:59:59.9375' or 
> "datestamp" is null
> {code}
> The table is huge and entries go way back before 
>  2022-01-02 11:59:59. So what ends up happening - all workers but one 
> complete and one remaining continues to process that query which, to me, 
> looks like it wants to get all the data before 2022-01-02 11:59:59. Which is 
> not what I intended. 
> I remedied this by changing to:
> {code:python}
>  .option("dbtable", "(select * from billinginfo where datestamp > '2022 
> 01-01-01 00:00:00') as foo") \
> {code}
> And that seem to have solved the issue. But this seems kludgy. Am I doing 
> something wrong or there is a bug in the way partitioning queries are 
> generated?






[jira] [Commented] (SPARK-39915) Dataset.repartition(N) may not create N partitions

2022-08-06 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17576282#comment-17576282
 ] 

Pablo Langa Blanco commented on SPARK-39915:


Hi [~zsxwing] ,

I can't reproduce it. Do you have a typo in the range?
{code:java}
scala> spark.range(0, 10).repartition(5).rdd.getNumPartitions
res53: Int = 5{code}
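
For comparison, range(10, 0) (start greater than end) is an empty Dataset, which 
presumably explains the 0 partitions once the empty relation is optimized:
{code:java}
scala> spark.range(10, 0).count()   // empty range
res54: Long = 0

scala> spark.range(0, 10).repartition(5).rdd.getNumPartitions
res55: Int = 5
{code}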

> Dataset.repartition(N) may not create N partitions
> --
>
> Key: SPARK-39915
> URL: https://issues.apache.org/jira/browse/SPARK-39915
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Shixiong Zhu
>Priority: Major
>
> Looks like there is a behavior change in Dataset.repartition in 3.3.0. For 
> example, `spark.range(10, 0).repartition(5).rdd.getNumPartitions` returns 5 
> in Spark 3.2.0, but 0 in Spark 3.3.0.






[jira] [Comment Edited] (SPARK-39796) Add a regexp_extract variant which returns an array of all the matched capture groups

2022-08-06 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17576281#comment-17576281
 ] 

Pablo Langa Blanco edited comment on SPARK-39796 at 8/6/22 9:17 PM:


Hi [~augustine_theodore] ,

Is this what you are looking for?
{code:java}
scala> val regex = "([A-Za-z]+), [A-Za-z]+, (\\d+)"
regex: String = ([A-Za-z]+), [A-Za-z]+, (\d+)

scala> val df = Seq("Hello, World, 1234", "Good, bye, friend").toDF("a")
df: org.apache.spark.sql.DataFrame = [a: string]

scala> df.withColumn("g1", regexp_extract('a, regex, 1)).withColumn("g2", 
regexp_extract('a, regex, 2)).show
+------------------+-----+----+
|                 a|   g1|  g2|
+------------------+-----+----+
|Hello, World, 1234|Hello|1234|
| Good, bye, friend|     |    |
+------------------+-----+----+{code}


was (Author: planga82):
Hi [~augustine_theodore] ,

Is this what are you loking for?
{code:java}
scala> val regex = "([A-Za-z]+), [A-Za-z]+, (\\d+)"
regex: String = ([A-Za-z]+), [A-Za-z]+, (\d+)

scala> val df = Seq("Hello, World, 1234", "Good, bye, friend").toDF("a")
df: org.apache.spark.sql.DataFrame = [a: string]

scala> df.withColumn("g1", regexp_extract('a, "([A-Za-z]+), [A-Za-z]+, (\\d+)", 
1)).withColumn("g2", regexp_extract('a, regex, 2)).show
+------------------+-----+----+
|                 a|   g1|  g2|
+------------------+-----+----+
|Hello, World, 1234|Hello|1234|
| Good, bye, friend|     |    |
+------------------+-----+----+{code}

> Add a regexp_extract variant which returns an array of all the matched 
> capture groups
> -
>
> Key: SPARK-39796
> URL: https://issues.apache.org/jira/browse/SPARK-39796
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.2
>Reporter: Augustine Theodore Prince
>Priority: Minor
>  Labels: regexp_extract, regexp_extract_all, regexp_replace
>
>  
> regexp_extract only returns a single matched group. In a lot of cases we need 
> to parse the entire string and get all the groups and for that we'll need to 
> call it as many times as there are groups. The regexp_extract_all function 
> doesn't solve this problem as it only works if all the groups have the same 
> regex pattern.
>  
> _Example:_
> I will provide an example and the current workaround that I use to solve this,
> If I have the following dataframe and I would like to match the column 'a' 
> with this pattern
> {code:java}
> "([A-Za-z]+), [A-Za-z]+, (\\d+)"{code}
> |a|
> |Hello, World, 1234|
> |Good, bye, friend|
>  
> My expected output  is as follows:
> |a|extracted_a|
> |Hello, World, 1234|[Hello, 1234]|
> |Good, bye, friend|[]|
>  
> However, to achieve this I have to take the following approach which seems 
> very hackish.
> 1. Use regexp_replace to create a temporary string built using the extracted 
> groups:
> {code:java}
> df.withColumn("extr" , F.regexp_replace("a", "([A-Za-z]+), [A-Za-z]+, 
> (\\d+)", "$1_$2")){code}
> A side effect of regexp_replace is that if the regex fails to match, the 
> entire string is returned.
>  
> |a|extracted_a|
> |Hello, World, 1234|Hello_1234|
> |Good, bye, friend|Good, bye, friend|
> 2. So, to achieve the desired result, a check has to be done to prune the 
> rows that did not match with the pattern :
> {code:java}
> df = df.withColumn("extracted_a" , F.when(F.col("extracted_a")==F.col("a") , 
> None).otherwise(F.col("extracted_a"))){code}
>  
> to get the following intermediate dataframe,
> |a|extracted_a|
> |Hello, World, 1234|Hello_1234|
> |Good, bye, friend|null|
>  
> 3. Before finally splitting the column 'extracted_a' based on underscores
> {code:java}
> df = df.withColumn("extracted_a" , F.split("extracted_a" , "[_]")){code}
> which results in the desired result:
>  
>  
> |a|extracted_a|
> |Hello, World, 1234|[Hello, 1234]|
> |Good, bye, friend|null|
>  






[jira] [Commented] (SPARK-39796) Add a regexp_extract variant which returns an array of all the matched capture groups

2022-08-06 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17576281#comment-17576281
 ] 

Pablo Langa Blanco commented on SPARK-39796:


Hi [~augustine_theodore] ,

Is this what you are looking for?
{code:java}
scala> val regex = "([A-Za-z]+), [A-Za-z]+, (\\d+)"
regex: String = ([A-Za-z]+), [A-Za-z]+, (\d+)

scala> val df = Seq("Hello, World, 1234", "Good, bye, friend").toDF("a")
df: org.apache.spark.sql.DataFrame = [a: string]

scala> df.withColumn("g1", regexp_extract('a, "([A-Za-z]+), [A-Za-z]+, (\\d+)", 
1)).withColumn("g2", regexp_extract('a, regex, 2)).show
+------------------+-----+----+
|                 a|   g1|  g2|
+------------------+-----+----+
|Hello, World, 1234|Hello|1234|
| Good, bye, friend|     |    |
+------------------+-----+----+{code}
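
Wrapping the per-group extractions in array() gets close to the output described 
below (a quick sketch reusing the same df and regex):
{code:java}
scala> import org.apache.spark.sql.functions._

// expected: [Hello, 1234] for the first row and [, ] (two empty strings) for the
// second, since regexp_extract returns "" when the pattern does not match
scala> df.withColumn("extracted_a", array(regexp_extract('a, regex, 1), regexp_extract('a, regex, 2))).show(false)
{code}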

> Add a regexp_extract variant which returns an array of all the matched 
> capture groups
> -
>
> Key: SPARK-39796
> URL: https://issues.apache.org/jira/browse/SPARK-39796
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.2
>Reporter: Augustine Theodore Prince
>Priority: Minor
>  Labels: regexp_extract, regexp_extract_all, regexp_replace
>
>  
> regexp_extract only returns a single matched group. In a lot of cases we need 
> to parse the entire string and get all the groups and for that we'll need to 
> call it as many times as there are groups. The regexp_extract_all function 
> doesn't solve this problem as it only works if all the groups have the same 
> regex pattern.
>  
> _Example:_
> I will provide an example and the current workaround that I use to solve this,
> If I have the following dataframe and I would like to match the column 'a' 
> with this pattern
> {code:java}
> "([A-Za-z]+), [A-Za-z]+, (\\d+)"{code}
> |a|
> |Hello, World, 1234|
> |Good, bye, friend|
>  
> My expected output  is as follows:
> |a|extracted_a|
> |Hello, World, 1234|[Hello, 1234]|
> |Good, bye, friend|[]|
>  
> However, to achieve this I have to take the following approach which seems 
> very hackish.
> 1. Use regexp_replace to create a temporary string built using the extracted 
> groups:
> {code:java}
> df.withColumn("extr" , F.regexp_replace("a", "([A-Za-z]+), [A-Za-z]+, 
> (\\d+)", "$1_$2")){code}
> A side effect of regexp_replace is that if the regex fails to match, the 
> entire string is returned.
>  
> |a|extracted_a|
> |Hello, World, 1234|Hello_1234|
> |Good, bye, friend|Good, bye, friend|
> 2. So, to achieve the desired result, a check has to be done to prune the 
> rows that did not match with the pattern :
> {code:java}
> df = df.withColumn("extracted_a" , F.when(F.col("extracted_a")==F.col("a") , 
> None).otherwise(F.col("extracted_a"))){code}
>  
> to get the following intermediate dataframe,
> |a|extracted_a|
> |Hello, World, 1234|Hello_1234|
> |Good, bye, friend|null|
>  
> 3. Before finally splitting the column 'extracted_a' based on underscores
> {code:java}
> df = df.withColumn("extracted_a" , F.split("extracted_a" , "[_]")){code}
> which results in the desired result:
>  
>  
> |a|extracted_a|
> |Hello, World, 1234|[Hello, 1234]|
> |Good, bye, friend|null|
>  






[jira] [Commented] (SPARK-39426) Subquery star select creates broken plan in case of self join

2022-07-10 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17564735#comment-17564735
 ] 

Pablo Langa Blanco commented on SPARK-39426:


I tested it on master and 3.3.0 and it seems to be fixed.

> Subquery star select creates broken plan in case of self join
> -
>
> Key: SPARK-39426
> URL: https://issues.apache.org/jira/browse/SPARK-39426
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Denis
>Priority: Major
>
> Subquery star select creates broken plan in case of self join
> How to reproduce: 
> {code:java}
> import spark.implicits._
> spark.sparkContext.setCheckpointDir(Files.createTempDirectory("some-prefix").toFile.toString)
> val frame = Seq(1).toDF("id").checkpoint()
> val joined = frame
> .join(frame, Seq("id"), "left")
> .select("id")
> joined
> .join(joined, Seq("id"), "left")
> .as("a")
> .select("a.*"){code}
> This query throws exception: 
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved 
> attribute(s) id#7 missing from id#10,id#11 in operator !Project [id#7, 
> id#10]. Attribute(s) with the same name appear in the operation: id. Please 
> check if the right attribute(s) are used.;
> Project [id#10, id#4]
> +- SubqueryAlias a
>    +- Project [id#10, id#4]
>       +- Join LeftOuter, (id#4 = id#10)
>          :- Project [id#4]
>          :  +- Project [id#7, id#4]
>          :     +- Join LeftOuter, (id#4 = id#7)
>          :        :- LogicalRDD [id#4], false
>          :        +- LogicalRDD [id#7], false
>          +- Project [id#10]
>             +- !Project [id#7, id#10]
>                +- Join LeftOuter, (id#10 = id#11)
>                   :- LogicalRDD [id#10], false
>                   +- LogicalRDD [id#11], false    at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:51)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:50)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:182)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:471)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:94)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:263)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:262)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:262)
>     at scala.collection.Iterator.foreach(Iterator.scala:943)
>     at scala.collection.Iterator.foreach$(Iterator.scala:943)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>     at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>     at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>     at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:262)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:262)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:262)
>     at scala.collection.Iterator.foreach(Iterator.scala:943)
>     at scala.collection.Iterator.foreach$(Iterator.scala:943)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>     at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>     at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>     at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:262)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:262)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:262)
>     at scala.collection.Iterator.foreach(Iterator.scala:943)
>     at scala.collection.Iterator.foreach$(Iterator.scala:943)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>     at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>     at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>     at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:262)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:262)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:262)
>     at scala.collection.Iterator.foreach(Iterator.scala:943)

[jira] [Commented] (SPARK-39623) partitionng by datestamp leads to wrong query on backend?

2022-07-10 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17564730#comment-17564730
 ] 

Pablo Langa Blanco commented on SPARK-39623:


I think the problem here is a misunderstanding of how lowerBound and upperBound 
work. These options only affect how Spark generates the partitions; all the data 
is still returned. They do not act as a filter.

For example with this configuration

{code:java}
.option("numPartitions", "4")
.option("partitionColumn", "t")
.option("lowerBound", "2022-07-10 00:00:00")
.option("upperBound", "2022-07-10 23:59:00")
{code}


The expected filters of the queries are

{code:sql}
WHERE "t" < '2022-07-10 05:59:45' or "t" is null
WHERE "t" >= '2022-07-10 05:59:45' AND "t" < '2022-07-10 11:59:30'
WHERE "t" >= '2022-07-10 11:59:30' AND "t" < '2022-07-10 17:59:15'
WHERE "t" >= '2022-07-10 17:59:15'
{code}


If you want to filter the data, you can do it as you have shown, or by doing 
something like this:


{code:java}
df.where(col("t") >= lit("2022-07-10 00:00:00"))
{code}


and then Spark pushes down this filter and generates these partitions:


{code:sql}
WHERE (("t" IS NOT NULL) AND ("t" >= '2022-07-10 00:00:00.0')) AND ("t" < 
'2022-07-10 05:59:45' or "t" is null)
WHERE (("t" IS NOT NULL) AND ("t" >= '2022-07-10 00:00:00.0')) AND ("t" >= 
'2022-07-10 05:59:45' AND "t" < '2022-07-10 11:59:30')
WHERE (("t" IS NOT NULL) AND ("t" >= '2022-07-10 00:00:00.0')) AND ("t" >= 
'2022-07-10 11:59:30' AND "t" < '2022-07-10 17:59:15')
WHERE (("t" IS NOT NULL) AND ("t" >= '2022-07-10 00:00:00.0')) AND ("t" >= 
'2022-07-10 17:59:15')
{code}
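
To put it together, a minimal Scala sketch (the connection options are the illustrative ones from this ticket):

{code:java}
import org.apache.spark.sql.functions.{col, lit}

// lowerBound/upperBound only shape the per-partition predicates; they do not filter rows.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://example.org:5432/postgres")
  .option("driver", "org.postgresql.Driver")
  .option("dbtable", "billing")
  .option("user", "user")
  .option("numPartitions", "4")
  .option("partitionColumn", "t")
  .option("lowerBound", "2022-07-10 00:00:00")
  .option("upperBound", "2022-07-10 23:59:00")
  .load()

// The explicit filter is what actually restricts the data; Spark pushes it down to the database.
val filtered = df.where(col("t") >= lit("2022-07-10 00:00:00"))
filtered.explain()
{code}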


> partitionng by datestamp leads to wrong query on backend?
> -
>
> Key: SPARK-39623
> URL: https://issues.apache.org/jira/browse/SPARK-39623
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Dmitry
>Priority: Major
>
> Hello,
> I am new to Apache Spark, so please bear with me. I would like to report what 
> seems to me a bug, but maybe I am just misunderstanding something.
> My goal is to run data analysis on a Spark cluster. Data is stored in a 
> PostgreSQL DB. Tables contain timestamped entries (timestamp with time 
> zone).
> The code looks like:
>  {code:python}
> from pyspark.sql import SparkSession
> spark = SparkSession \
> .builder \
> .appName("foo") \
> .config("spark.jars", "/opt/postgresql-42.4.0.jar") \
> .getOrCreate()
> df = spark.read \
>  .format("jdbc") \
>  .option("url", "jdbc:postgresql://example.org:5432/postgres") \
>  .option("dbtable", "billing") \
>  .option("user", "user") \
>  .option("driver", "org.postgresql.Driver") \
>  .option("numPartitions", "4") \
>  .option("partitionColumn", "datestamp") \
>  .option("lowerBound", "2022-01-01 00:00:00") \
>  .option("upperBound", "2022-06-26 23:59:59") \
>  .option("fetchsize", 100) \
>  .load()
> t0 = time.time()
> print("Number of entries is => ", df.count(), " Time to execute ", 
> time.time()-t0)
> ...
> {code}
> datestamp is timestamp with time zone. 
> I see this query on DB backend:
> {code:java}
> SELECT 1 FROM billinginfo  WHERE "datestamp" < '2022-01-02 11:59:59.9375' or 
> "datestamp" is null
> {code}
> The table is huge and entries go way back before 
>  2022-01-02 11:59:59. So what ends up happening is that all workers but one 
> complete, and the remaining one continues to process that query, which, to me, 
> looks like it wants to get all the data before 2022-01-02 11:59:59. That is 
> not what I intended. 
> I remedied this by changing to:
> {code:python}
>  .option("dbtable", "(select * from billinginfo where datestamp > '2022 
> 01-01-01 00:00:00') as foo") \
> {code}
> And that seems to have solved the issue. But this seems kludgy. Am I doing 
> something wrong, or is there a bug in the way partitioning queries are 
> generated?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38714) Interval multiplication error

2022-07-09 Thread Pablo Langa Blanco (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pablo Langa Blanco resolved SPARK-38714.

Resolution: Resolved

> Interval multiplication error
> -
>
> Key: SPARK-38714
> URL: https://issues.apache.org/jira/browse/SPARK-38714
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
> Environment: branch-3.3,  Java 8
>  
>Reporter: chong
>Priority: Major
>
> Code gen has an error when multiplying an interval by a decimal.
>  
> $SPARK_HOME/bin/spark-shell
>  
> import org.apache.spark.sql.Row
> import java.time.Duration
> import java.time.Period
> import org.apache.spark.sql.types._
> val data = Seq(Row(new java.math.BigDecimal("123456789.11")))
> val schema = StructType(Seq(
> StructField("c1", DecimalType(9, 2)),
> ))
> val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
> df.selectExpr("interval '100' second * c1").show(false)
> errors are:
> *{color:#FF}java.lang.AssertionError: assertion failed:{color}*
> Decimal$DecimalIsFractional
> while compiling: 
> during phase: globalPhase=terminal, enteringPhase=jvm
> library version: version 2.12.15
> compiler version: version 2.12.15
> reconstructed args: -classpath -Yrepl-class-based -Yrepl-outdir 
> /tmp/spark-83a0cda4-dd0b-472e-ad8b-a4b33b85f613/repl-06489815-5366-4aa0-9419-f01abda8d041
> last tree to typer: TypeTree(class Byte)
> tree position: line 6 of 
> tree tpe: Byte
> symbol: (final abstract) class Byte in package scala
> symbol definition: final abstract class Byte extends (a ClassSymbol)
> symbol package: scala
> symbol owners: class Byte
> call site: constructor $eval in object $eval in package $line21
> == Source file context for tree position ==
> 3
> 4 object $eval {
> 5 lazy val $result = 
> $line21.$read.INSTANCE.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.res0
> 6 lazy val $print: {_}root{_}.java.lang.String = {
> 7 $line21.$read.INSTANCE.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw
> 8
> 9 ""
> at 
> scala.reflect.internal.SymbolTable.throwAssertionError(SymbolTable.scala:185)
> at scala.reflect.internal.Symbols$Symbol.completeInfo(Symbols.scala:1525)
> at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1514)
> at scala.reflect.internal.Symbols$Symbol.flatOwnerInfo(Symbols.scala:2353)
> at 
> scala.reflect.internal.Symbols$ClassSymbol.companionModule0(Symbols.scala:3346)
> at 
> scala.reflect.internal.Symbols$ClassSymbol.companionModule(Symbols.scala:3348)
> at 
> scala.reflect.internal.Symbols$ModuleClassSymbol.sourceModule(Symbols.scala:3487)
> at 
> scala.reflect.internal.Symbols.$anonfun$forEachRelevantSymbols$1$adapted(Symbols.scala:3802)
> at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
> at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
> at scala.reflect.internal.Symbols.markFlagsCompleted(Symbols.scala:3799)
> at scala.reflect.internal.Symbols.markFlagsCompleted$(Symbols.scala:3805)
> at scala.reflect.internal.SymbolTable.markFlagsCompleted(SymbolTable.scala:28)
> at 
> scala.reflect.internal.pickling.UnPickler$Scan.finishSym$1(UnPickler.scala:324)
> at 
> scala.reflect.internal.pickling.UnPickler$Scan.readSymbol(UnPickler.scala:342)
> at 
> scala.reflect.internal.pickling.UnPickler$Scan.readSymbolRef(UnPickler.scala:645)
> at 
> scala.reflect.internal.pickling.UnPickler$Scan.readType(UnPickler.scala:413)
> at 
> scala.reflect.internal.pickling.UnPickler$Scan.$anonfun$readSymbol$10(UnPickler.scala:357)
> at scala.reflect.internal.pickling.UnPickler$Scan.at(UnPickler.scala:188)
> at 
> scala.reflect.internal.pickling.UnPickler$Scan.readSymbol(UnPickler.scala:357)
> at 
> scala.reflect.internal.pickling.UnPickler$Scan.$anonfun$run$1(UnPickler.scala:96)
> at scala.reflect.internal.pickling.UnPickler$Scan.run(UnPickler.scala:88)
> at scala.reflect.internal.pickling.UnPickler.unpickle(UnPickler.scala:47)
> at 
> scala.tools.nsc.symtab.classfile.ClassfileParser.unpickleOrParseInnerClasses(ClassfileParser.scala:1186)
> at 
> scala.tools.nsc.symtab.classfile.ClassfileParser.parseClass(ClassfileParser.scala:468)
> at 
> scala.tools.nsc.symtab.classfile.ClassfileParser.$anonfun$parse$2(ClassfileParser.scala:161)
> at 
> scala.tools.nsc.symtab.classfile.ClassfileParser.$anonfun$parse$1(ClassfileParser.scala:147)
> at 
> scala.tools.nsc.symtab.classfile.ClassfileParser.parse(ClassfileParser.scala:130)
> at 
> scala.tools.nsc.symtab.SymbolLoaders$ClassfileLoader.doComplete(SymbolLoaders.scala:343)
> at 
> scala.tools.nsc.symtab.SymbolLoaders$SymbolLoader.complete(SymbolLoaders.scala:250)
> at 
> scala.tools.nsc.symtab.SymbolLoaders$SymbolLoader.load(SymbolLoaders.scala:269)
> at 

[jira] [Commented] (SPARK-38714) Interval multiplication error

2022-07-09 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17564619#comment-17564619
 ] 

Pablo Langa Blanco commented on SPARK-38714:


I have tested it in master and branch 3.3 and it's solved. 

> Interval multiplication error
> -
>
> Key: SPARK-38714
> URL: https://issues.apache.org/jira/browse/SPARK-38714
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
> Environment: branch-3.3,  Java 8
>  
>Reporter: chong
>Priority: Major
>
> Code gen has an error when multiplying an interval by a decimal.
>  
> $SPARK_HOME/bin/spark-shell
>  
> import org.apache.spark.sql.Row
> import java.time.Duration
> import java.time.Period
> import org.apache.spark.sql.types._
> val data = Seq(Row(new java.math.BigDecimal("123456789.11")))
> val schema = StructType(Seq(
> StructField("c1", DecimalType(9, 2)),
> ))
> val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
> df.selectExpr("interval '100' second * c1").show(false)
> errors are:
> *{color:#FF}java.lang.AssertionError: assertion failed:{color}*
> Decimal$DecimalIsFractional
> while compiling: 
> during phase: globalPhase=terminal, enteringPhase=jvm
> library version: version 2.12.15
> compiler version: version 2.12.15
> reconstructed args: -classpath -Yrepl-class-based -Yrepl-outdir 
> /tmp/spark-83a0cda4-dd0b-472e-ad8b-a4b33b85f613/repl-06489815-5366-4aa0-9419-f01abda8d041
> last tree to typer: TypeTree(class Byte)
> tree position: line 6 of 
> tree tpe: Byte
> symbol: (final abstract) class Byte in package scala
> symbol definition: final abstract class Byte extends (a ClassSymbol)
> symbol package: scala
> symbol owners: class Byte
> call site: constructor $eval in object $eval in package $line21
> == Source file context for tree position ==
> 3
> 4 object $eval {
> 5 lazy val $result = 
> $line21.$read.INSTANCE.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.res0
> 6 lazy val $print: {_}root{_}.java.lang.String = {
> 7 $line21.$read.INSTANCE.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw
> 8
> 9 ""
> at 
> scala.reflect.internal.SymbolTable.throwAssertionError(SymbolTable.scala:185)
> at scala.reflect.internal.Symbols$Symbol.completeInfo(Symbols.scala:1525)
> at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1514)
> at scala.reflect.internal.Symbols$Symbol.flatOwnerInfo(Symbols.scala:2353)
> at 
> scala.reflect.internal.Symbols$ClassSymbol.companionModule0(Symbols.scala:3346)
> at 
> scala.reflect.internal.Symbols$ClassSymbol.companionModule(Symbols.scala:3348)
> at 
> scala.reflect.internal.Symbols$ModuleClassSymbol.sourceModule(Symbols.scala:3487)
> at 
> scala.reflect.internal.Symbols.$anonfun$forEachRelevantSymbols$1$adapted(Symbols.scala:3802)
> at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
> at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
> at scala.reflect.internal.Symbols.markFlagsCompleted(Symbols.scala:3799)
> at scala.reflect.internal.Symbols.markFlagsCompleted$(Symbols.scala:3805)
> at scala.reflect.internal.SymbolTable.markFlagsCompleted(SymbolTable.scala:28)
> at 
> scala.reflect.internal.pickling.UnPickler$Scan.finishSym$1(UnPickler.scala:324)
> at 
> scala.reflect.internal.pickling.UnPickler$Scan.readSymbol(UnPickler.scala:342)
> at 
> scala.reflect.internal.pickling.UnPickler$Scan.readSymbolRef(UnPickler.scala:645)
> at 
> scala.reflect.internal.pickling.UnPickler$Scan.readType(UnPickler.scala:413)
> at 
> scala.reflect.internal.pickling.UnPickler$Scan.$anonfun$readSymbol$10(UnPickler.scala:357)
> at scala.reflect.internal.pickling.UnPickler$Scan.at(UnPickler.scala:188)
> at 
> scala.reflect.internal.pickling.UnPickler$Scan.readSymbol(UnPickler.scala:357)
> at 
> scala.reflect.internal.pickling.UnPickler$Scan.$anonfun$run$1(UnPickler.scala:96)
> at scala.reflect.internal.pickling.UnPickler$Scan.run(UnPickler.scala:88)
> at scala.reflect.internal.pickling.UnPickler.unpickle(UnPickler.scala:47)
> at 
> scala.tools.nsc.symtab.classfile.ClassfileParser.unpickleOrParseInnerClasses(ClassfileParser.scala:1186)
> at 
> scala.tools.nsc.symtab.classfile.ClassfileParser.parseClass(ClassfileParser.scala:468)
> at 
> scala.tools.nsc.symtab.classfile.ClassfileParser.$anonfun$parse$2(ClassfileParser.scala:161)
> at 
> scala.tools.nsc.symtab.classfile.ClassfileParser.$anonfun$parse$1(ClassfileParser.scala:147)
> at 
> scala.tools.nsc.symtab.classfile.ClassfileParser.parse(ClassfileParser.scala:130)
> at 
> scala.tools.nsc.symtab.SymbolLoaders$ClassfileLoader.doComplete(SymbolLoaders.scala:343)
> at 
> scala.tools.nsc.symtab.SymbolLoaders$SymbolLoader.complete(SymbolLoaders.scala:250)
> at 
> 

[jira] [Commented] (SPARK-37895) Error while joining two tables with non-english field names

2022-03-02 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500419#comment-17500419
 ] 

Pablo Langa Blanco commented on SPARK-37895:


I'm working on a fix

> Error while joining two tables with non-english field names
> ---
>
> Key: SPARK-37895
> URL: https://issues.apache.org/jira/browse/SPARK-37895
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Marina Krasilnikova
>Priority: Minor
>
> While trying to join two tables with non-English field names in PostgreSQL 
> with a query like
> "select view1.`Имя1` , view1.`Имя2`, view2.`Имя3` from view1 left join  view2 
> on view1.`Имя2`=view2.`Имя4`"
> we get an error which says that there is no field "`Имя4`" (the field name is 
> surrounded by backticks).
> It appears that, to get the data from the second table, it constructs a query like
> SELECT "Имя3","Имя4" FROM "public"."tab2"  WHERE ("`Имя4`" IS NOT NULL) 
> and these backticks are redundant in the WHERE clause.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36698) Allow expand 'quotedRegexColumnNames' in all expressions when it’s expanded to one named expression

2021-09-08 Thread Pablo Langa Blanco (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pablo Langa Blanco updated SPARK-36698:
---
Description: 
With
{code:java}
spark.sql.parser.quotedRegexColumnNames=true{code}
regular expressions are not allowed in expressions like
{code:java}
SELECT `col_.*`/exp FROM (SELECT 3 AS col_a, 1 as exp){code}
 I propose to improve this behavior and allow this regular expression when it 
expands to only one named expression. It’s the same behavior in Hive.

*Example 1:* 
{code:java}
SELECT `col_.*`/exp FROM (SELECT 3 AS col_a, 1 as exp){code}
 col_.* expands to col_a: OK

*Example 2:*
{code:java}
SELECT `col_.*`/col_b FROM (SELECT 3 AS col_a, 1 as col_b){code}
col_.* expands to (col_a, col_b): Fail like now

 

Related with SPARK-36488

  was:
With spark.sql.parser.quotedRegexColumnNames=true regular expressions are not 
allowed in expressions like
{code:java}
SELECT `col_.*`/exp FROM (SELECT 3 AS col_a, 1 as exp){code}
 I propose to improve this behavior and allow this regular expression when it 
expands to only one named expression. It’s the same behavior in Hive.

*Example 1:* 
{code:java}
SELECT `col_.*`/exp FROM (SELECT 3 AS col_a, 1 as exp){code}
 col_.* expands to col_a: OK

*Example 2:*
{code:java}
SELECT `col_.*`/col_b FROM (SELECT 3 AS col_a, 1 as col_b){code}
col_.* expands to (col_a, col_b): Fail like now

 

Related with SPARK-36488


> Allow expand 'quotedRegexColumnNames' in all expressions when it’s expanded 
> to one named expression
> ---
>
> Key: SPARK-36698
> URL: https://issues.apache.org/jira/browse/SPARK-36698
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Pablo Langa Blanco
>Priority: Minor
>
> With
> {code:java}
> spark.sql.parser.quotedRegexColumnNames=true{code}
> regular expressions are not allowed in expressions like
> {code:java}
> SELECT `col_.*`/exp FROM (SELECT 3 AS col_a, 1 as exp){code}
>  I propose to improve this behavior and allow this regular expression when it 
> expands to only one named expression. It’s the same behavior in Hive.
> *Example 1:* 
> {code:java}
> SELECT `col_.*`/exp FROM (SELECT 3 AS col_a, 1 as exp){code}
>  col_.* expands to col_a: OK
> *Example 2:*
> {code:java}
> SELECT `col_.*`/col_b FROM (SELECT 3 AS col_a, 1 as col_b){code}
> col_.* expands to (col_a, col_b): Fail like now
>  
> Related with SPARK-36488



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36698) Allow expand 'quotedRegexColumnNames' in all expressions when it’s expanded to one named expression

2021-09-08 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412220#comment-17412220
 ] 

Pablo Langa Blanco commented on SPARK-36698:


I'm working on it

> Allow expand 'quotedRegexColumnNames' in all expressions when it’s expanded 
> to one named expression
> ---
>
> Key: SPARK-36698
> URL: https://issues.apache.org/jira/browse/SPARK-36698
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Pablo Langa Blanco
>Priority: Minor
>
> With spark.sql.parser.quotedRegexColumnNames=true regular expressions are not 
> allowed in expressions like
> {code:java}
> SELECT `col_.*`/exp FROM (SELECT 3 AS col_a, 1 as exp){code}
>  I propose to improve this behavior and allow this regular expression when it 
> expands to only one named expression. It’s the same behavior in Hive.
> *Example 1:* 
> {code:java}
> SELECT `col_.*`/exp FROM (SELECT 3 AS col_a, 1 as exp){code}
>  col_.* expands to col_a: OK
> *Example 2:*
> {code:java}
> SELECT `col_.*`/col_b FROM (SELECT 3 AS col_a, 1 as col_b){code}
> col_.* expands to (col_a, col_b): Fail like now
>  
> Related with SPARK-36488



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36698) Allow expand 'quotedRegexColumnNames' in all expressions when it’s expanded to one named expression

2021-09-08 Thread Pablo Langa Blanco (Jira)
Pablo Langa Blanco created SPARK-36698:
--

 Summary: Allow expand 'quotedRegexColumnNames' in all expressions 
when it’s expanded to one named expression
 Key: SPARK-36698
 URL: https://issues.apache.org/jira/browse/SPARK-36698
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Pablo Langa Blanco


With spark.sql.parser.quotedRegexColumnNames=true regular expressions are not 
allowed in expressions like
{code:java}
SELECT `col_.*`/exp FROM (SELECT 3 AS col_a, 1 as exp){code}
 I propose to improve this behavior and allow this regular expression when it 
expands to only one named expression. It’s the same behavior in Hive.

*Example 1:* 
{code:java}
SELECT `col_.*`/exp FROM (SELECT 3 AS col_a, 1 as exp){code}
 col_.* expands to col_a: OK

*Example 2:*
{code:java}
SELECT `col_.*`/col_b FROM (SELECT 3 AS col_a, 1 as col_b){code}
col_.* expands to (col_a, col_b): Fail like now

 

Related with SPARK-36488



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36488) "Invalid usage of '*' in expression" error due to the feature of 'quotedRegexColumnNames' in some scenarios.

2021-08-23 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403329#comment-17403329
 ] 

Pablo Langa Blanco commented on SPARK-36488:


Yes, I'm with you that it's not very intuitive and it has room for 
improvement. I think it's worth taking a closer look at supporting more 
expressions and getting closer to the Hive feature.

Thanks!

> "Invalid usage of '*' in expression" error due to the feature of 
> 'quotedRegexColumnNames' in some scenarios.
> 
>
> Key: SPARK-36488
> URL: https://issues.apache.org/jira/browse/SPARK-36488
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.8, 3.1.2
>Reporter: merrily01
>Priority: Major
>
>  In some cases, the error happens when the following property is set.
> {code:java}
> spark.sql("set spark.sql.parser.quotedRegexColumnNames=true")
> {code}
> *case 1:* 
> {code:java}
> spark-sql> create table tb_test as select 1 as col_a, 2 as col_b;
> spark-sql> select `tb_test`.`col_a`  from tb_test;
> 1
> spark-sql> set spark.sql.parser.quotedRegexColumnNames=true;
> spark-sql> select `tb_test`.`col_a`  from tb_test;
> Error in query: Invalid usage of '*' in expression 'unresolvedextractvalue'
> {code}
>  
> *case 2:*
> {code:java}
>          > select `col_a`/`col_b` as `col_c` from (select 3 as `col_a` ,  
> 3.14 as `col_b`);
> 0.955414
> spark-sql> set spark.sql.parser.quotedRegexColumnNames=true;
> spark-sql> select `col_a`/`col_b` as `col_c` from (select 3 as `col_a` ,  
> 3.14 as `col_b`);
> Error in query: Invalid usage of '*' in expression 'divide'
> {code}
>  
> This problem exists in 3.X, 2.4.X and master versions. 
>  
> Related issue : 
> https://issues.apache.org/jira/browse/SPARK-12139
> (As can be seen in the latest comments, some people have encountered the same 
> problem)
>  
> Similar problems:
> https://issues.apache.org/jira/browse/SPARK-28897
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35933) PartitionFilters and pushFilters not applied to window functions

2021-08-19 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17401895#comment-17401895
 ] 

Pablo Langa Blanco commented on SPARK-35933:


Can we close this ticket?

> PartitionFilters and pushFilters not applied to window functions
> 
>
> Key: SPARK-35933
> URL: https://issues.apache.org/jira/browse/SPARK-35933
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8, 3.1.2
>Reporter: Shreyas Kothavade
>Priority: Major
>
> Spark does not apply partition and pushed filters when the partition by 
> column and window function partition columns are not the same. For example, 
> in the code below, the data frame is created with a partition on "id". And I 
> use the partitioned data frame to calculate lag which is partitioned by 
> "name". In this case, the query plan shows the partitionFilters and pushed 
> Filters as empty.
> {code:java}
> spark
>   .createDataFrame(
>     Seq(
>       Person(
>         1,
>         "Andy",
>         new Timestamp(1499955986039L),
>         new Timestamp(1499955982342L)
>       ),
>       Person(
>         2,
>         "Jeff",
>         new Timestamp(1499955986339L),
>         new Timestamp(149995598L)
>       )
>     )
>   )
>  .write
>   .partitionBy("id")
>   .mode(SaveMode.Append)
>   .parquet("spark-warehouse/people")
> val dfPeople =
>   spark.read.parquet("spark-warehouse/people")
> dfPeople
>   .select(
>     $"id",
>     $"name",
>     lag(col("ts2"), 1).over(Window.partitionBy("name").orderBy("ts"))
>   )
>   .filter($"id" === 1)
>   .explain()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36488) "Invalid usage of '*' in expression" error due to the feature of 'quotedRegexColumnNames' in some scenarios.

2021-08-18 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17401401#comment-17401401
 ] 

Pablo Langa Blanco commented on SPARK-36488:


Hi [~merrily01] ,

I think these are not bugs. As you can see in the documentation 
([https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select.html]), when 
you set the parameter spark.sql.parser.quotedRegexColumnNames=true, “quoted 
identifiers (using backticks) in SELECT statement are interpreted as regular 
expressions and SELECT statement can take regex-based column specification”.

In case 1, this means that in the expression `tb_test`.`col_a` the identifier tb_test is 
treated as a regular expression that represents a column. The 
“column.field” syntax is used to access StructType columns, and regular 
expressions are not allowed there.

In case 2, a regex can match more than one column; for example `col_.*` is 
resolved to (col_a, col_b), so it does not make sense for the operands of a 
division to be a list of columns, and this is not allowed.

I’m going to open a PR trying to improve the error message to avoid confusion.
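
To make the two cases easier to reproduce, a small spark-shell sketch (Scala; the behaviour is the one described above):

{code:java}
// Enable regex interpretation of backtick-quoted identifiers.
spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true")

// Works: the regex is used directly in the select list and expands to (col_a, col_b).
spark.sql("SELECT `col_.*` FROM (SELECT 1 AS col_a, 2 AS col_b)").show()

// Fails with "Invalid usage of '*' in expression 'divide'": quoted identifiers inside an
// expression are treated as regex expansions, which are not allowed as operands.
// spark.sql("SELECT `col_a`/`col_b` AS col_c FROM (SELECT 3 AS col_a, 3.14 AS col_b)").show()
{code}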

> "Invalid usage of '*' in expression" error due to the feature of 
> 'quotedRegexColumnNames' in some scenarios.
> 
>
> Key: SPARK-36488
> URL: https://issues.apache.org/jira/browse/SPARK-36488
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.8, 3.1.2
>Reporter: merrily01
>Priority: Major
>
>  In some cases, the error happens when the following property is set.
> {code:java}
> spark.sql("set spark.sql.parser.quotedRegexColumnNames=true")
> {code}
> *case 1:* 
> {code:java}
> spark-sql> create table tb_test as select 1 as col_a, 2 as col_b;
> spark-sql> select `tb_test`.`col_a`  from tb_test;
> 1
> spark-sql> set spark.sql.parser.quotedRegexColumnNames=true;
> spark-sql> select `tb_test`.`col_a`  from tb_test;
> Error in query: Invalid usage of '*' in expression 'unresolvedextractvalue'
> {code}
>  
> *case 2:*
> {code:java}
>          > select `col_a`/`col_b` as `col_c` from (select 3 as `col_a` ,  
> 3.14 as `col_b`);
> 0.955414
> spark-sql> set spark.sql.parser.quotedRegexColumnNames=true;
> spark-sql> select `col_a`/`col_b` as `col_c` from (select 3 as `col_a` ,  
> 3.14 as `col_b`);
> Error in query: Invalid usage of '*' in expression 'divide'
> {code}
>  
> This problem exists in 3.X, 2.4.X and master versions. 
>  
> Related issue : 
> https://issues.apache.org/jira/browse/SPARK-12139
> (As can be seen in the latest comments, some people have encountered the same 
> problem)
>  
> Similar problems:
> https://issues.apache.org/jira/browse/SPARK-28897
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36453) Improve consistency processing floating point special literals

2021-08-08 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395658#comment-17395658
 ] 

Pablo Langa Blanco commented on SPARK-36453:


I'm working on it

> Improve consistency processing floating point special literals
> --
>
> Key: SPARK-36453
> URL: https://issues.apache.org/jira/browse/SPARK-36453
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Pablo Langa Blanco
>Priority: Minor
>
> Special floating point literals are not handled consistently between cast and JSON 
> expressions
>  
> {code:java}
> scala> spark.sql("SELECT CAST('+Inf' as Double)").show
> ++
> |CAST(+Inf AS DOUBLE)|
> ++
> |            Infinity|
> ++
> {code}
>  
> {code:java}
> scala> val schema =  StructType(StructField("a", DoubleType) :: Nil)
> scala> Seq("""{"a" : 
> "+Inf"}""").toDF("col1").select(from_json(col("col1"),schema)).show
> +---+
> |from_json(col1)|
> +---+
> |         {null}|
> +---+
> scala> Seq("""{"a" : "+Inf"}""").toDF("col").withColumn("col", 
> from_json(col("col"), StructType.fromDDL("a 
> DOUBLE"))).write.json("/tmp/jsontests12345")
> scala> 
> spark.read.schema(StructType(Seq(StructField("col", schema)))).json("/tmp/jsontests12345").show
> +--+
> |   col|
> +--+
> |{null}|
> +--+
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36453) Improve consistency processing floating point special literals

2021-08-08 Thread Pablo Langa Blanco (Jira)
Pablo Langa Blanco created SPARK-36453:
--

 Summary: Improve consistency processing floating point special 
literals
 Key: SPARK-36453
 URL: https://issues.apache.org/jira/browse/SPARK-36453
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Pablo Langa Blanco


Special floating point literals are not handled consistently between cast and JSON 
expressions

 
{code:java}
scala> spark.sql("SELECT CAST('+Inf' as Double)").show
++
|CAST(+Inf AS DOUBLE)|
++
|            Infinity|
++
{code}
 
{code:java}
scala> val schema =  StructType(StructField("a", DoubleType) :: Nil)

scala> Seq("""{"a" : 
"+Inf"}""").toDF("col1").select(from_json(col("col1"),schema)).show

+---+
|from_json(col1)|
+---+
|         {null}|
+---+

scala> Seq("""{"a" : "+Inf"}""").toDF("col").withColumn("col", 
from_json(col("col"), StructType.fromDDL("a 
DOUBLE"))).write.json("/tmp/jsontests12345")
scala> 
spark.read.schema(StructType(Seq(StructField("col", schema)))).json("/tmp/jsontests12345").show

+--+
|   col|
+--+
|{null}|
+--+

{code}
 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35320) from_json cannot parse maps with timestamp as key

2021-05-16 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17345782#comment-17345782
 ] 

Pablo Langa Blanco commented on SPARK-35320:


I'm taking a look at this
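
For what it's worth, a more compact Scala sketch of the same shape of input (assuming it exercises the same from_json key-conversion path as the Java repro in the issue below):

{code:java}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{MapType, StringType, StructType, TimestampType}

// A JSON map whose keys are formatted timestamps, declared as map<timestamp, string>.
val df = Seq("""{ "map": { "2021-05-05T20:05:08": "sampleValue" } }""").toDF("value")
val schema = new StructType().add("map", MapType(TimestampType, StringType))
df.select(from_json(col("value"), schema)).show(false)
{code}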

> from_json cannot parse maps with timestamp as key
> -
>
> Key: SPARK-35320
> URL: https://issues.apache.org/jira/browse/SPARK-35320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.1
> Environment: * Java 11
>  * Spark 3.0.1/3.1.1
>  * Scala 2.12
>Reporter: Vincenzo Cerminara
>Priority: Minor
>
> I have a json that contains a {{map}} like the following
> {code:json}
> {
>   "map": {
> "2021-05-05T20:05:08": "sampleValue"
>   }
> }
> {code}
> The key of the map is a string containing a formatted timestamp and I want to 
> parse it as a Java Map using the from_json 
> Spark SQL function (see the {{Sample}} class in the code below).
> {code:java}
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Encoders;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> import java.io.Serializable;
> import java.time.Instant;
> import java.util.List;
> import java.util.Map;
> import static org.apache.spark.sql.functions.*;
> public class TimestampAsJsonMapKey {
> public static class Sample implements Serializable {
> private Map<Instant, String> map;
> 
> public Map<Instant, String> getMap() {
> return map;
> }
> 
> public void setMap(Map<Instant, String> map) {
> this.map = map;
> }
> }
> public static class InvertedSample implements Serializable {
> private Map<String, Instant> map;
> 
> public Map<String, Instant> getMap() {
> return map;
> }
> 
> public void setMap(Map<String, Instant> map) {
> this.map = map;
> }
> }
> public static void main(String[] args) {
> final SparkSession spark = SparkSession
> .builder()
> .appName("Timestamp As Json Map Key Test")
> .master("local[1]")
> .getOrCreate();
> workingTest(spark);
> notWorkingTest(spark);
> }
> private static void workingTest(SparkSession spark) {
> //language=JSON
> final String invertedSampleJson = "{ \"map\": { \"sampleValue\": 
> \"2021-05-05T20:05:08\" } }";
> final Dataset samplesDf = 
> spark.createDataset(List.of(invertedSampleJson), Encoders.STRING());
> final Dataset parsedDf = 
> samplesDf.select(from_json(col("value"), 
> Encoders.bean(InvertedSample.class).schema()));
> parsedDf.show(false);
> }
> private static void notWorkingTest(SparkSession spark) {
> //language=JSON
> final String sampleJson = "{ \"map\": { \"2021-05-05T20:05:08\": 
> \"sampleValue\" } }";
> final Dataset samplesDf = 
> spark.createDataset(List.of(sampleJson), Encoders.STRING());
> final Dataset parsedDf = 
> samplesDf.select(from_json(col("value"), 
> Encoders.bean(Sample.class).schema()));
> parsedDf.show(false);
> }
> }
> {code}
> When I run the {{notWorkingTest}} method it fails with the following 
> exception:
> {noformat}
> Exception in thread "main" java.lang.ClassCastException: class 
> org.apache.spark.unsafe.types.UTF8String cannot be cast to class 
> java.lang.Long (org.apache.spark.unsafe.types.UTF8String is in unnamed module 
> of loader 'app'; java.lang.Long is in module java.base of loader 'bootstrap')
>   at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:107)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$8$adapted(Cast.scala:297)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:285)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$7(Cast.scala:297)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$12(Cast.scala:329)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:285)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$11(Cast.scala:321)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$14(Cast.scala:359)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:285)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$13(Cast.scala:352)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastBase.nullSafeEval(Cast.scala:815)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:156)
>   at 
> 

[jira] [Commented] (SPARK-35207) hash() and other hash builtins do not normalize negative zero

2021-05-03 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338698#comment-17338698
 ] 

Pablo Langa Blanco commented on SPARK-35207:


Hi [~tarmstrong] ,

I have read about signed zero and it's a little bit tricky. From what I have 
read about IEEE 754 and its implementation in Java, I understand the same as you: 
it's not consistent to have different hash values for 0 and -0. This applies to 
both double and float.

Here is an extract from IEEE 754: "The two zeros are distinguishable 
arithmetically only by either division-by-zero (producing appropriately signed 
infinities) or else by the CopySign function recommended by IEEE 754/854. 
Infinities, SNaNs, NaNs and Subnormal numbers necessitate four more special 
cases"

I will raise a PR with this.
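
For reference, a small Scala sketch of what the inconsistency looks like at the bit level, and of the kind of normalization I have in mind (the normalize helper below is only illustrative, not the actual Spark code):

{code:java}
// 0.0 and -0.0 compare equal, but their bit patterns differ, so a
// bit-pattern-based hash treats them as different values.
val posBits = java.lang.Double.doubleToRawLongBits(0.0)   // 0L
val negBits = java.lang.Double.doubleToRawLongBits(-0.0)  // sign bit set
assert(0.0 == -0.0)
assert(posBits != negBits)

// Illustrative normalization before hashing: map -0.0 to +0.0.
def normalize(d: Double): Double = if (d == 0.0d) 0.0d else d
assert(java.lang.Double.doubleToRawLongBits(normalize(-0.0)) ==
  java.lang.Double.doubleToRawLongBits(normalize(0.0)))
{code}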

Regards

> hash() and other hash builtins do not normalize negative zero
> -
>
> Key: SPARK-35207
> URL: https://issues.apache.org/jira/browse/SPARK-35207
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Tim Armstrong
>Priority: Major
>  Labels: correctness
>
> I would generally expect that {{x = y => hash( x ) = hash( y )}}. However +-0 
> hash to different values for floating point types. 
> {noformat}
> scala> spark.sql("select hash(cast('0.0' as double)), hash(cast('-0.0' as 
> double))").show
> +-+--+
> |hash(CAST(0.0 AS DOUBLE))|hash(CAST(-0.0 AS DOUBLE))|
> +-+--+
> |  -1670924195|-853646085|
> +-+--+
> scala> spark.sql("select cast('0.0' as double) == cast('-0.0' as 
> double)").show
> ++
> |(CAST(0.0 AS DOUBLE) = CAST(-0.0 AS DOUBLE))|
> ++
> |true|
> ++
> {noformat}
> I'm not sure how likely this is to cause issues in practice, since only a 
> limited number of calculations can produce -0 and joining or aggregating with 
> floating point keys is a bad practice as a general rule, but I think it would 
> be safer if we normalised -0.0 to +0.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35249) to_timestamp can't parse 6 digit microsecond SSSSSS

2021-05-02 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338139#comment-17338139
 ] 

Pablo Langa Blanco commented on SPARK-35249:


Hi [~toopt4],

I think there are two different issues here. First, the correct pattern for 
this string is as follows.
{code:java}
select x, to_timestamp(x,"MMM dd yyyy hh:mm:ss.SSSSSSa") from (select 'Apr 13 
2021 12:00:00.001000AM' x);
{code}
The other problem is that in Spark 2.4.6 this pattern ('S' repeated more than five 
times, i.e. SSSSSS) apparently doesn't work. But I have tested it in version 3.1.1 
and it works fine.
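
The same check through the DataFrame API on Spark 3.x, as a quick sketch:

{code:java}
import org.apache.spark.sql.functions.{col, to_timestamp}

val df = Seq("Apr 13 2021 12:00:00.001000AM").toDF("x")
// Six 'S' letters for the microsecond part plus 'a' for the AM/PM marker.
df.select(to_timestamp(col("x"), "MMM dd yyyy hh:mm:ss.SSSSSSa")).show(false)
{code}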

Regards

> to_timestamp can't parse 6 digit microsecond SSSSSS
> ---
>
> Key: SPARK-35249
> URL: https://issues.apache.org/jira/browse/SPARK-35249
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.4.6
>Reporter: t oo
>Priority: Major
>
> spark-sql> select x, to_timestamp(x,"MMM dd yyyy hh:mm:ss.SSSSSS") from 
> (select 'Apr 13 2021 12:00:00.001000AM' x);
> Apr 13 2021 12:00:00.001000AM NULL
>  
> Why doesn't the to_timestamp work?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34464) `first` function is sorting the dataset while sometimes it is used to get "any value"

2021-04-06 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17315782#comment-17315782
 ] 

Pablo Langa Blanco commented on SPARK-34464:


Hi [~lfruleux] ,

Thank you for reporting it! I think it's a good point to try to improve. Yeah, 
Last and other functions are in the same situation, so if we go forward with 
these changes we could apply the same to those functions. All help is welcome! :D
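
For example, a quick spark-shell sketch to check that Last currently falls back to SortAggregate in the same way (plan output varies by version):

{code:java}
import org.apache.spark.sql.functions.last

val df = Seq(("a", "a")).toDF("c0", "c1")
// Like First, Last is a DeclarativeAggregate, so string-typed buffers end up in SortAggregate.
df.groupBy("c0").agg(last("c1")).explain()
{code}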

> `first` function is sorting the dataset while sometimes it is used to get 
> "any value"
> -
>
> Key: SPARK-34464
> URL: https://issues.apache.org/jira/browse/SPARK-34464
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Louis Fruleux
>Priority: Minor
>
> When one wants to groupBy and take any value (not necessarily the first), one 
> usually uses 
> [first|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L485]
>  aggregation function.
> Unfortunately, this method uses a `SortAggregate` for some data types, which 
> is not always necessary and might impact performances. Is this the desired 
> behavior?
>  
>  
> {code:java}
> Current behavior:
>  val df = Seq((0, "value")).toDF("key", "value")
> df.groupBy("key").agg(first("value")).explain()
>  /*
>  == Physical Plan ==
>  SortAggregate(key=key#342, functions=first(value#343, false))
>  +- *(2) Sort key#342 ASC NULLS FIRST, false, 0
>     +- Exchange hashpartitioning(key#342, 200)
>        +- SortAggregate(key=key#342, functions=partial_first(value#343, 
> false))
>           +- *(1) Sort key#342 ASC NULLS FIRST, false, 0
>              +- LocalTableScan key#342, value#343
>  */
> {code}
>  
> My understanding of the source code does not allow me to fully understand why 
> this is the current behavior.
> The solution might be to implement a new aggregate function. But the code 
> would be highly similar to the first one. And if I don't fully understand why 
> this 
> [createAggregate|https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala#L45]
>  method falls back to SortAggregate.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34464) `first` function is sorting the dataset while sometimes it is used to get "any value"

2021-04-05 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17315152#comment-17315152
 ] 

Pablo Langa Blanco commented on SPARK-34464:


I've opened SPARK-34961 because I think it's an improvement, not a bug, so I will 
raise a PR there.

> `first` function is sorting the dataset while sometimes it is used to get 
> "any value"
> -
>
> Key: SPARK-34464
> URL: https://issues.apache.org/jira/browse/SPARK-34464
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Louis Fruleux
>Priority: Minor
>
> When one wants to groupBy and take any value (not necessarily the first), one 
> usually uses 
> [first|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L485]
>  aggregation function.
> Unfortunately, this method uses a `SortAggregate` for some data types, which 
> is not always necessary and might impact performances. Is this the desired 
> behavior?
>  
>  
> {code:java}
> Current behavior:
>  val df = Seq((0, "value")).toDF("key", "value")
> df.groupBy("key").agg(first("value")).explain()
>  /*
>  == Physical Plan ==
>  SortAggregate(key=key#342, functions=first(value#343, false))
>  +- *(2) Sort key#342 ASC NULLS FIRST, false, 0
>     +- Exchange hashpartitioning(key#342, 200)
>        +- SortAggregate(key=key#342, functions=partial_first(value#343, 
> false))
>           +- *(1) Sort key#342 ASC NULLS FIRST, false, 0
>              +- LocalTableScan key#342, value#343
>  */
> {code}
>  
> My understanding of the source code does not allow me to fully understand why 
> this is the current behavior.
> The solution might be to implement a new aggregate function. But the code 
> would be highly similar to the first one. And if I don't fully understand why 
> this 
> [createAggregate|https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala#L45]
>  method falls back to SortAggregate.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34464) `first` function is sorting the dataset while sometimes it is used to get "any value"

2021-04-05 Thread Pablo Langa Blanco (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pablo Langa Blanco resolved SPARK-34464.

Resolution: Not A Bug

> `first` function is sorting the dataset while sometimes it is used to get 
> "any value"
> -
>
> Key: SPARK-34464
> URL: https://issues.apache.org/jira/browse/SPARK-34464
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Louis Fruleux
>Priority: Minor
>
> When one wants to groupBy and take any value (not necessarily the first), one 
> usually uses 
> [first|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L485]
>  aggregation function.
> Unfortunately, this method uses a `SortAggregate` for some data types, which 
> is not always necessary and might impact performances. Is this the desired 
> behavior?
>  
>  
> {code:java}
> Current behavior:
>  val df = Seq((0, "value")).toDF("key", "value")
> df.groupBy("key").agg(first("value")).explain()
>  /*
>  == Physical Plan ==
>  SortAggregate(key=key#342, functions=first(value#343, false))
>  +- *(2) Sort key#342 ASC NULLS FIRST, false, 0
>     +- Exchange hashpartitioning(key#342, 200)
>        +- SortAggregate(key=key#342, functions=partial_first(value#343, 
> false))
>           +- *(1) Sort key#342 ASC NULLS FIRST, false, 0
>              +- LocalTableScan key#342, value#343
>  */
> {code}
>  
> My understanding of the source code does not allow me to fully understand why 
> this is the current behavior.
> The solution might be to implement a new aggregate function. But the code 
> would be highly similar to the first one. And if I don't fully understand why 
> this 
> [createAggregate|https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala#L45]
>  method falls back to SortAggregate.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34961) Migrate First function from DeclarativeAggregate to TypedImperativeAggregate to improve performance

2021-04-05 Thread Pablo Langa Blanco (Jira)
Pablo Langa Blanco created SPARK-34961:
--

 Summary: Migrate First function from DeclarativeAggregate to 
TypedImperativeAggregate to improve performance
 Key: SPARK-34961
 URL: https://issues.apache.org/jira/browse/SPARK-34961
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Pablo Langa Blanco


The main objective of this change is to improve performance in some cases.

We have three possibilities when we plan an aggregation. In the first case, 
with mutable primitive types, HashAggregate is used.

When we are not using these types we have two options. If the function 
implements TypedImperativeAggregate we use ObjectHashAggregate. Otherwise, we 
use SortAggregate, which is less efficient.

In this PR I propose to migrate the First function to implement 
TypedImperativeAggregate to take advantage of this feature (ObjectHashAggregateExec).

This Jira is related to SPARK-34464

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34464) `first` function is sorting the dataset while sometimes it is used to get "any value"

2021-03-30 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311845#comment-17311845
 ] 

Pablo Langa Blanco commented on SPARK-34464:


Hi [~lfruleux],

Here is a link that explains well when the different types of 
aggregation are applied. 

[https://www.waitingforcode.com/apache-spark-sql/aggregations-execution-apache-spark-sql/read]

In the case you describe there are two things that make the aggregation fall back 
to SortAggregate. The first is that the aggregation types are not 
mutable primitive types (necessary for HashAggregate). The next fallback is 
ObjectHashAggregate, but in this case the first function is not supported by 
ObjectHashAggregate because it's not a TypedImperativeAggregate, so it falls back 
to SortAggregate.

I don't know if there is a reason for this; I'm going to take a look at whether it's 
possible to implement First as a TypedImperativeAggregate so that it can use 
ObjectHashAggregate.

Thanks!
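
A quick way to see the difference in the physical plan (a sketch; collect_list is used here only because it already implements TypedImperativeAggregate):

{code:java}
import org.apache.spark.sql.functions.{collect_list, first}

val df = Seq(("a", "a")).toDF("c0", "c1")

// First is a DeclarativeAggregate over a non-mutable (string) buffer,
// so the planner falls back to SortAggregate.
df.groupBy("c0").agg(first("c1")).explain()

// collect_list implements TypedImperativeAggregate, so the same query shape
// can be planned with ObjectHashAggregate instead.
df.groupBy("c0").agg(collect_list("c1")).explain()
{code}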

> `first` function is sorting the dataset while sometimes it is used to get 
> "any value"
> -
>
> Key: SPARK-34464
> URL: https://issues.apache.org/jira/browse/SPARK-34464
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Louis Fruleux
>Priority: Minor
>
> When one wants to groupBy and take any value (not necessarily the first), one 
> usually uses 
> [first|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L485]
>  aggregation function.
> Unfortunately, this method uses a `SortAggregate` for some data types, which 
> is not always necessary and might impact performances. Is this the desired 
> behavior?
>  
>  
> {code:java}
> Current behavior:
>  val df = Seq((0, "value")).toDF("key", "value")
> df.groupBy("key").agg(first("value")).explain()
>  /*
>  == Physical Plan ==
>  SortAggregate(key=key#342, functions=first(value#343, false))
>  +- *(2) Sort key#342 ASC NULLS FIRST, false, 0
>     +- Exchange hashpartitioning(key#342, 200)
>        +- SortAggregate(key=key#342, functions=partial_first(value#343, 
> false))
>           +- *(1) Sort key#342 ASC NULLS FIRST, false, 0
>              +- LocalTableScan key#342, value#343
>  */
> {code}
>  
> My understanding of the source code does not allow me to fully understand why 
> this is the current behavior.
> The solution might be to implement a new aggregate function. But the code 
> would be highly similar to the first one. And if I don't fully understand why 
> this 
> [createAggregate|https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala#L45]
>  method falls back to SortAggregate.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33744) Canonicalization error in SortAggregate

2020-12-13 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248574#comment-17248574
 ] 

Pablo Langa Blanco commented on SPARK-33744:


I have seen the problem, but I don't have enough knowledge to give a solution. 
Sorry

> Canonicalization error in SortAggregate
> ---
>
> Key: SPARK-33744
> URL: https://issues.apache.org/jira/browse/SPARK-33744
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Andy Grove
>Priority: Minor
>
> The canonicalization plan for a simple aggregate query is different each time 
> for SortAggregate but not for HashAggregate.
> The issue can be demonstrated by adding the following unit tests to 
> SQLQuerySuite. The HashAggregate test passes and the SortAggregate test fails.
> The first test has numeric input and the second test is operating on strings, 
> which forces the use of SortAggregate rather than HashAggregate.
> {code:java}
> test("HashAggregate canonicalization") {
>   val data = Seq((1, 1)).toDF("c0", "c1")
>   val df1 = data.groupBy(col("c0")).agg(first("c1"))
>   val df2 = data.groupBy(col("c0")).agg(first("c1"))
>   assert(df1.queryExecution.executedPlan.canonicalized ==
>   df2.queryExecution.executedPlan.canonicalized)
> }
> test("SortAggregate canonicalization") {
>   val data = Seq(("a", "a")).toDF("c0", "c1")
>   val df1 = data.groupBy(col("c0")).agg(first("c1"))
>   val df2 = data.groupBy(col("c0")).agg(first("c1"))
>   assert(df1.queryExecution.executedPlan.canonicalized ==
>   df2.queryExecution.executedPlan.canonicalized)
> } {code}
> The SortAggregate test fails with the following output .
> {code:java}
> SortAggregate(key=[none#0], functions=[first(none#0, false)], output=[none#0, 
> #1])
> +- *(2) Sort [none#0 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(none#0, 5), ENSURE_REQUIREMENTS, [id=#105]
>   +- SortAggregate(key=[none#0], functions=[partial_first(none#1, 
> false)], output=[none#0, none#2, none#3])
>  +- *(1) Sort [none#0 ASC NULLS FIRST], false, 0
> +- *(1) Project [none#0 AS #0, none#1 AS #1]
>+- *(1) LocalTableScan [none#0, none#1]
>  did not equal 
> SortAggregate(key=[none#0], functions=[first(none#0, false)], output=[none#0, 
> #1])
> +- *(2) Sort [none#0 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(none#0, 5), ENSURE_REQUIREMENTS, [id=#148]
>   +- SortAggregate(key=[none#0], functions=[partial_first(none#1, 
> false)], output=[none#0, none#2, none#3])
>  +- *(1) Sort [none#0 ASC NULLS FIRST], false, 0
> +- *(1) Project [none#0 AS #0, none#1 AS #1]
>+- *(1) LocalTableScan [none#0, none#1] {code}
> The error is caused by the resultExpression for the aggregate function being 
> assigned a new ExprId in the final aggregate.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33744) Canonicalization error in SortAggregate

2020-12-12 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248417#comment-17248417
 ] 

Pablo Langa Blanco commented on SPARK-33744:


I'm taking a look at it, thanks for reporting.

 

> Canonicalization error in SortAggregate
> ---
>
> Key: SPARK-33744
> URL: https://issues.apache.org/jira/browse/SPARK-33744
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Andy Grove
>Priority: Minor
>
> The canonicalization plan for a simple aggregate query is different each time 
> for SortAggregate but not for HashAggregate.
> The issue can be demonstrated by adding the following unit tests to 
> SQLQuerySuite. The HashAggregate test passes and the SortAggregate test fails.
> The first test has numeric input and the second test is operating on strings, 
> which forces the use of SortAggregate rather than HashAggregate.
> {code:java}
> test("HashAggregate canonicalization") {
>   val data = Seq((1, 1)).toDF("c0", "c1")
>   val df1 = data.groupBy(col("c0")).agg(first("c1"))
>   val df2 = data.groupBy(col("c0")).agg(first("c1"))
>   assert(df1.queryExecution.executedPlan.canonicalized ==
>   df2.queryExecution.executedPlan.canonicalized)
> }
> test("SortAggregate canonicalization") {
>   val data = Seq(("a", "a")).toDF("c0", "c1")
>   val df1 = data.groupBy(col("c0")).agg(first("c1"))
>   val df2 = data.groupBy(col("c0")).agg(first("c1"))
>   assert(df1.queryExecution.executedPlan.canonicalized ==
>   df2.queryExecution.executedPlan.canonicalized)
> } {code}
> The SortAggregate test fails with the following output .
> {code:java}
> SortAggregate(key=[none#0], functions=[first(none#0, false)], output=[none#0, 
> #1])
> +- *(2) Sort [none#0 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(none#0, 5), ENSURE_REQUIREMENTS, [id=#105]
>   +- SortAggregate(key=[none#0], functions=[partial_first(none#1, 
> false)], output=[none#0, none#2, none#3])
>  +- *(1) Sort [none#0 ASC NULLS FIRST], false, 0
> +- *(1) Project [none#0 AS #0, none#1 AS #1]
>+- *(1) LocalTableScan [none#0, none#1]
>  did not equal 
> SortAggregate(key=[none#0], functions=[first(none#0, false)], output=[none#0, 
> #1])
> +- *(2) Sort [none#0 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(none#0, 5), ENSURE_REQUIREMENTS, [id=#148]
>   +- SortAggregate(key=[none#0], functions=[partial_first(none#1, 
> false)], output=[none#0, none#2, none#3])
>  +- *(1) Sort [none#0 ASC NULLS FIRST], false, 0
> +- *(1) Project [none#0 AS #0, none#1 AS #1]
>+- *(1) LocalTableScan [none#0, none#1] {code}
> The error is caused by the resultExpression for the aggregate function being 
> assigned a new ExprId in the final aggregate.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33118) CREATE TEMPORARY TABLE fails with location

2020-10-12 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17212242#comment-17212242
 ] 

Pablo Langa Blanco commented on SPARK-33118:


I'm working on it
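
In the meantime, a common way to get an equivalent temporary view over the same path is to read the files directly (a workaround sketch, not the fix for this issue):

{code:java}
// Read the parquet directory directly and register it as a temporary view.
spark.read.parquet("/data/tmp/testspark1").createOrReplaceTempView("t")
spark.sql("SELECT * FROM t").show()
{code}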

> CREATE TEMPORARY TABLE fails with location
> --
>
> Key: SPARK-33118
> URL: https://issues.apache.org/jira/browse/SPARK-33118
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Pablo Langa Blanco
>Priority: Major
>
> The problem is produced when you use CREATE TEMPORARY TABLE with LOCATION
>  
> {code:java}
> spark.range(3).write.parquet("/data/tmp/testspark1")
> spark.sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path 
> '/data/tmp/testspark1')")
> spark.sql("CREATE TEMPORARY TABLE t USING parquet LOCATION 
> '/data/tmp/testspark1'")
> {code}
> The error message in both cases is 
> {code:java}
> org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. 
> It must be specified manually.;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)
>   at 
> org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:94)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>   at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
>   at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:229)
>   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
>   at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33118) CREATE TEMPORARY TABLE fails with location

2020-10-12 Thread Pablo Langa Blanco (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pablo Langa Blanco updated SPARK-33118:
---
Description: 
The problem is produced when you use CREATE TEMPORARY TABLE with LOCATION

 
{code:java}
spark.range(3).write.parquet("/data/tmp/testspark1")

spark.sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path 
'/data/tmp/testspark1')")
spark.sql("CREATE TEMPORARY TABLE t USING parquet LOCATION 
'/data/tmp/testspark1'")
{code}
The error message in both cases is 
{code:java}
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It 
must be specified manually.;
  at 
org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200)
  at scala.Option.getOrElse(Option.scala:189)
  at 
org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200)
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)
  at 
org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:94)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
  at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
  at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:229)
  at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
{code}
 

  was:
The problem is produced when you use CREATE TEMPORARY TABLE with LOCATION

 
{code:java}
spark.range(3).write.parquet("/data/tmp/testspark1")

spark.sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path 
'/data/tmp/testspark1')")
spark.sql("CREATE TEMPORARY TABLE t USING parquet LOCATION 
'/data/tmp/testspark1'")
{code}
The error message in both cases is

 

 
{code:java}
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It 
must be specified manually.;
  at 
org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200)
  at scala.Option.getOrElse(Option.scala:189)
  at 
org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200)
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)
  at 
org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:94)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
  at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
  at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:229)
  at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
  at 

[jira] [Created] (SPARK-33118) CREATE TEMPORARY TABLE fails with location

2020-10-12 Thread Pablo Langa Blanco (Jira)
Pablo Langa Blanco created SPARK-33118:
--

 Summary: CREATE TEMPORARY TABLE fails with location
 Key: SPARK-33118
 URL: https://issues.apache.org/jira/browse/SPARK-33118
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1, 3.0.0
Reporter: Pablo Langa Blanco


The problem is produced when you use CREATE TEMPORARY TABLE with LOCATION

 
{code:java}
spark.range(3).write.parquet("/data/tmp/testspark1")

spark.sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path 
'/data/tmp/testspark1')")
spark.sql("CREATE TEMPORARY TABLE t USING parquet LOCATION 
'/data/tmp/testspark1'")
{code}
The error message in both cases is

 

 
{code:java}
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It 
must be specified manually.;
  at 
org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200)
  at scala.Option.getOrElse(Option.scala:189)
  at 
org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200)
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)
  at 
org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:94)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
  at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
  at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:229)
  at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
{code}
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls

2020-09-29 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203733#comment-17203733
 ] 

Pablo Langa Blanco commented on SPARK-32046:


When you cache a DataFrame, what is stored is not the name of the DataFrame but a 
simplified (canonicalized) plan. Any other DataFrame with the same canonicalized 
plan will then reuse the cached data. In our example:

 
{code:java}
scala> val df1 = spark.range(1).select(current_timestamp as "datetime")

scala> df1.queryExecution.analyzed.canonicalized
res0: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Project [current_timestamp() AS #0]
+- Range (0, 1, step=1, splits=Some(4))

scala> val df2 = spark.range(1).select(current_timestamp as "datetime")

scala>  df2.queryExecution.analyzed.canonicalized
Project [current_timestamp() AS #0]
+- Range (0, 1, step=1, splits=Some(4))

scala> df1.queryExecution.analyzed.canonicalized == 
df2.queryExecution.analyzed.canonicalized
res2: Boolean = true

scala> df2.explain(true)
== Optimized Logical Plan ==
Project [160136303378 AS datetime#6]
+- Range (0, 1, step=1, splits=Some(4))

scala> df1.cache

scala> df2.explain(true)
== Optimized Logical Plan ==
InMemoryRelation [datetime#6], true, 1, StorageLevel(disk, memory, 
deserialized, 1 replicas)
   +- *(1) Project [1601363046569000 AS datetime#2]
  +- *(1) Range (0, 1, step=1, splits=4)


{code}
What do you mean by the Java implementation? And what is your actual workaround? 
I am just asking to make sure I am not missing something.

 

> current_timestamp called in a cache dataframe freezes the time for all future 
> calls
> ---
>
> Key: SPARK-32046
> URL: https://issues.apache.org/jira/browse/SPARK-32046
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.4, 3.0.0
>Reporter: Dustin Smith
>Priority: Minor
>  Labels: caching, sql, time
>
> If I call current_timestamp 3 times while caching the dataframe variable in 
> order to freeze that dataframe's time, the 3rd dataframe time and beyond 
> (4th, 5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe 
> and the 2nd will differ in time but will become static on the 3rd usage and 
> beyond (when running on Zeppelin or Jupyter).
> Additionally, caching only caused 2 dataframes to cache skipping the 3rd. 
> However,
> {code:java}
> val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache
> df.count
> // this can be run 3 times no issue.
> // then later cast to TimestampType{code}
> doesn't have this problem and all 3 dataframes cache with correct times 
> displaying.
> Running the code in shell and Jupyter or Zeppelin (ZP) also produces 
> different results. In the shell, you only get 1 unique time no matter how 
> many times you run it, current_timestamp. However, in ZP or Jupyter I have 
> always received 2 unique times before it froze.
>  
> {code:java}
> val df1 = spark.range(1).select(current_timestamp as "datetime").cache
> df1.count
> df1.show(false)
> Thread.sleep(9500)
> val df2 = spark.range(1).select(current_timestamp as "datetime").cache
> df2.count 
> df2.show(false)
> Thread.sleep(9500)
> val df3 = spark.range(1).select(current_timestamp as "datetime").cache 
> df3.count 
> df3.show(false){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls

2020-09-28 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203345#comment-17203345
 ] 

Pablo Langa Blanco commented on SPARK-32046:


[~smilegator] why are we computing the current time in the optimizer? I have seen 
this comment in the Optimizer code:
{code:java}
// Technically some of the rules in Finish Analysis are not optimizer rules and 
belong more
// in the analyzer, because they are needed for correctness (e.g. 
ComputeCurrentTime).
// However, because we also use the analyzer to canonicalized queries (for view 
definition),
// we do not eliminate subqueries or compute current time in the analyzer.
{code}
For the specific case of caching, shouldn't two plans that contain a 
current_timestamp be considered different plans, and therefore have different 
canonicalized plans?

I assume there is a strong reason for the current design, but in this case it 
results in strange behavior.

Thanks!

 

 

> current_timestamp called in a cache dataframe freezes the time for all future 
> calls
> ---
>
> Key: SPARK-32046
> URL: https://issues.apache.org/jira/browse/SPARK-32046
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.4, 3.0.0
>Reporter: Dustin Smith
>Priority: Minor
>  Labels: caching, sql, time
>
> If I call current_timestamp 3 times while caching the dataframe variable in 
> order to freeze that dataframe's time, the 3rd dataframe time and beyond 
> (4th, 5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe 
> and the 2nd will differ in time but will become static on the 3rd usage and 
> beyond (when running on Zeppelin or Jupyter).
> Additionally, caching only caused 2 dataframes to cache skipping the 3rd. 
> However,
> {code:java}
> val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache
> df.count
> // this can be run 3 times no issue.
> // then later cast to TimestampType{code}
> doesn't have this problem and all 3 dataframes cache with correct times 
> displaying.
> Running the code in shell and Jupyter or Zeppelin (ZP) also produces 
> different results. In the shell, you only get 1 unique time no matter how 
> many times you run it, current_timestamp. However, in ZP or Jupyter I have 
> always received 2 unique times before it froze.
>  
> {code:java}
> val df1 = spark.range(1).select(current_timestamp as "datetime").cache
> df1.count
> df1.show(false)
> Thread.sleep(9500)
> val df2 = spark.range(1).select(current_timestamp as "datetime").cache
> df2.count 
> df2.show(false)
> Thread.sleep(9500)
> val df3 = spark.range(1).select(current_timestamp as "datetime").cache 
> df3.count 
> df3.show(false){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls

2020-09-28 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203040#comment-17203040
 ] 

Pablo Langa Blanco edited comment on SPARK-32046 at 9/28/20, 4:20 PM:
--

I have been looking into this and I think I understand the problem. What I don't 
understand is why Jupyter and ZP behave differently.

When a DataFrame is cached, the key in the map that stores the cached objects is a 
plan (df1.queryExecution.analyzed.canonicalized). Then, on the second execution, 
the check for whether the DataFrame is already cached is the following:
{code:java}
df1.queryExecution.analyzed.canonicalized == 
df2.queryExecution.analyzed.canonicalized{code}
In this case both canonicalized plans are the same, so Spark considers the 
DataFrame already cached and reuses it.

It seems a rather unusual real-life case to have two identical DataFrames that 
contain a timestamp, one cached and one not, and to not want to reuse the cached 
one.


was (Author: planga82):
I have been looking at the problem and I think I have understood the problem. 
What I don't understand is why Jupyter and ZP behave differently.

When a dataframe is cached, the key to the map that stores the cached objects 
is a plan. (df1.queryExecution.analyzed.canonicalized). Then, in the second 
execution, when you go to check if the dataframe is cached you do the following 
check. 

df1.queryExecution.analyzed.canonicalized == 
df2.queryExecution.analyzed.canonicalized

In this case, both execution plans are the same so it considers that it has the 
Dataframe cached and uses it

It seems a rather strange case in real life to have 2 identical Dataframes, one 
cached and one not, have a timestamp and do not want to reuse the cached 
Dataframe

> current_timestamp called in a cache dataframe freezes the time for all future 
> calls
> ---
>
> Key: SPARK-32046
> URL: https://issues.apache.org/jira/browse/SPARK-32046
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.4, 3.0.0
>Reporter: Dustin Smith
>Priority: Minor
>  Labels: caching, sql, time
>
> If I call current_timestamp 3 times while caching the dataframe variable in 
> order to freeze that dataframe's time, the 3rd dataframe time and beyond 
> (4th, 5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe 
> and the 2nd will differ in time but will become static on the 3rd usage and 
> beyond (when running on Zeppelin or Jupyter).
> Additionally, caching only caused 2 dataframes to cache skipping the 3rd. 
> However,
> {code:java}
> val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache
> df.count
> // this can be run 3 times no issue.
> // then later cast to TimestampType{code}
> doesn't have this problem and all 3 dataframes cache with correct times 
> displaying.
> Running the code in shell and Jupyter or Zeppelin (ZP) also produces 
> different results. In the shell, you only get 1 unique time no matter how 
> many times you run it, current_timestamp. However, in ZP or Jupyter I have 
> always received 2 unique times before it froze.
>  
> {code:java}
> val df1 = spark.range(1).select(current_timestamp as "datetime").cache
> df1.count
> df1.show(false)
> Thread.sleep(9500)
> val df2 = spark.range(1).select(current_timestamp as "datetime").cache
> df2.count 
> df2.show(false)
> Thread.sleep(9500)
> val df3 = spark.range(1).select(current_timestamp as "datetime").cache 
> df3.count 
> df3.show(false){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls

2020-09-28 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203159#comment-17203159
 ] 

Pablo Langa Blanco commented on SPARK-32046:


Ok, I see, it makes sense. I'll think about a possible solution, but it seems 
difficult.

A possible workaround is to cache a DataFrame with a different plan each time. 
I am thinking of something like this:

 
{code:java}
val df1 = spark.range(1).select(current_timestamp as 
"datetime").withColumn("name", lit("df1"))
df1.cache
val df2 = spark.range(1).select(current_timestamp as 
"datetime").withColumn("name", lit("df2")){code}
 

> current_timestamp called in a cache dataframe freezes the time for all future 
> calls
> ---
>
> Key: SPARK-32046
> URL: https://issues.apache.org/jira/browse/SPARK-32046
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.4, 3.0.0
>Reporter: Dustin Smith
>Priority: Minor
>  Labels: caching, sql, time
>
> If I call current_timestamp 3 times while caching the dataframe variable in 
> order to freeze that dataframe's time, the 3rd dataframe time and beyond 
> (4th, 5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe 
> and the 2nd will differ in time but will become static on the 3rd usage and 
> beyond (when running on Zeppelin or Jupyter).
> Additionally, caching only caused 2 dataframes to cache skipping the 3rd. 
> However,
> {code:java}
> val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache
> df.count
> // this can be run 3 times no issue.
> // then later cast to TimestampType{code}
> doesn't have this problem and all 3 dataframes cache with correct times 
> displaying.
> Running the code in shell and Jupyter or Zeppelin (ZP) also produces 
> different results. In the shell, you only get 1 unique time no matter how 
> many times you run it, current_timestamp. However, in ZP or Jupyter I have 
> always received 2 unique times before it froze.
>  
> {code:java}
> val df1 = spark.range(1).select(current_timestamp as "datetime").cache
> df1.count
> df1.show(false)
> Thread.sleep(9500)
> val df2 = spark.range(1).select(current_timestamp as "datetime").cache
> df2.count 
> df2.show(false)
> Thread.sleep(9500)
> val df3 = spark.range(1).select(current_timestamp as "datetime").cache 
> df3.count 
> df3.show(false){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls

2020-09-28 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203040#comment-17203040
 ] 

Pablo Langa Blanco commented on SPARK-32046:


I have been looking into this and I think I understand the problem. What I don't 
understand is why Jupyter and ZP behave differently.

When a DataFrame is cached, the key in the map that stores the cached objects is a 
plan (df1.queryExecution.analyzed.canonicalized). Then, on the second execution, 
the check for whether the DataFrame is already cached is the following:

df1.queryExecution.analyzed.canonicalized == 
df2.queryExecution.analyzed.canonicalized

In this case both canonicalized plans are the same, so Spark considers the 
DataFrame already cached and reuses it.

It seems a rather unusual real-life case to have two identical DataFrames that 
contain a timestamp, one cached and one not, and to not want to reuse the cached 
one.

> current_timestamp called in a cache dataframe freezes the time for all future 
> calls
> ---
>
> Key: SPARK-32046
> URL: https://issues.apache.org/jira/browse/SPARK-32046
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.4, 3.0.0
>Reporter: Dustin Smith
>Priority: Minor
>  Labels: caching, sql, time
>
> If I call current_timestamp 3 times while caching the dataframe variable in 
> order to freeze that dataframe's time, the 3rd dataframe time and beyond 
> (4th, 5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe 
> and the 2nd will differ in time but will become static on the 3rd usage and 
> beyond (when running on Zeppelin or Jupyter).
> Additionally, caching only caused 2 dataframes to cache skipping the 3rd. 
> However,
> {code:java}
> val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache
> df.count
> // this can be run 3 times no issue.
> // then later cast to TimestampType{code}
> doesn't have this problem and all 3 dataframes cache with correct times 
> displaying.
> Running the code in shell and Jupyter or Zeppelin (ZP) also produces 
> different results. In the shell, you only get 1 unique time no matter how 
> many times you run it, current_timestamp. However, in ZP or Jupyter I have 
> always received 2 unique times before it froze.
>  
> {code:java}
> val df1 = spark.range(1).select(current_timestamp as "datetime").cache
> df1.count
> df1.show(false)
> Thread.sleep(9500)
> val df2 = spark.range(1).select(current_timestamp as "datetime").cache
> df2.count 
> df2.show(false)
> Thread.sleep(9500)
> val df3 = spark.range(1).select(current_timestamp as "datetime").cache 
> df3.count 
> df3.show(false){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32463) Document Data Type inference rule in SQL reference

2020-07-28 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166168#comment-17166168
 ] 

Pablo Langa Blanco commented on SPARK-32463:


Thanks [~huaxingao], I'll take a look at this

> Document Data Type inference rule in SQL reference
> --
>
> Key: SPARK-32463
> URL: https://issues.apache.org/jira/browse/SPARK-32463
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> Document Data Type inference rule in SQL reference, under Data Types section. 
> Please see this PR https://github.com/apache/spark/pull/28896



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32025) CSV schema inference with boolean & integer

2020-06-20 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17141213#comment-17141213
 ] 

Pablo Langa Blanco commented on SPARK-32025:


I'm looking into the problem. As a workaround, you can provide the schema 
explicitly to avoid the bug in automatic schema inference; a minimal sketch is 
below.
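
The sketch (Scala, equivalent to the pyspark call in the report; the column name 
col1 and the path are taken from the example files):
{code:java}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Declare col1 as STRING up front so the mixed integer/boolean values load as-is
// and no type is inferred from the data.
val schema = StructType(Seq(StructField("col1", StringType, nullable = true)))

val df = spark.read
  .option("header", "true")
  .option("multiLine", "true")
  .schema(schema) // no inference, so the multiLine inference issue is not triggered
  .csv("file:///example/*.csv")

df.show()
{code}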

> CSV schema inference with boolean & integer 
> 
>
> Key: SPARK-32025
> URL: https://issues.apache.org/jira/browse/SPARK-32025
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6
>Reporter: Brian Wallace
>Priority: Major
>
> I have a dataset consisting of two small files in CSV format. 
> {code:bash}
> $ cat /example/f0.csv
> col1
> 8589934592
> $ cat /example/f1.csv
> col1
> 4320
> true
> {code}
>  
> When I try and load this in (py)spark and infer schema, my expectation is 
> that the column is inferred to be a string. However, it is inferred as a 
> boolean:
> {code:python}
> spark.read.csv(path="file:///example/*.csv", header=True, inferSchema=True, 
> multiLine=True).show()
> ++
> |col1|
> ++
> |null|
> |true|
> |null|
> ++
> {code}
> Note that this seems to work correctly if multiLine is set to False (although 
> we need to set it to True as this column may indeed span multiple lines in 
> general).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31779) Redefining struct inside array incorrectly wraps child fields in array

2020-05-29 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119319#comment-17119319
 ] 

Pablo Langa Blanco commented on SPARK-31779:


How about using arrays_zip?
{code:java}
// code placeholder
val newTop = struct(df("top").getField("child1").alias("child1"), 
arrays_zip(df("top").getField("child2").getField("child2Second"),df("top").getField("child2").getField("child2Third")).alias("child2"))
{code}

> Redefining struct inside array incorrectly wraps child fields in array
> --
>
> Key: SPARK-31779
> URL: https://issues.apache.org/jira/browse/SPARK-31779
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Jeff Evans
>Priority: Major
>
> It seems that redefining a {{struct}} for the purpose of removing a 
> sub-field, when that {{struct}} is itself inside an {{array}}, results in the 
> remaining (non-removed) {{struct}} fields themselves being incorrectly 
> wrapped in an array.
> For more context, see [this|https://stackoverflow.com/a/46084983/375670] 
> StackOverflow answer and discussion thread.  I have debugged this code and 
> distilled it down to what I believe represents a bug in Spark itself.
> Consider the following {{spark-shell}} session (version 2.4.5):
> {code}
> // use a nested JSON structure that contains a struct inside an array
> val jsonData = """{
>   "foo": "bar",
>   "top": {
> "child1": 5,
> "child2": [
>   {
> "child2First": "one",
> "child2Second": 2
>   }
> ]
>   }
> }"""
> // read into a DataFrame
> val df = spark.read.option("multiline", "true").json(Seq(jsonData).toDS())
> // create a new definition for "top", which will remove the 
> "top.child2.child2First" column
> val newTop = struct(df("top").getField("child1").alias("child1"), 
> array(struct(df("top").getField("child2").getField("child2Second").alias("child2Second"))).alias("child2"))
> // show the schema before and after swapping out the struct definition
> df.schema.toDDL
> // `foo` STRING,`top` STRUCT<`child1`: BIGINT, `child2`: 
> ARRAY<STRUCT<`child2First`: STRING, `child2Second`: BIGINT>>>
> df.withColumn("top", newTop).schema.toDDL
> // `foo` STRING,`top` STRUCT<`child1`: BIGINT, `child2`: 
> ARRAY<STRUCT<`child2Second`: ARRAY<BIGINT>>>>
> {code}
> Notice in this case that the new definition for {{top.child2.child2Second}} 
> is an {{ARRAY<BIGINT>}}.  This is incorrect; it should simply be {{BIGINT}}.  
> There is nothing in the definition of the {{newTop}} {{struct}} that should 
> have caused the type to become wrapped in an array like this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31779) Redefining struct inside array incorrectly wraps child fields in array

2020-05-23 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17114866#comment-17114866
 ] 

Pablo Langa Blanco commented on SPARK-31779:


Hi [~jeff.w.evans] 

I think this is not a bug and this is the correct behavior. 

Your input has this structure:

 
{code:java}
root
 |-- foo: string (nullable = true)
 |-- top: struct (nullable = true)
 |    |-- child1: long (nullable = true)
 |    |-- child2: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- child2First: string (nullable = true)
 |    |    |    |-- child2Second: long (nullable = true)
{code}
When you do this df("top").getField("child2") you have an array element than 
have an struct inside

 
{code:java}
child2: array (nullable = true)
|-- element: struct (containsNull = true)
|    |-- child2First: string (nullable = true)
|    |-- child2Second: long (nullable = true)
{code}
When you do .getField("child2Second") over this structure you are accessing to 
the elements of the internal struct, but this is in an array so the return is 
an array with the element that you have selected

 
{code:java}
child2: array (nullable = true)
|-- child2Second: long (nullable = true)
{code}
 

I think this example makes it clearer:

 
{code:java}
val jsonData = """{
 |   "foo": "bar",
 |   "top": {
 |     "child1": 5,
 |     "child2": [
 |       {
 |         "child2First": "one",
 |         "child2Second": 2
 |       },
 |       {
 |         "child2First": "",
 |         "child2Second": 3
 |       }
 |     ]
 |   }
 | }"""
val df = spark.read.option("multiline", "true").json(Seq(jsonData).toDS())
val newTop = df("top").getField("child2").getField("child2Second")
df.withColumn("top2", newTop).printSchema
root
 |-- foo: string (nullable = true)
 |-- top: struct (nullable = true)
 |    |-- child1: long (nullable = true)
 |    |-- child2: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- child2First: string (nullable = true)
 |    |    |    |-- child2Second: long (nullable = true)
 |-- top2: array (nullable = true)
 |    |-- element: long (containsNull = true)
df.withColumn("top2", newTop).show(truncate=false)
+---+--+--+
|foo|top                       |top2  |
+---+--+--+
|bar|[5, [[one, 2], [, 3]]]|[2, 3]|
+---+--+--+
{code}
 

Looking at the code, it is explicitly designed that way, so if you agree I think 
we should close the issue.
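
That said, if the goal is just to drop child2First without the extra level of 
nesting, one alternative is to rebuild the inner structs element by element. This 
is only a sketch (it uses the SQL higher-order function transform, available since 
Spark 2.4, rather than the original struct/array approach):
{code:java}
import org.apache.spark.sql.functions.{expr, struct}

// Rebuild each element of top.child2 keeping only child2Second, so its type
// stays BIGINT instead of being wrapped in an extra array.
val newTop2 = struct(
  df("top").getField("child1").alias("child1"),
  expr("transform(top.child2, x -> named_struct('child2Second', x.child2Second))").alias("child2")
)

df.withColumn("top", newTop2).schema.toDDL
// roughly: `foo` STRING,`top` STRUCT<`child1`: BIGINT, `child2`: ARRAY<STRUCT<`child2Second`: BIGINT>>>
{code}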

> Redefining struct inside array incorrectly wraps child fields in array
> --
>
> Key: SPARK-31779
> URL: https://issues.apache.org/jira/browse/SPARK-31779
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Jeff Evans
>Priority: Major
>
> It seems that redefining a {{struct}} for the purpose of removing a 
> sub-field, when that {{struct}} is itself inside an {{array}}, results in the 
> remaining (non-removed) {{struct}} fields themselves being incorrectly 
> wrapped in an array.
> For more context, see [this|https://stackoverflow.com/a/46084983/375670] 
> StackOverflow answer and discussion thread.  I have debugged this code and 
> distilled it down to what I believe represents a bug in Spark itself.
> Consider the following {{spark-shell}} session (version 2.4.5):
> {code}
> // use a nested JSON structure that contains a struct inside an array
> val jsonData = """{
>   "foo": "bar",
>   "top": {
> "child1": 5,
> "child2": [
>   {
> "child2First": "one",
> "child2Second": 2
>   }
> ]
>   }
> }"""
> // read into a DataFrame
> val df = spark.read.option("multiline", "true").json(Seq(jsonData).toDS())
> // create a new definition for "top", which will remove the 
> "top.child2.child2First" column
> val newTop = struct(df("top").getField("child1").alias("child1"), 
> array(struct(df("top").getField("child2").getField("child2Second").alias("child2Second"))).alias("child2"))
> // show the schema before and after swapping out the struct definition
> df.schema.toDDL
> // `foo` STRING,`top` STRUCT<`child1`: BIGINT, `child2`: 
> ARRAY<STRUCT<`child2First`: STRING, `child2Second`: BIGINT>>>
> df.withColumn("top", newTop).schema.toDDL
> // `foo` STRING,`top` STRUCT<`child1`: BIGINT, `child2`: 
> ARRAY<STRUCT<`child2Second`: ARRAY<BIGINT>>>>
> {code}
> Notice in this case that the new definition for {{top.child2.child2Second}} 
> is an {{ARRAY<BIGINT>}}.  This is incorrect; it should simply be {{BIGINT}}.  
> There is nothing in the definition of the {{newTop}} {{struct}} that should 
> have caused the type to become wrapped in an array like this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Issue Comment Deleted] (SPARK-31561) Add QUALIFY Clause

2020-05-10 Thread Pablo Langa Blanco (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pablo Langa Blanco updated SPARK-31561:
---
Comment: was deleted

(was: I'm working on this)

> Add QUALIFY Clause
> --
>
> Key: SPARK-31561
> URL: https://issues.apache.org/jira/browse/SPARK-31561
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> In a SELECT statement, the QUALIFY clause filters the results of window 
> functions.
> QUALIFY does with window functions what HAVING does with aggregate functions 
> and GROUP BY clauses.
> In the execution order of a query, QUALIFY is therefore evaluated after 
> window functions are computed.
> Examples:
> https://docs.snowflake.com/en/sql-reference/constructs/qualify.html#examples
> More details:
> https://docs.snowflake.com/en/sql-reference/constructs/qualify.html
> https://docs.teradata.com/reader/2_MC9vCtAJRlKle2Rpb0mA/19NnI91neorAi7LX6SJXBw



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31561) Add QUALIFY Clause

2020-05-01 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17097360#comment-17097360
 ] 

Pablo Langa Blanco commented on SPARK-31561:


I'm working on this

> Add QUALIFY Clause
> --
>
> Key: SPARK-31561
> URL: https://issues.apache.org/jira/browse/SPARK-31561
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> In a SELECT statement, the QUALIFY clause filters the results of window 
> functions.
> QUALIFY does with window functions what HAVING does with aggregate functions 
> and GROUP BY clauses.
> In the execution order of a query, QUALIFY is therefore evaluated after 
> window functions are computed.
> Examples:
> https://docs.snowflake.com/en/sql-reference/constructs/qualify.html#examples
> More details:
> https://docs.snowflake.com/en/sql-reference/constructs/qualify.html
> https://docs.teradata.com/reader/2_MC9vCtAJRlKle2Rpb0mA/19NnI91neorAi7LX6SJXBw



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31574) Schema evolution in spark while using the storage format as parquet

2020-04-30 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096805#comment-17096805
 ] 

Pablo Langa Blanco commented on SPARK-31574:


Spark has functionality to read multiple parquet files with different 
*compatible* schemas, and it is disabled by default.

[https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#schema-merging]

The problem in the example you propose is that int and string are incompatible 
data types, so schema merging is not going to work.
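
For reference, a minimal sketch of what *compatible* means here (hypothetical /tmp 
paths): schemas that differ by added columns merge fine, while a changed type of 
an existing column does not.
{code:java}
// Two parquet directories whose schemas differ only by an added column can be
// read together once mergeSchema is enabled.
spark.range(3).selectExpr("id AS c1").write.parquet("/tmp/merge_demo/key=1")
spark.range(3).selectExpr("id AS c1", "id * 2 AS c2").write.parquet("/tmp/merge_demo/key=2")

val merged = spark.read.option("mergeSchema", "true").parquet("/tmp/merge_demo")
merged.printSchema()
// root
//  |-- c1: long (nullable = true)
//  |-- c2: long (nullable = true)
//  |-- key: integer (nullable = true)
{code}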

> Schema evolution in spark while using the storage format as parquet
> ---
>
> Key: SPARK-31574
> URL: https://issues.apache.org/jira/browse/SPARK-31574
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: sharad Gupta
>Priority: Major
>
> Hi Team,
>  
> Use case:
> Suppose there is a table T1 with column C1 with datatype as int in schema 
> version 1. In the first on boarding table T1. I wrote couple of parquet files 
> with this schema version 1 with underlying file format used parquet.
> Now in schema version 2 the C1 column datatype changed to string from int. 
> Now It will write data with schema version 2 in parquet.
> So some parquet files are written with schema version 1 and some written with 
> schema version 2.
> Problem statement :
> 1. We are not able to execute the below command from spark sql
> ```Alter table Table T1 change C1 C1 string```
> 2. So as a solution i goto hive and alter the table change datatype because 
> it supported in hive then try to read the data in spark. So it is giving me 
> error
> ```
> Caused by: java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
>   at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:44)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToBinary(ParquetDictionary.java:51)
>   at 
> org.apache.spark.sql.execution.vectorized.WritableColumnVector.getUTF8String(WritableColumnVector.java:372)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)```
>  
> 3. Suspecting that the underlying parquet file is written with integer type 
> and we are reading from a table whose column is changed to string type. So 
> that is why it is happening.
> How you can reproduce this:
> spark sql
> 1. Create a table from spark sql with one column with datatype as int with 
> stored as parquet.
> 2. Now put some data into table.
> 3. Now you can see the data if you select from table.
> Hive
> 1. change datatype from int to string by alter command
> 2. Now try to read data, You will be able to read the data here even after 
> changing the datatype.
> spark sql 
> 1. Try to read data from here now you will see the error.
> Now the question is how to solve schema evolution in spark while using the 
> storage format as parquet.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31566) Add SQL Rest API Documentation

2020-04-30 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096188#comment-17096188
 ] 

Pablo Langa Blanco commented on SPARK-31566:


Ok, perfect, thank you!

> Add SQL Rest API Documentation
> --
>
> Key: SPARK-31566
> URL: https://issues.apache.org/jira/browse/SPARK-31566
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Eren Avsarogullari
>Priority: Major
>
> SQL Rest API exposes query execution metrics as Public API. Its documentation 
> will be useful for end-users. 
> {code:java}
> /applications/[app-id]/sql
> 1- A list of all queries for a given application.
> 2- ?details=[true|false (default)] lists metric details in addition to 
> queries details.
> 3- ?offset=[offset]&length=[len] lists queries in the given range.{code}
> {code:java}
> /applications/[app-id]/sql/[execution-id]
> 1- Details for the given query.
> 2- ?details=[true|false (default)] lists metric details in addition to given 
> query details.{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31566) Add SQL Rest API Documentation

2020-04-29 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17095881#comment-17095881
 ] 

Pablo Langa Blanco commented on SPARK-31566:


I'm taking a look at this. I think it is useful to add this documentation.

> Add SQL Rest API Documentation
> --
>
> Key: SPARK-31566
> URL: https://issues.apache.org/jira/browse/SPARK-31566
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Eren Avsarogullari
>Priority: Major
>
> SQL Rest API exposes query execution metrics as Public API. Its documentation 
> will be useful for end-users. 
> {code:java}
> /applications/[app-id]/sql
> 1- A list of all queries for a given application.
> 2- ?details=[true|false (default)] lists metric details in addition to 
> queries details.
> 3- ?offset=[offset]&length=[len] lists queries in the given range.{code}
> {code:java}
> /applications/[app-id]/sql/[execution-id]
> 1- Details for the given query.
> 2- ?details=[true|false (default)] lists metric details in addition to given 
> query details.{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31500) collect_set() of BinaryType returns duplicate elements

2020-04-26 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17092818#comment-17092818
 ] 

Pablo Langa Blanco commented on SPARK-31500:


[https://github.com/apache/spark/pull/28351]

> collect_set() of BinaryType returns duplicate elements
> --
>
> Key: SPARK-31500
> URL: https://issues.apache.org/jira/browse/SPARK-31500
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 2.4.5
>Reporter: Eric Wasserman
>Priority: Major
>
> The collect_set() aggregate function should produce a set of distinct 
> elements. When the column argument's type is BinayType this is not the case.
>  
> Example:
> {{import org.apache.spark.sql.functions._}}
>  {{import org.apache.spark.sql.expressions.Window}}
> {{case class R(id: String, value: String, bytes: Array[Byte])}}
>  {{def makeR(id: String, value: String) = R(id, value, value.getBytes)}}
>  {{val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), 
> makeR("b", "fish")).toDF()}}
>  
> {{// In the example below "bytesSet" erroneously has duplicates but 
> "stringSet" does not (as expected).}}
> {{df.agg(collect_set('value) as "stringSet", collect_set('bytes) as 
> "byteSet").show(truncate=false)}}
>  
> {{// The same problem is displayed when using window functions.}}
>  {{val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, 
> Window.unboundedFollowing)}}
>  {{val result = df.select(}}
>   collect_set('value).over(win) as "stringSet",
>   collect_set('bytes).over(win) as "bytesSet"
>  {{)}}
>  {{.select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", 
> size('bytesSet) as "bytesSetSize")}}
>  {{.show()}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31500) collect_set() of BinaryType returns duplicate elements

2020-04-24 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091833#comment-17091833
 ] 

Pablo Langa Blanco commented on SPARK-31500:


Hi [~ewasserman],

This is a Scala-level problem: equality between arrays does not behave as 
expected.

[https://blog.bruchez.name/2013/05/scala-array-comparison-without-phd.html]

I'm going to work on finding a solution, but here is a workaround: change the 
definition of the case class to use Seq instead of Array and it will work as 
expected.
{code:java}
case class R(id: String, value: String, bytes: Seq[Byte]){code}
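
To illustrate the underlying Scala behavior (plain REPL sketch, not 
Spark-specific): Array equality is reference equality, while Seq compares 
elements, which is why the Seq-based case class deduplicates correctly.
{code:java}
// Array '==' is reference equality, so two equal byte arrays are not considered equal:
Array[Byte](1, 2) == Array[Byte](1, 2)        // false
"cat".getBytes sameElements "cat".getBytes    // true (element-wise comparison)

// Seq '==' compares elements, which is what collect_set needs for deduplication:
Seq[Byte](1, 2) == Seq[Byte](1, 2)            // true
"cat".getBytes.toSeq == "cat".getBytes.toSeq  // true
{code}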
 

> collect_set() of BinaryType returns duplicate elements
> --
>
> Key: SPARK-31500
> URL: https://issues.apache.org/jira/browse/SPARK-31500
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 2.4.5
>Reporter: Eric Wasserman
>Priority: Major
>
> The collect_set() aggregate function should produce a set of distinct 
> elements. When the column argument's type is BinayType this is not the case.
>  
> Example:
> {{import org.apache.spark.sql.functions._}}
>  {{import org.apache.spark.sql.expressions.Window}}
> {{case class R(id: String, value: String, bytes: Array[Byte])}}
>  {{def makeR(id: String, value: String) = R(id, value, value.getBytes)}}
>  {{val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), 
> makeR("b", "fish")).toDF()}}
>  
> {{// In the example below "bytesSet" erroneously has duplicates but 
> "stringSet" does not (as expected).}}
> {{df.agg(collect_set('value) as "stringSet", collect_set('bytes) as 
> "byteSet").show(truncate=false)}}
>  
> {{// The same problem is displayed when using window functions.}}
>  {{val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, 
> Window.unboundedFollowing)}}
>  {{val result = df.select(}}
>   collect_set('value).over(win) as "stringSet",
>   collect_set('bytes).over(win) as "bytesSet"
>  {{)}}
>  {{.select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", 
> size('bytesSet) as "bytesSetSize")}}
>  {{.show()}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31435) Add SPARK_JARS_DIR enviroment variable (new) to Spark configuration documentation

2020-04-13 Thread Pablo Langa Blanco (Jira)
Pablo Langa Blanco created SPARK-31435:
--

 Summary: Add SPARK_JARS_DIR enviroment variable (new) to Spark 
configuration documentation
 Key: SPARK-31435
 URL: https://issues.apache.org/jira/browse/SPARK-31435
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 3.1.0
Reporter: Pablo Langa Blanco


Related with SPARK-31432

That issue introduces new environment variable that is documented in this issue



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31435) Add SPARK_JARS_DIR enviroment variable (new) to Spark configuration documentation

2020-04-13 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17082233#comment-17082233
 ] 

Pablo Langa Blanco commented on SPARK-31435:


I'm working on this

> Add SPARK_JARS_DIR enviroment variable (new) to Spark configuration 
> documentation
> -
>
> Key: SPARK-31435
> URL: https://issues.apache.org/jira/browse/SPARK-31435
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.1.0
>Reporter: Pablo Langa Blanco
>Priority: Minor
>
> Related with SPARK-31432
> That issue introduces new environment variable that is documented in this 
> issue



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30077) create TEMPORARY VIEW USING should look up catalog/table like v2 commands

2019-12-13 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16995945#comment-16995945
 ] 

Pablo Langa Blanco commented on SPARK-30077:


[~huaxingao] Can we close this ticket? Reading the comments, I understand we are 
not going to make this change. Thanks

> create TEMPORARY VIEW USING should look up catalog/table like v2 commands
> -
>
> Key: SPARK-30077
> URL: https://issues.apache.org/jira/browse/SPARK-30077
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Priority: Major
>
> create TEMPORARY VIEW USING should look up catalog/table like v2 commands



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29563) CREATE TABLE LIKE should look up catalog/table like v2 commands

2019-12-13 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16995942#comment-16995942
 ] 

Pablo Langa Blanco commented on SPARK-29563:


[~dkbiswal] Are you still working on this? If not, I can take it over. Thanks

> CREATE TABLE LIKE should look up catalog/table like v2 commands
> ---
>
> Key: SPARK-29563
> URL: https://issues.apache.org/jira/browse/SPARK-29563
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30040) DROP FUNCTION should look up catalog/table like v2 commands

2019-11-25 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16982215#comment-16982215
 ] 

Pablo Langa Blanco commented on SPARK-30040:


I'm working on this

> DROP FUNCTION should look up catalog/table like v2 commands
> ---
>
> Key: SPARK-30040
> URL: https://issues.apache.org/jira/browse/SPARK-30040
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Pablo Langa Blanco
>Priority: Major
>
> DROP FUNCTION should look up catalog/table like v2 commands



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30040) DROP FUNCTION should look up catalog/table like v2 commands

2019-11-25 Thread Pablo Langa Blanco (Jira)
Pablo Langa Blanco created SPARK-30040:
--

 Summary: DROP FUNCTION should look up catalog/table like v2 
commands
 Key: SPARK-30040
 URL: https://issues.apache.org/jira/browse/SPARK-30040
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Pablo Langa Blanco


DROP FUNCTION should look up catalog/table like v2 commands



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30039) CREATE FUNCTION should look up catalog/table like v2 commands

2019-11-25 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16982214#comment-16982214
 ] 

Pablo Langa Blanco commented on SPARK-30039:


I am working on this

>  CREATE FUNCTION should look up catalog/table like v2 commands
> --
>
> Key: SPARK-30039
> URL: https://issues.apache.org/jira/browse/SPARK-30039
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Pablo Langa Blanco
>Priority: Major
>
>  CREATE FUNCTION should look up catalog/table like v2 commands



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30039) CREATE FUNCTION should look up catalog/table like v2 commands

2019-11-25 Thread Pablo Langa Blanco (Jira)
Pablo Langa Blanco created SPARK-30039:
--

 Summary:  CREATE FUNCTION should look up catalog/table like v2 
commands
 Key: SPARK-30039
 URL: https://issues.apache.org/jira/browse/SPARK-30039
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Pablo Langa Blanco


 CREATE FUNCTION should look up catalog/table like v2 commands



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30038) DESCRIBE FUNCTION should look up catalog/table like v2 commands

2019-11-25 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16982213#comment-16982213
 ] 

Pablo Langa Blanco commented on SPARK-30038:


I am working on this

> DESCRIBE FUNCTION should look up catalog/table like v2 commands
> ---
>
> Key: SPARK-30038
> URL: https://issues.apache.org/jira/browse/SPARK-30038
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Pablo Langa Blanco
>Priority: Major
>
> DESCRIBE FUNCTION should look up catalog/table like v2 commands



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30038) DESCRIBE FUNCTION should look up catalog/table like v2 commands

2019-11-25 Thread Pablo Langa Blanco (Jira)
Pablo Langa Blanco created SPARK-30038:
--

 Summary: DESCRIBE FUNCTION should look up catalog/table like v2 
commands
 Key: SPARK-30038
 URL: https://issues.apache.org/jira/browse/SPARK-30038
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Pablo Langa Blanco


DESCRIBE FUNCTION should look up catalog/table like v2 commands



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29922) SHOW FUNCTIONS should look up catalog/table like v2 commands

2019-11-15 Thread Pablo Langa Blanco (Jira)
Pablo Langa Blanco created SPARK-29922:
--

 Summary: SHOW FUNCTIONS should look up catalog/table like v2 
commands
 Key: SPARK-29922
 URL: https://issues.apache.org/jira/browse/SPARK-29922
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Pablo Langa Blanco


SHOW FUNCTIONS should look up catalog/table like v2 commands



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29922) SHOW FUNCTIONS should look up catalog/table like v2 commands

2019-11-15 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975272#comment-16975272
 ] 

Pablo Langa Blanco commented on SPARK-29922:


I'm working on this

> SHOW FUNCTIONS should look up catalog/table like v2 commands
> 
>
> Key: SPARK-29922
> URL: https://issues.apache.org/jira/browse/SPARK-29922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Pablo Langa Blanco
>Priority: Major
>
> SHOW FUNCTIONS should look up catalog/table like v2 commands



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29829) SHOW TABLE EXTENDED should look up catalog/table like v2 commands

2019-11-10 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16971191#comment-16971191
 ] 

Pablo Langa Blanco commented on SPARK-29829:


I am working on this

> SHOW TABLE EXTENDED should look up catalog/table like v2 commands
> -
>
> Key: SPARK-29829
> URL: https://issues.apache.org/jira/browse/SPARK-29829
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Pablo Langa Blanco
>Priority: Major
>
> SHOW TABLE EXTENDED should look up catalog/table like v2 commands



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29829) SHOW TABLE EXTENDED should look up catalog/table like v2 commands

2019-11-10 Thread Pablo Langa Blanco (Jira)
Pablo Langa Blanco created SPARK-29829:
--

 Summary: SHOW TABLE EXTENDED should look up catalog/table like v2 
commands
 Key: SPARK-29829
 URL: https://issues.apache.org/jira/browse/SPARK-29829
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Pablo Langa Blanco


SHOW TABLE EXTENDED should look up catalog/table like v2 commands



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29523) SHOW COLUMNS should look up catalog/table like v2 commands

2019-10-20 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16955422#comment-16955422
 ] 

Pablo Langa Blanco commented on SPARK-29523:


I am working on this

> SHOW COLUMNS should look up catalog/table like v2 commands
> --
>
> Key: SPARK-29523
> URL: https://issues.apache.org/jira/browse/SPARK-29523
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Pablo Langa Blanco
>Priority: Major
>
> SHOW COLUMNS should look up catalog/table like v2 commands



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29523) SHOW COLUMNS should look up catalog/table like v2 commands

2019-10-20 Thread Pablo Langa Blanco (Jira)
Pablo Langa Blanco created SPARK-29523:
--

 Summary: SHOW COLUMNS should look up catalog/table like v2 commands
 Key: SPARK-29523
 URL: https://issues.apache.org/jira/browse/SPARK-29523
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Pablo Langa Blanco


SHOW COLUMNS should look up catalog/table like v2 commands



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29519) SHOW TBLPROPERTIES should look up catalog/table like v2 commands

2019-10-19 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16955207#comment-16955207
 ] 

Pablo Langa Blanco commented on SPARK-29519:


I am working on this

> SHOW TBLPROPERTIES should look up catalog/table like v2 commands
> 
>
> Key: SPARK-29519
> URL: https://issues.apache.org/jira/browse/SPARK-29519
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Pablo Langa Blanco
>Priority: Major
>
> SHOW TBLPROPERTIES should look up catalog/table like v2 commands



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29519) SHOW TBLPROPERTIES should look up catalog/table like v2 commands

2019-10-19 Thread Pablo Langa Blanco (Jira)
Pablo Langa Blanco created SPARK-29519:
--

 Summary: SHOW TBLPROPERTIES should look up catalog/table like v2 
commands
 Key: SPARK-29519
 URL: https://issues.apache.org/jira/browse/SPARK-29519
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Pablo Langa Blanco


SHOW TBLPROPERTIES should look up catalog/table like v2 commands



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29433) Web UI Stages table tooltip correction

2019-10-10 Thread Pablo Langa Blanco (Jira)
Pablo Langa Blanco created SPARK-29433:
--

 Summary: Web UI Stages table tooltip correction
 Key: SPARK-29433
 URL: https://issues.apache.org/jira/browse/SPARK-29433
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 3.0.0
Reporter: Pablo Langa Blanco


In the Web UI Stages table, the tooltips of the Input and Output columns are not 
correct.

Current tooltip messages: 
 * Bytes and records read from Hadoop or from Spark storage.
 * Bytes and records written to Hadoop.

These columns only show bytes, not records.

More information in the pull request.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29431) Improve Web UI / Sql tab visualization with cached dataframes.

2019-10-10 Thread Pablo Langa Blanco (Jira)
Pablo Langa Blanco created SPARK-29431:
--

 Summary: Improve Web UI / Sql tab visualization with cached 
dataframes.
 Key: SPARK-29431
 URL: https://issues.apache.org/jira/browse/SPARK-29431
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 3.0.0
Reporter: Pablo Langa Blanco


When the Spark plan contains a cached dataframe, the plan of the cached 
dataframe is not shown in the SQL tree.

More info in the pull request.
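
As a hypothetical reproduction sketch (my own example, not taken from the pull 
request): cache an intermediate dataframe and then query it, and check the SQL 
tab for the second query.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cached-plan-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Cache an intermediate dataframe; the plan that builds it is the part
// that the SQL tab currently does not display.
val base = Seq((1, "a"), (2, "b"), (3, "a")).toDF("id", "value")
  .filter($"id" > 0)
  .cache()
base.count() // materialize the cache

// Inspect this query in the Web UI SQL tab: the cached subtree appears only
// as an in-memory scan, without the plan that produced it.
base.groupBy($"value").count().show()
{code}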



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29019) Improve tooltip information in JDBC/ODBC Server tab

2019-09-08 Thread Pablo Langa Blanco (Jira)
Pablo Langa Blanco created SPARK-29019:
--

 Summary: Improve tooltip information in JDBC/ODBC Server tab
 Key: SPARK-29019
 URL: https://issues.apache.org/jira/browse/SPARK-29019
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 3.0.0
Reporter: Pablo Langa Blanco


Some of the columns of the JDBC/ODBC Server tab in the Web UI are hard to understand.

We documented them in SPARK-28373, but I think it is better to also have 
tooltips in the SQL statistics table to explain the columns.

More information in the pull request.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28372) Document Spark WEB UI

2019-09-02 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921026#comment-16921026
 ] 

Pablo Langa Blanco commented on SPARK-28372:


[~smilegator] We have a Streaming tab section in the documentation with very little 
content. Do you think we should open a new issue to complete it?

> Document Spark WEB UI
> -
>
> Key: SPARK-28372
> URL: https://issues.apache.org/jira/browse/SPARK-28372
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, Web UI
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> Spark web UIs are being used to monitor the status and resource consumption 
> of your Spark applications and clusters. However, we do not have the 
> corresponding document. It is hard for end users to use and understand them. 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28373) Document JDBC/ODBC Server page

2019-09-02 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921019#comment-16921019
 ] 

Pablo Langa Blanco commented on SPARK-28373:


[~podongfeng] OK, I can take care of it.

> Document JDBC/ODBC Server page
> --
>
> Key: SPARK-28373
> URL: https://issues.apache.org/jira/browse/SPARK-28373
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Web UI
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> !https://user-images.githubusercontent.com/5399861/60809590-9dcf2500-a1bd-11e9-826e-33729bb97daf.png|width=1720,height=503!
>  
> [https://github.com/apache/spark/pull/25062] added a new column CLOSE TIME 
> and EXECUTION TIME. It is hard to understand the difference. We need to 
> document them; otherwise, it is hard for end users to understand them
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28542) Document Stages page

2019-08-18 Thread Pablo Langa Blanco (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909985#comment-16909985
 ] 

Pablo Langa Blanco commented on SPARK-28542:


I want to contribute to this if you are not working on it.

[~podongfeng] 

 

 

> Document Stages page
> 
>
> Key: SPARK-28542
> URL: https://issues.apache.org/jira/browse/SPARK-28542
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Web UI
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28543) Document Spark Jobs page

2019-08-13 Thread Pablo Langa Blanco (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905988#comment-16905988
 ] 

Pablo Langa Blanco commented on SPARK-28543:


Hi [~podongfeng], I saw it when I opened the pull request yesterday. The new 
information is integrated into your main page.

Thanks!!

> Document Spark Jobs page
> 
>
> Key: SPARK-28543
> URL: https://issues.apache.org/jira/browse/SPARK-28543
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Web UI
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28543) Document Spark Jobs page

2019-08-03 Thread Pablo Langa Blanco (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899482#comment-16899482
 ] 

Pablo Langa Blanco commented on SPARK-28543:


Hi! I want to contribute to this.

> Document Spark Jobs page
> 
>
> Key: SPARK-28543
> URL: https://issues.apache.org/jira/browse/SPARK-28543
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Web UI
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-28585) Improve WebUI DAG information: Add extra info to rdd from spark plan

2019-08-03 Thread Pablo Langa Blanco (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pablo Langa Blanco closed SPARK-28585.
--

> Improve WebUI DAG information: Add extra info to rdd from spark plan
> 
>
> Key: SPARK-28585
> URL: https://issues.apache.org/jira/browse/SPARK-28585
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 3.0.0
>Reporter: Pablo Langa Blanco
>Priority: Minor
>
> The main improvement I want to achieve is to help developers explore 
> the DAG information in the Web UI in complex flows. 
> Sometimes it is very difficult to know which part of your Spark plan corresponds 
> to the DAG you are looking at.
>  
> This is an initial improvement in only one simple Spark plan type 
> (UnionExec). 
> If you consider it a good idea, I want to extend it to other Spark plans to 
> improve the visualization iteratively.
>  
> More info in the pull request 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28585) Improve WebUI DAG information: Add extra info to rdd from spark plan

2019-08-03 Thread Pablo Langa Blanco (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pablo Langa Blanco resolved SPARK-28585.

Resolution: Won't Fix

> Improve WebUI DAG information: Add extra info to rdd from spark plan
> 
>
> Key: SPARK-28585
> URL: https://issues.apache.org/jira/browse/SPARK-28585
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 3.0.0
>Reporter: Pablo Langa Blanco
>Priority: Minor
>
> The main improvement I want to achieve is to help developers explore 
> the DAG information in the Web UI in complex flows. 
> Sometimes it is very difficult to know which part of your Spark plan corresponds 
> to the DAG you are looking at.
>  
> This is an initial improvement in only one simple Spark plan type 
> (UnionExec). 
> If you consider it a good idea, I want to extend it to other Spark plans to 
> improve the visualization iteratively.
>  
> More info in the pull request 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28585) Improve WebUI DAG information: Add extra info to rdd from spark plan

2019-07-31 Thread Pablo Langa Blanco (JIRA)
Pablo Langa Blanco created SPARK-28585:
--

 Summary: Improve WebUI DAG information: Add extra info to rdd from 
spark plan
 Key: SPARK-28585
 URL: https://issues.apache.org/jira/browse/SPARK-28585
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, Web UI
Affects Versions: 3.0.0
Reporter: Pablo Langa Blanco


The main improvement I want to achieve is to help developers explore the 
DAG information in the Web UI in complex flows. 

Sometimes it is very difficult to know which part of your Spark plan corresponds 
to the DAG you are looking at.

This is an initial improvement in only one simple Spark plan type (UnionExec). 

If you consider it a good idea, I want to extend it to other Spark plans to 
improve the visualization iteratively.

More info in the pull request 
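
As a hypothetical illustration (my own example, not from the pull request) of 
why mapping the DAG back to the plan is hard, a simple union produces RDD nodes 
in the DAG that do not say which branch of the UnionExec they come from:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("union-dag-demo")
  .master("local[*]")
  .getOrCreate()

// Two branches feeding a single UnionExec node in the physical plan.
val df1 = spark.range(0, 1000).selectExpr("id", "id * 2 AS doubled")
val df2 = spark.range(1000, 2000).selectExpr("id", "id * 3 AS doubled")

// In the job's DAG visualization, the RDDs created by each branch look alike,
// so it is hard to tell which part of the plan they correspond to.
df1.union(df2).count()
{code}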



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26648) Noop Streaming Sink using DSV2

2019-01-17 Thread Pablo Langa Blanco (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16745506#comment-16745506
 ] 

Pablo Langa Blanco commented on SPARK-26648:


Duplicate of SPARK-26649?

> Noop Streaming Sink using DSV2
> --
>
> Key: SPARK-26648
> URL: https://issues.apache.org/jira/browse/SPARK-26648
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> Extend this noop data source to support a streaming sink that ignores all the 
> elements.  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24701) SparkMaster WebUI allow all appids to be shown in detail on port 4040 rather than different ports per app

2019-01-07 Thread Pablo Langa Blanco (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736366#comment-16736366
 ] 

Pablo Langa Blanco commented on SPARK-24701:


Hi [~toopt4]

I think you already have this functionality in the History Server. For each 
application Spark starts a new UI server (on a new port), whereas the History 
Server runs on a single port and monitors all the applications.

As you can see in [https://spark.apache.org/docs/latest/monitoring.html], "The 
history server displays both completed and incomplete Spark jobs" and "Incomplete 
applications are only updated intermittently" (see spark.history.fs.update.interval).

Could it be a solution for you?
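
A minimal sketch of the configuration involved (the directory path and the 
interval are illustrative assumptions, not recommended values):

{code:scala}
import org.apache.spark.sql.SparkSession

// Driver side: write event logs that the History Server can later read.
val spark = SparkSession.builder()
  .appName("history-server-demo")
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.dir", "hdfs:///spark-events") // same directory the History Server reads
  .getOrCreate()

// History Server side (spark-defaults.conf):
//   spark.history.fs.logDirectory    hdfs:///spark-events
//   spark.history.fs.update.interval 10s   (how often incomplete applications are refreshed)
{code}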

> SparkMaster WebUI allow all appids to be shown in detail on port 4040 rather 
> than different ports per app
> -
>
> Key: SPARK-24701
> URL: https://issues.apache.org/jira/browse/SPARK-24701
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: t oo
>Priority: Major
>  Labels: master, security, ui, web, web-ui
> Attachments: spark_ports.png
>
>
> Right now the detail for all application ids are shown on a diff port per app 
> id, ie. 4040, 4041, 4042...etc this is problematic for environments with 
> tight firewall settings. Proposing to allow 4040?appid=1,  4040?appid=2,  
> 4040?appid=3..etc for the master web ui just like what the History Web UI 
> does.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26518) UI Application Info Race Condition Can Throw NoSuchElement

2019-01-07 Thread Pablo Langa Blanco (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736087#comment-16736087
 ] 

Pablo Langa Blanco commented on SPARK-26518:


I was trying to find an easy solution to this but I couldn't find one. The problem 
is not only at that point: when you handle this error, there are other 
elements that are not loaded yet and fail too. Another thing I tried 
was to find a way to know whether the KVStore is loaded and show a message 
instead of the error, but it's not easy. So, as this issue has trivial priority, 
it's not reasonable to change a lot of things.
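
As a sketch of the kind of defensive check this would need (using the internal 
KVStore and ApplicationInfoWrapper types as I understand them; this is 
illustrative, not a patch):

{code:scala}
package org.apache.spark.status

import org.apache.spark.status.api.v1
import org.apache.spark.util.kvstore.KVStore

// Return None instead of calling next() on an empty iterator, which can happen
// for a short period after the UI is up but before the store is populated.
private[spark] object SafeAppStatus {
  def applicationInfo(store: KVStore): Option[v1.ApplicationInfo] = {
    val it = store.view(classOf[ApplicationInfoWrapper]).max(1).closeableIterator()
    try {
      if (it.hasNext) Some(it.next().info) else None
    } finally {
      it.close()
    }
  }
}
{code}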

> UI Application Info Race Condition Can Throw NoSuchElement
> --
>
> Key: SPARK-26518
> URL: https://issues.apache.org/jira/browse/SPARK-26518
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Russell Spitzer
>Priority: Trivial
>
> There is a slight race condition in the 
> [AppStatusStore|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusStore.scala#L39]
> which calls `next` on the returned store even if it is empty, which it can be 
> for a short period of time after the UI is up but before the store is 
> populated.
> {code}
> 
> 
> Error 500 Server Error
> 
> HTTP ERROR 500
> Problem accessing /jobs/. Reason:
> Server ErrorCaused 
> by:java.util.NoSuchElementException
> at java.util.Collections$EmptyIterator.next(Collections.java:4189)
> at 
> org.apache.spark.util.kvstore.InMemoryStore$InMemoryIterator.next(InMemoryStore.java:281)
> at 
> org.apache.spark.status.AppStatusStore.applicationInfo(AppStatusStore.scala:38)
> at org.apache.spark.ui.jobs.AllJobsPage.render(AllJobsPage.scala:275)
> at org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:86)
> at org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:86)
> at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
> at 
> org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:865)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:535)
> at 
> org.spark_project.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)
> at 
> org.spark_project.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
> at 
> org.spark_project.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)
> at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
> at 
> org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:724)
> at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
> at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
> at org.spark_project.jetty.server.Server.handle(Server.java:531)
> at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:352)
> at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
> at 
> org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
> at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:102)
> at 
> org.spark_project.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
> at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762)
> at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680)
> at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26457) Show hadoop configurations in HistoryServer environment tab

2019-01-07 Thread Pablo Langa Blanco (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736024#comment-16736024
 ] 

Pablo Langa Blanco commented on SPARK-26457:


OK, I didn't know you were working on it. I'm going to comment on your pull 
request.

Thanks!!

> Show hadoop configurations in HistoryServer environment tab
> ---
>
> Key: SPARK-26457
> URL: https://issues.apache.org/jira/browse/SPARK-26457
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2, 2.4.0
> Environment: Maybe it is good to show some configurations in 
> HistoryServer environment tab for debugging some bugs about hadoop
>Reporter: deshanxiao
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26518) UI Application Info Race Condition Can Throw NoSuchElement

2019-01-06 Thread Pablo Langa Blanco (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735135#comment-16735135
 ] 

Pablo Langa Blanco commented on SPARK-26518:


Hi

I'm looking into it. 

Pablo

> UI Application Info Race Condition Can Throw NoSuchElement
> --
>
> Key: SPARK-26518
> URL: https://issues.apache.org/jira/browse/SPARK-26518
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Russell Spitzer
>Priority: Trivial
>
> There is a slight race condition in the 
> [AppStatusStore|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusStore.scala#L39]
> which calls `next` on the returned store even if it is empty, which it can be 
> for a short period of time after the UI is up but before the store is 
> populated.
> {code}
> 
> 
> Error 500 Server Error
> 
> HTTP ERROR 500
> Problem accessing /jobs/. Reason:
> Server ErrorCaused 
> by:java.util.NoSuchElementException
> at java.util.Collections$EmptyIterator.next(Collections.java:4189)
> at 
> org.apache.spark.util.kvstore.InMemoryStore$InMemoryIterator.next(InMemoryStore.java:281)
> at 
> org.apache.spark.status.AppStatusStore.applicationInfo(AppStatusStore.scala:38)
> at org.apache.spark.ui.jobs.AllJobsPage.render(AllJobsPage.scala:275)
> at org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:86)
> at org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:86)
> at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
> at 
> org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:865)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:535)
> at 
> org.spark_project.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)
> at 
> org.spark_project.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
> at 
> org.spark_project.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)
> at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
> at 
> org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:724)
> at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
> at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
> at org.spark_project.jetty.server.Server.handle(Server.java:531)
> at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:352)
> at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
> at 
> org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
> at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:102)
> at 
> org.spark_project.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
> at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762)
> at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680)
> at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26457) Show hadoop configurations in HistoryServer environment tab

2019-01-05 Thread Pablo Langa Blanco (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734986#comment-16734986
 ] 

Pablo Langa Blanco commented on SPARK-26457:


Hi [~deshanxiao],

What configurations are you thinking about? Could you explain cases where this 
information would be relevant? I'm thinking of the case where you are working 
with YARN: YARN already has all the Hadoop information we could need about 
the Spark job, so you don't need it duplicated in the History Server. 

Thanks!

> Show hadoop configurations in HistoryServer environment tab
> ---
>
> Key: SPARK-26457
> URL: https://issues.apache.org/jira/browse/SPARK-26457
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2, 2.4.0
> Environment: Maybe it is good to show some configurations in 
> HistoryServer environment tab for debugging some bugs about hadoop
>Reporter: deshanxiao
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25917) Spark UI's executors page loads forever when memoryMetrics in None. Fix is to JSON ignore memorymetrics when it is None.

2019-01-05 Thread Pablo Langa Blanco (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734984#comment-16734984
 ] 

Pablo Langa Blanco commented on SPARK-25917:


The pull request was closed because the problem has already been solved, so the 
issue should be closed too.
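
For reference, a minimal sketch of the "ignore memoryMetrics in JSON when it is 
None" idea from the issue title, using simplified hypothetical types (the real 
ones live in org.apache.spark.status.api.v1) and assuming Jackson's Scala module 
is on the classpath; this is not necessarily how it was actually fixed:

{code:scala}
import com.fasterxml.jackson.annotation.JsonInclude
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

case class MemoryMetrics(usedOnHeapStorageMemory: Long, usedOffHeapStorageMemory: Long)

// NON_ABSENT drops Option fields that are None instead of serializing them as null.
@JsonInclude(JsonInclude.Include.NON_ABSENT)
case class ExecutorSummaryLite(id: String, memoryMetrics: Option[MemoryMetrics])

object JsonIgnoreNoneDemo extends App {
  val mapper = new ObjectMapper().registerModule(DefaultScalaModule)
  // Prints {"id":"driver"} -- memoryMetrics is omitted entirely.
  println(mapper.writeValueAsString(ExecutorSummaryLite("driver", None)))
  // Prints the metrics when they are present.
  println(mapper.writeValueAsString(ExecutorSummaryLite("1", Some(MemoryMetrics(1024L, 0L)))))
}
{code}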

> Spark UI's executors page loads forever when memoryMetrics in None. Fix is to 
> JSON ignore memorymetrics when it is None.
> 
>
> Key: SPARK-25917
> URL: https://issues.apache.org/jira/browse/SPARK-25917
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2
>Reporter: Rong Tang
>Priority: Major
>
> Spark UI's executors page loads forever when memoryMetrics in None. Fix is to 
> JSON ignore memorymetrics when it is None.
> ## How was this patch tested?
> Before fix: (loads forever)
> ![image](https://user-images.githubusercontent.com/1785565/47875681-64dfe480-ddd4-11e8-8d15-5ed1457bc24f.png)
> After fix:
> ![image](https://user-images.githubusercontent.com/1785565/47875691-6b6e5c00-ddd4-11e8-9895-db8dd9730ee1.png)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


