[jira] [Commented] (SPARK-36786) SPIP: Improving the compile time performance, by improving a couple of rules, from 24 hrs to under 8 minutes

2023-11-01 Thread Abhinav Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17781966#comment-17781966
 ] 

Abhinav Kumar commented on SPARK-36786:
---

[~ashahid7] [~adou...@sqli.com] where are we on this one?

> SPIP: Improving the compile time performance, by improving  a couple of 
> rules, from 24 hrs to under 8 minutes
> -
>
> Key: SPARK-36786
> URL: https://issues.apache.org/jira/browse/SPARK-36786
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.1, 3.1.2
>Reporter: Asif
>Priority: Major
>  Labels: SPIP
>
> h2. Q1. What are you trying to do? Articulate your objectives using 
> absolutely no jargon.
> The aim is to improve the compile time performance of a query which, in 
> WorkDay's use case, takes > 24 hrs (and eventually fails), to < 8 min.
> To explain the problem, I will provide the context.
> The query plan in our production system is huge, with nested *case when* 
> expressions (the level of nesting can be > 8), where each *case when* can 
> sometimes have > 1000 branches.
> The plan could look like
> {quote}Project1
>     |
>    Filter 1
>     |
> Project2
>     |
>  Filter2
>     |
>  Project3
>     |
>  Filter3
>   |
> Join
> {quote}
> Now the optimizer has a Batch of Rules, intended to run at most 100 times.
> *Also note that the batch will continue to run till one of the conditions 
> is satisfied, i.e. either numIter == 100 || inputPlan == outputPlan 
> (idempotency is achieved).*
> One of the early rules is *PushDownPredicate*, followed by *CollapseProject*.
>  
> The first issue is the *PushDownPredicate* rule.
> It picks one filter at a time and pushes it to the lowest level (I understand 
> that in 3.1 it pushes through a Join, while in 2.4 it stops at the Join), but 
> in either case it picks one filter at a time, starting from the top, in each 
> iteration.
> *The above comment is no longer true in the 3.1 release, as it now combines 
> filters, so it pushes all the encountered filters in a single pass. 
> But it still materializes the filter on each push by re-aliasing.*
> So if there are, say, 50 Projects interspersed with Filters, idempotency is 
> guaranteed not to be achieved until around 49 iterations. 
> Moreover, CollapseProject will also modify the tree on each iteration, as a 
> filter will get removed within a Project.
> Moreover, on each movement of a filter through the Project tree, the filter is 
> re-aliased using a transformUp rule. transformUp is very expensive compared to 
> transformDown, and as the filter keeps getting pushed down, its size increases.
> To optimize this rule, 2 things are needed:
>  # Instead of pushing one filter at a time, collect all the filters as we 
> traverse the tree in that iteration itself.
>  # Do not re-alias the filters on each push. Collect the sequence of Projects 
> each filter has passed through, and when the filters have reached their resting 
> place, do the re-aliasing by processing the collected Projects in a bottom-up 
> manner.
> This will result in achieving idempotency in a couple of iterations. 
> *How reducing the number of iterations helps performance*
> There are many rules like *NullPropagation, OptimizeIn, SimplifyConditionals 
> (... there are around 6 more such rules)* which traverse the tree using 
> transformUp, and they run unnecessarily in each iteration, even when the 
> expressions in an operator have not changed since the previous runs.
> *I have a different proposal, which I will share later, on how to stop the 
> above rules from running unnecessarily, if it can be guaranteed that the 
> expressions in the operator are not going to mutate.* 
> The cause of our huge compilation time has been identified as the above.
>   
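> For illustration, a minimal reproducer sketch of this shape of plan (not our 
> production query; it assumes a local SparkSession named {{spark}} and picks 
> arbitrary sizes) that stacks nested *case when* expressions under many 
> interleaved Project/Filter pairs and measures only the optimization time:
> {code:scala}
> import org.apache.spark.sql.functions._
> 
> val base = spark.range(0, 1000).toDF("id")
> 
> // a nested CASE WHEN expression, 8 levels deep
> val nestedCase = (1 to 8).foldLeft(lit(0)) { (acc, level) =>
>   when(col("id") % (level + 1) === 0, acc + level).otherwise(acc)
> }
> 
> // 50 Project/Filter pairs stacked on top of it
> val plan = (1 to 50).foldLeft(base.withColumn("c", nestedCase)) { (df, i) =>
>   df.select(col("id"), (col("c") + i).as("c")).filter(col("c") > i)
> }
> 
> // force analysis + optimization only (no execution) and time it
> val t0 = System.nanoTime()
> plan.queryExecution.optimizedPlan
> println(s"optimization took ${(System.nanoTime() - t0) / 1e9} s")
> {code}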
> h2. Q2. What problem is this proposal NOT designed to solve?
> It is not going to change any runtime profile.
> h2. Q3. How is it done today, and what are the limits of current practice?
> As mentioned above, currently PushDownPredicate pushes one filter at a 
> time and, at each Project, it materializes the re-aliased filter. This 
> results in a large number of iterations to achieve idempotency. The 
> immediate materialization of the Filter after each Project pass also results 
> in unnecessary traversals of the filter's expression tree, using transformUp, 
> and the filter's expression tree is bound to keep growing as it is pushed 
> down.
> h2. Q4. What is new in your approach and why do you think it will be 
> successful?
> In the new approach we push all the filters down in a single pass, and do not 
> materialize the filters as they pass through each Project. Instead we keep 
> collecting the Projects in sequential order and materialize the final filter 
> once it has reached its final position.
> 

[jira] [Commented] (SPARK-33164) SPIP: add SQL support to "SELECT * (EXCEPT someColumn) FROM .." equivalent to DataSet.dropColumn(someColumn)

2023-11-01 Thread Abhinav Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17781959#comment-17781959
 ] 

Abhinav Kumar commented on SPARK-33164:
---

I see value in some use cases like the ones [~arnaud.nauwynck] mentions. But 
"SELECT *" has well-documented risks, leading to maintainability issues. 
Should we still be trying to implement this?

> SPIP: add SQL support to "SELECT * (EXCEPT someColumn) FROM .." equivalent to 
> DataSet.dropColumn(someColumn)
> 
>
> Key: SPARK-33164
> URL: https://issues.apache.org/jira/browse/SPARK-33164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1
>Reporter: Arnaud Nauwynck
>Priority: Minor
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> *Q1.* What are you trying to do? Articulate your objectives using absolutely 
> no jargon.
> I would like to have the extended SQL syntax "SELECT * EXCEPT someColumn FROM 
> .." 
> to be able to select all columns except some in a SELECT clause.
> It would be similar to the SQL syntax of some databases, like Google BigQuery 
> or PostgreSQL.
> https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax
> Google "select * EXCEPT one column", and you will see that many 
> developers have the same problem.
> Example posts: 
> https://blog.jooq.org/2018/05/14/selecting-all-columns-except-one-in-postgresql/
> https://www.thetopsites.net/article/53001825.shtml
> There are several typical examples where it is very helpful:
> use-case 1:
>  you add a "count ( * )  countCol" column and then filter on it, using for 
> example "having countCol = 1", 
>   ... and then you want to select all columns EXCEPT this dummy column, which 
> is always "1"
> {noformat}
>   select * (EXCEPT countCol)
>   from (  
>  select count(*) countCol, * 
>from MyTable 
>where ... 
>group by ... having countCol = 1
>   )
> {noformat}
>
> use-case 2:
>  the same with an analytical function: "partition over(...) rankCol ... where 
> rankCol=1".
>  For example, to get the latest row before a given time in a time-series 
> table.
>  These are the "time-travel" queries addressed by frameworks like "DeltaLake".
> {noformat}
>  CREATE table t_updates (update_time timestamp, id string, col1 type1, col2 
> type2, ... col42)
>  pastTime=..
>  SELECT * (except rankCol)
>  FROM (
>SELECT *,
>   RANK() OVER (PARTITION BY id ORDER BY update_time) rankCol   
>FROM t_updates
>where update_time < pastTime
>  ) WHERE rankCol = 1
>  
> {noformat}
>  
> use-case 3:
>  copy some data from table "t" to corresponding table "t_snapshot", and back 
> to "t"
> {noformat}
>CREATE TABLE t (col1 type1, col2 type2, col3 type3, ... col42 type42) ...
>
>/* create corresponding table: (snap_id string, col1 type1, col2 type2, 
> col3 type3, ... col42 type42) */
>CREATE TABLE t_snapshot
>AS SELECT '' as snap_id, * FROM t WHERE 1=2
>/* insert data from t to some snapshot */
>INSERT INTO t_snapshot
>SELECT 'snap1' as snap_id, * from t 
>
>/* select some data from snapshot table (without snap_id column) .. */   
>SELECT * (EXCEPT snap_id) FROM t_snapshot where snap_id='snap1' 
>
> {noformat}
>
>
> *Q2.* What problem is this proposal NOT designed to solve?
> It is only SQL syntactic sugar. 
> It does not change the SQL execution plan or anything complex.
> *Q3.* How is it done today, and what are the limits of current practice?
>  
> Today, you can either use the DataSet API, with .dropColumn(someColumn),
> or you need to HARD-CODE all columns manually in your SQL. Therefore your 
> code is NOT generic (or you are using a SQL meta-code generator?).
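> As a hedged illustration of today's Dataset-side workaround (a sketch only; 
> it assumes a SparkSession {{spark}} and the {{t_snapshot}} table from 
> use-case 3, and uses the existing {{Dataset.drop}} method):
> {code:scala}
> import org.apache.spark.sql.functions.col
> 
> // select everything from the snapshot, then drop the helper column
> val snap = spark.table("t_snapshot").where(col("snap_id") === "snap1")
> val withoutSnapId = snap.drop("snap_id")             // today's equivalent of SELECT * (EXCEPT snap_id)
> withoutSnapId.createOrReplaceTempView("t_restored")  // visible to SQL again
> {code}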
> *Q4.* What is new in your approach and why do you think it will be successful?
> It is NOT new... it is already a proven solution from DataSet.dropColumn(), 
> PostgreSQL, and BigQuery.
>  
> *Q5.* Who cares? If you are successful, what difference will it make?
> It simplifies the life of developers, DBAs, data analysts, and end users.
> It simplifies the development of SQL code, in a more generic way, for many tasks.
> *Q6.* What are the risks?
> There is VERY limited risk on Spark SQL, because it already exists in the 
> DataSet API.
> It is an extension of SQL syntax, so the risk is annoying some IDE SQL 
> editors with a new SQL syntax. 
> *Q7.* How long will it take?
> No idea. I guess someone experienced in the Spark SQL internals might do it 
> relatively "quickly".
> It is a kind of syntactic sugar to add as an ANTLR grammar rule, then 
> transform into the DataSet API.
> *Q8.* What are the mid-term and final “exams” to check for success?
> The 3 standard use-cases given in question Q1.




[jira] [Commented] (SPARK-45023) SPIP: Python Stored Procedures

2023-10-22 Thread Abhinav Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17778411#comment-17778411
 ] 

Abhinav Kumar commented on SPARK-45023:
---

Based on this, I will put up another SPIP to create grouped SQL as a stored 
procedure and see what the community thinks.

> SPIP: Python Stored Procedures
> --
>
> Key: SPARK-45023
> URL: https://issues.apache.org/jira/browse/SPARK-45023
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> Stored procedures are an extension of the ANSI SQL standard. They play a 
> crucial role in improving the capabilities of SQL by encapsulating complex 
> logic into reusable routines. 
> This proposal aims to extend Spark SQL by introducing support for stored 
> procedures, starting with Python as the procedural language. This addition 
> will allow users to execute procedural programs, leveraging programming 
> constructs of Python to perform tasks with complex logic. Additionally, users 
> can persist these procedural routines in catalogs such as HMS for future 
> reuse. By providing this functionality, we intend to seamlessly empower Spark 
> users to integrate with Python routines within their SQL workflows.
> {*}SPIP{*}: 
> [https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/edit?usp=sharing]
>  






[jira] [Commented] (SPARK-44817) SPIP: Incremental Stats Collection

2023-10-18 Thread Abhinav Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17776976#comment-17776976
 ] 

Abhinav Kumar commented on SPARK-44817:
---

[~rakson] [~gurwls223] [~cloud_fan] - We find this issue quite common. 
Currently, incremental stats collection is done mostly outside the Spark 
application as an end-of-day process (to avoid SLA breaches), and sometimes 
within the current application if a DML materially changes the stats. This 
proposal seems like a good idea, considering users can control it via a Spark 
parameter.

Views?

> SPIP: Incremental Stats Collection
> --
>
> Key: SPARK-44817
> URL: https://issues.apache.org/jira/browse/SPARK-44817
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> Spark's Cost Based Optimizer depends on table and column statistics.
> After every execution of a DML query, table and column stats are invalidated if 
> auto-update of stats collection is not turned on. To keep stats updated we 
> need to run the `ANALYZE TABLE ... COMPUTE STATISTICS` command, which is very 
> expensive. It is not feasible to run this command after every DML query.
> Instead, we can incrementally update the stats during each DML query run 
> itself. This way our table and column stats would be fresh at all times 
> and CBO benefits can be applied. Initially, we can update only table-level 
> stats and gradually start updating column-level stats as well.
> *Pros:*
> 1. Optimizes queries over tables that are updated frequently.
> 2. Saves compute cycles by removing the dependency on `ANALYZE TABLE ... 
> COMPUTE STATISTICS` for updating stats.
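> For illustration, a hedged sketch of the manual path this proposal would make 
> unnecessary (it assumes a SparkSession {{spark}} and an existing table named 
> {{sales}} with matching columns; names are placeholders):
> {code:scala}
> import org.apache.spark.sql.functions.col
> 
> spark.sql("INSERT INTO sales VALUES (1, 100.0)")                    // DML leaves stats stale
> spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR ALL COLUMNS") // expensive full recompute
> spark.sql("DESCRIBE TABLE EXTENDED sales")
>   .filter(col("col_name") === "Statistics")                         // table-level size / row count
>   .show(truncate = false)
> {code}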
> [SPIP Document 
> |https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing]






[jira] [Commented] (SPARK-45023) SPIP: Python Stored Procedures

2023-10-18 Thread Abhinav Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17776963#comment-17776963
 ] 

Abhinav Kumar commented on SPARK-45023:
---

Not sure where we are with this. It looks like we are not progressing with it. I 
do see value in a SQL-based stored procedure (to begin with, just grouped SQLs): 
the user can reveal the intent of usage and Spark can optimize holistically. 
Should we discuss and modify the proposal accordingly? Please suggest.

> SPIP: Python Stored Procedures
> --
>
> Key: SPARK-45023
> URL: https://issues.apache.org/jira/browse/SPARK-45023
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> Stored procedures are an extension of the ANSI SQL standard. They play a 
> crucial role in improving the capabilities of SQL by encapsulating complex 
> logic into reusable routines. 
> This proposal aims to extend Spark SQL by introducing support for stored 
> procedures, starting with Python as the procedural language. This addition 
> will allow users to execute procedural programs, leveraging programming 
> constructs of Python to perform tasks with complex logic. Additionally, users 
> can persist these procedural routines in catalogs such as HMS for future 
> reuse. By providing this functionality, we intend to seamlessly empower Spark 
> users to integrate with Python routines within their SQL workflows.
> {*}SPIP{*}: 
> [https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/edit?usp=sharing]
>  






[jira] [Updated] (SPARK-45438) Decimal precision exceeds max precision error when using unary minus on min Decimal values on Scala 2.13 Spark

2023-10-06 Thread Navin Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Navin Kumar updated SPARK-45438:

Summary: Decimal precision exceeds max precision error when using unary 
minus on min Decimal values on Scala 2.13 Spark  (was: Decimal precision 
exceeds max precision error when using unary minus on min Decimal values)

> Decimal precision exceeds max precision error when using unary minus on min 
> Decimal values on Scala 2.13 Spark
> --
>
> Key: SPARK-45438
> URL: https://issues.apache.org/jira/browse/SPARK-45438
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2, 3.3.1, 3.2.3, 3.2.4, 3.3.3, 
> 3.3.2, 3.4.0, 3.4.1, 3.5.0
>Reporter: Navin Kumar
>Priority: Major
>  Labels: scala
>
> When submitting an application to Spark built with Scala 2.13, there are 
> issues with Decimal overflow that show up when using unary minus (and also 
> {{abs()}}, which uses unary minus under the hood).
> Here is an example PySpark reproducer:
> {code}
> from decimal import Decimal
> from pyspark.sql import SparkSession
> from pyspark.sql.types import StructType,StructField, DecimalType
> spark = SparkSession.builder \
>   .master("local[*]") \
>   .appName("decimal_precision") \
>   .config("spark.rapids.sql.explain", "ALL") \
>   .config("spark.sql.ansi.enabled", "true") \
>   .config("spark.sql.legacy.allowNegativeScaleOfDecimal", 'true') \
>   .getOrCreate()  
> precision = 38
> scale = 0
> DECIMAL_MIN = Decimal('-' + ('9' * precision) + 'e' + str(-scale))
> data = [[DECIMAL_MIN]]
> schema = StructType([
> StructField("a", DecimalType(precision, scale), True)])
> df = spark.createDataFrame(data=data, schema=schema)
> df.selectExpr("a", "-a").show()
> {code}
> This particular example will run successfully on Spark built with Scala 2.12, 
> but throws a java.math.ArithmeticException on Spark built with Scala 2.13. 
> If you change the value of {{DECIMAL_MIN}} in the previous code to something 
> just above the original DECIMAL_MIN, you will not get an exception thrown; 
> instead you will get an incorrect answer (possibly due to overflow):
> {code}
> ...
> DECIMAL_MIN = Decimal('-8' + ('9' * (precision-1)) + 'e' + str(-scale))
> ...
> {code} 
> Output:
> {code}
> +++
> |   a|   (- a)|
> +++
> |-8999...|9...|
> +++
> {code}
> It looks like the code in {{Decimal.scala}} uses {{scala.math.BigDecimal}}. 
> See https://github.com/scala/bug/issues/11590 for updates on how Scala 2.13 
> handles BigDecimal. It looks like a {{java.math.MathContext}} is missing 
> when performing these operations. 






[jira] [Created] (SPARK-45438) Decimal precision exceeds max precision error when using unary minus on min Decimal values

2023-10-06 Thread Navin Kumar (Jira)
Navin Kumar created SPARK-45438:
---

 Summary: Decimal precision exceeds max precision error when using 
unary minus on min Decimal values
 Key: SPARK-45438
 URL: https://issues.apache.org/jira/browse/SPARK-45438
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0, 3.4.1, 3.4.0, 3.3.2, 3.3.3, 3.2.4, 3.2.3, 3.3.1, 
3.2.2, 3.3.0, 3.2.1, 3.2.0
Reporter: Navin Kumar


When submitting an application to Spark built with Scala 2.13, there are issues 
with Decimal overflow that show up when using unary minus (and also {{abs()}}, 
which uses unary minus under the hood).

Here is an example PySpark reproducer:

{code}
from decimal import Decimal

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField, DecimalType

spark = SparkSession.builder \
  .master("local[*]") \
  .appName("decimal_precision") \
  .config("spark.rapids.sql.explain", "ALL") \
  .config("spark.sql.ansi.enabled", "true") \
  .config("spark.sql.legacy.allowNegativeScaleOfDecimal", 'true') \
  .getOrCreate()  

precision = 38
scale = 0
DECIMAL_MIN = Decimal('-' + ('9' * precision) + 'e' + str(-scale))

data = [[DECIMAL_MIN]]

schema = StructType([
StructField("a", DecimalType(precision, scale), True)])
df = spark.createDataFrame(data=data, schema=schema)

df.selectExpr("a", "-a").show()
{code}

This particular example will run successfully on Spark built with Scala 2.12, 
but throws a java.math.ArithmeticException on Spark built with Scala 2.13. 

If you change the value of {{DECIMAL_MIN}} in the previous code to something 
just above the original DECIMAL_MIN, you will not get an exception thrown; 
instead you will get an incorrect answer (possibly due to overflow):

{code}
...
DECIMAL_MIN = Decimal('-8' + ('9' * (precision-1)) + 'e' + str(-scale))
...
{code} 

Output:
{code}
+++
|   a|   (- a)|
+++
|-8999...|9...|
+++
{code}

It looks like the code in {{Decimal.scala}} uses {{scala.math.BigDecimal}}. See 
https://github.com/scala/bug/issues/11590 for updates on how Scala 2.13 
handles BigDecimal. It looks like a {{java.math.MathContext}} is missing 
when performing these operations. 
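For reference, a minimal Scala sketch of the same reproducer (illustrative only; 
it assumes a SparkSession {{spark}} with {{spark.sql.ansi.enabled=true}}, and 
whether it fails depends on whether Spark was built with Scala 2.12 or 2.13):

{code:scala}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val precision = 38
val schema = StructType(Seq(StructField("a", DecimalType(precision, 0), nullable = true)))
val decimalMin = new java.math.BigDecimal("-" + "9" * precision)

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(decimalMin))), schema)

// negating the minimum 38-digit value is where the overflow shows up on Scala 2.13 builds
df.selectExpr("a", "-a").show(truncate = false)
{code}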






[jira] [Updated] (SPARK-44166) Enable dynamicPartitionOverwrite in SaveAsHiveFile for insert overwrite

2023-06-24 Thread Pralabh Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pralabh Kumar updated SPARK-44166:
--
Description: 
Currently in InsertIntoHiveTable.scala there is no way to pass 
dynamicPartitionOverwrite as true when calling saveAsHiveFile. When 
dynamicPartitionOverwrite is true, Spark will use the built-in FileCommitProtocol 
instead of Hadoop's FileOutputCommitter, which is more performant.

Here is the proposed solution, for insert overwrite into a Hive table.

Current code:

{code:java}
val writtenParts = saveAsHiveFile(
  sparkSession = sparkSession,
  plan = child,
  hadoopConf = hadoopConf,
  fileFormat = fileFormat,
  outputLocation = tmpLocation.toString,
  partitionAttributes = partitionColumns,
  bucketSpec = bucketSpec,
  options = options)
       {code}
 

 

Proposed code: add a new config in HiveUtils, read below as 
enableDynamicPartitionOverwrite (completed here with a boolean default so it can 
be read via SQLConf):
{code:java}
val USE_FILECOMMITPROTOCOL_DYNAMIC_PARTITION_OVERWRITE =
  buildConf("spark.sql.hive.filecommit.dynamicPartitionOverwrite")
    .booleanConf
    .createWithDefault(false){code}
 
{code:java}
val enableDynamicPartitionOverwrite =
  SQLConf.get.getConf(HiveUtils.USE_FILECOMMITPROTOCOL_DYNAMIC_PARTITION_OVERWRITE)
logWarning(s"enableDynamicPartitionOverwrite: $enableDynamicPartitionOverwrite"){code}
 

 

Now, if enableDynamicPartitionOverwrite is true and numDynamicPartitions > 0 and 
overwrite is true, pass dynamicPartitionOverwrite = true:

{code:java}
val writtenParts = saveAsHiveFile(
  sparkSession = sparkSession,
  plan = child,
  hadoopConf = hadoopConf,
  fileFormat = fileFormat,
  outputLocation = tmpLocation.toString,
  partitionAttributes = partitionColumns,
  bucketSpec = bucketSpec,
  options = options,
  dynamicPartitionOverwrite =
    enableDynamicPartitionOverwrite && numDynamicPartitions > 0 && overwrite){code}
 

 

In saveAsHiveFile: 
{code:java}
val committer = FileCommitProtocol.instantiate(
      sparkSession.sessionState.conf.fileCommitProtocolClass,
      jobId = java.util.UUID.randomUUID().toString,
      outputPath = outputLocation,
      dynamicPartitionOverwrite = dynamicPartitionOverwrite) {code}
This will internally call SQLHadoopMapReduceCommitProtocol with 
dynamicPartitionOverwrite set to true: 

 
{code:java}
class SQLHadoopMapReduceCommitProtocol(
jobId: String,
path: String,
dynamicPartitionOverwrite: Boolean = false)
  extends HadoopMapReduceCommitProtocol(jobId, path, dynamicPartitionOverwrite) 
{code}
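How a user would opt in, assuming the proposal lands as described (the config key 
below comes from the proposed code above and does not exist in released Spark; 
the table names are placeholders, shown only to illustrate the intended usage):
{code:java}
spark.conf.set("spark.sql.hive.filecommit.dynamicPartitionOverwrite", "true")
spark.sql(
  "INSERT OVERWRITE TABLE target_table PARTITION (dt) SELECT id, value, dt FROM source_table")
{code}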
 

 

 

  was:
Currently in InsertIntoHiveTable.scala , there is no way to pass 
dynamicPartitionOverwrite to true , when calling  saveAsHiveFile . When 
dynamicPartitioOverwrite is true , spark will use  built-in FileCommitProtocol 
instead of Hadoop FileOutputCommitter , which is more performant. 

 

Here is the solution . 

When inserting overwrite into Hive table

 

Current code 

 
{code:java}
val writtenParts = saveAsHiveFile(
  sparkSession = sparkSession,
  plan = child,
  hadoopConf = hadoopConf,
  fileFormat = fileFormat,
  outputLocation = tmpLocation.toString,
  partitionAttributes = partitionColumns,
  bucketSpec = bucketSpec,
  options = options)
       {code}
 

 

Proposed code.  

enableDynamicPartitionOverwrite 

 
{code:java}
 val enableDynamicPartitionOverwrite =
      
SQLConf.get.getConf(HiveUtils.USE_FILECOMMITPROTOCOL_DYNAMIC_PARTITION_OVERWRITE)
    logWarning(s"enableDynamicPartitionOverwrite: 
$enableDynamicPartitionOverwrite"){code}
 

 

Now if enableDynamicPartitionOverwrite is true and numDynamicPartitions > 0 and 
overwrite is true , pass dynamicPartitionOverwrite true. 

 
{code:java}
val writtenParts = saveAsHiveFile( sparkSession = sparkSession, plan = child, 
hadoopConf = hadoopConf, fileFormat = fileFormat, outputLocation = 
tmpLocation.toString, partitionAttributes = partitionColumns, bucketSpec = 
bucketSpec, options = options, dynamicPartitionOverwrite =
        enableDynamicPartitionOverwrite && numDynamicPartitions > 0 && 
overwrite)       {code}
 

 

In saveAs File 
{code:java}
val committer = FileCommitProtocol.instantiate(
      sparkSession.sessionState.conf.fileCommitProtocolClass,
      jobId = java.util.UUID.randomUUID().toString,
      outputPath = outputLocation,
      dynamicPartitionOverwrite = dynamicPartitionOverwrite) {code}
This will internal call  with dynamicPartitionOverwrite value true. 

 
{code:java}
class SQLHadoopMapReduceCommitProtocol(
jobId: String,
path: String,
dynamicPartitionOverwrite: Boolean = false)
  extends HadoopMapReduceCommitProtocol(jobId, path, dynamicPartitionOverwrite) 
{code}
 

 

 


> Enable dynamicPartitionOverwrite in SaveAsHiveFile for insert overwrite
> ---
>
> Key: SPARK-44166
> URL: https://issues.apache.org/jira/browse/SPARK-44166
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: Pralabh Kumar
> 

[jira] [Updated] (SPARK-44166) Enable dynamicPartitionOverwrite in SaveAsHiveFile for insert overwrite

2023-06-24 Thread Pralabh Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pralabh Kumar updated SPARK-44166:
--
Description: 
Currently in InsertIntoHiveTable.scala , there is no way to pass 
dynamicPartitionOverwrite to true , when calling  saveAsHiveFile . When 
dynamicPartitioOverwrite is true , spark will use  built-in FileCommitProtocol 
instead of Hadoop FileOutputCommitter , which is more performant. 

 

Here is the solution . 

When inserting overwrite into Hive table

 

Current code 

 
{code:java}
val writtenParts = saveAsHiveFile(
  sparkSession = sparkSession,
  plan = child,
  hadoopConf = hadoopConf,
  fileFormat = fileFormat,
  outputLocation = tmpLocation.toString,
  partitionAttributes = partitionColumns,
  bucketSpec = bucketSpec,
  options = options)
       {code}
 

 

Proposed code.  

enableDynamicPartitionOverwrite 

 
{code:java}
 val enableDynamicPartitionOverwrite =
      
SQLConf.get.getConf(HiveUtils.USE_FILECOMMITPROTOCOL_DYNAMIC_PARTITION_OVERWRITE)
    logWarning(s"enableDynamicPartitionOverwrite: 
$enableDynamicPartitionOverwrite"){code}
 

 

Now if enableDynamicPartitionOverwrite is true and numDynamicPartitions > 0 and 
overwrite is true , pass dynamicPartitionOverwrite true. 

 
{code:java}
val writtenParts = saveAsHiveFile( sparkSession = sparkSession, plan = child, 
hadoopConf = hadoopConf, fileFormat = fileFormat, outputLocation = 
tmpLocation.toString, partitionAttributes = partitionColumns, bucketSpec = 
bucketSpec, options = options, dynamicPartitionOverwrite =
        enableDynamicPartitionOverwrite && numDynamicPartitions > 0 && 
overwrite)       {code}
 

 

In saveAs File 
{code:java}
val committer = FileCommitProtocol.instantiate(
      sparkSession.sessionState.conf.fileCommitProtocolClass,
      jobId = java.util.UUID.randomUUID().toString,
      outputPath = outputLocation,
      dynamicPartitionOverwrite = dynamicPartitionOverwrite) {code}
This will internal call  with dynamicPartitionOverwrite value true. 

 
{code:java}
class SQLHadoopMapReduceCommitProtocol(
jobId: String,
path: String,
dynamicPartitionOverwrite: Boolean = false)
  extends HadoopMapReduceCommitProtocol(jobId, path, dynamicPartitionOverwrite) 
{code}
 

 

 

  was:
Currently in InsertIntoHiveTable.scala , there is no way to pass 
dynamicPartitionOverwrite to true , when calling  saveAsHiveFile . When 
dynamicPartitioOverwrite is true , spark will use 
built-in FileCommitProtocol instead of Hadoop FileOutputCommitter , which is 
more performant. 

 

Here is the solution . 

When inserting overwrite into Hive table

 

Current code 

 
{code:java}
val writtenParts = saveAsHiveFile(
  sparkSession = sparkSession,
  plan = child,
  hadoopConf = hadoopConf,
  fileFormat = fileFormat,
  outputLocation = tmpLocation.toString,
  partitionAttributes = partitionColumns,
  bucketSpec = bucketSpec,
  options = options)
       {code}
 

 

Proposed code. 

 

 

 

 


> Enable dynamicPartitionOverwrite in SaveAsHiveFile for insert overwrite
> ---
>
> Key: SPARK-44166
> URL: https://issues.apache.org/jira/browse/SPARK-44166
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: Pralabh Kumar
>Priority: Minor
>
> Currently in InsertIntoHiveTable.scala , there is no way to pass 
> dynamicPartitionOverwrite to true , when calling  saveAsHiveFile . When 
> dynamicPartitioOverwrite is true , spark will use  built-in 
> FileCommitProtocol instead of Hadoop FileOutputCommitter , which is more 
> performant. 
>  
> Here is the solution . 
> When inserting overwrite into Hive table
>  
> Current code 
>  
> {code:java}
> val writtenParts = saveAsHiveFile(
>   sparkSession = sparkSession,
>   plan = child,
>   hadoopConf = hadoopConf,
>   fileFormat = fileFormat,
>   outputLocation = tmpLocation.toString,
>   partitionAttributes = partitionColumns,
>   bucketSpec = bucketSpec,
>   options = options)
>        {code}
>  
>  
> Proposed code.  
> enableDynamicPartitionOverwrite 
>  
> {code:java}
>  val enableDynamicPartitionOverwrite =
>       
> SQLConf.get.getConf(HiveUtils.USE_FILECOMMITPROTOCOL_DYNAMIC_PARTITION_OVERWRITE)
>     logWarning(s"enableDynamicPartitionOverwrite: 
> $enableDynamicPartitionOverwrite"){code}
>  
>  
> Now if enableDynamicPartitionOverwrite is true and numDynamicPartitions > 0 
> and overwrite is true , pass dynamicPartitionOverwrite true. 
>  
> {code:java}
> val writtenParts = saveAsHiveFile( sparkSession = sparkSession, plan = child, 
> hadoopConf = hadoopConf, fileFormat = fileFormat, outputLocation = 
> tmpLocation.toString, partitionAttributes = partitionColumns, bucketSpec = 
> bucketSpec, options = options, dynamicPartitionOverwrite =
>         

[jira] [Updated] (SPARK-44166) Enable dynamicPartitionOverwrite in SaveAsHiveFile for insert overwrite

2023-06-24 Thread Pralabh Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pralabh Kumar updated SPARK-44166:
--
Description: 
Currently in InsertIntoHiveTable.scala , there is no way to pass 
dynamicPartitionOverwrite to true , when calling  saveAsHiveFile . When 
dynamicPartitioOverwrite is true , spark will use 
built-in FileCommitProtocol instead of Hadoop FileOutputCommitter , which is 
more performant. 

 

Here is the solution . 

When inserting overwrite into Hive table

 

Current code 

 
{code:java}
val writtenParts = saveAsHiveFile(
  sparkSession = sparkSession,
  plan = child,
  hadoopConf = hadoopConf,
  fileFormat = fileFormat,
  outputLocation = tmpLocation.toString,
  partitionAttributes = partitionColumns,
  bucketSpec = bucketSpec,
  options = options)
       {code}
 

 

Proposed code. 

 

 

 

 

  was:
Currently in InsertIntoHiveTable.scala , there is no way to pass 
dynamicPartitionOverwrite to true , when calling  saveAsHiveFile . When 
dynamicPartitioOverwrite is true , spark will use 
built-in FileCommitProtocol instead of Hadoop FileOutputCommitter , which is 
more performant. 


> Enable dynamicPartitionOverwrite in SaveAsHiveFile for insert overwrite
> ---
>
> Key: SPARK-44166
> URL: https://issues.apache.org/jira/browse/SPARK-44166
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: Pralabh Kumar
>Priority: Minor
>
> Currently in InsertIntoHiveTable.scala , there is no way to pass 
> dynamicPartitionOverwrite to true , when calling  saveAsHiveFile . When 
> dynamicPartitioOverwrite is true , spark will use 
> built-in FileCommitProtocol instead of Hadoop FileOutputCommitter , which is 
> more performant. 
>  
> Here is the solution . 
> When inserting overwrite into Hive table
>  
> Current code 
>  
> {code:java}
> val writtenParts = saveAsHiveFile(
>   sparkSession = sparkSession,
>   plan = child,
>   hadoopConf = hadoopConf,
>   fileFormat = fileFormat,
>   outputLocation = tmpLocation.toString,
>   partitionAttributes = partitionColumns,
>   bucketSpec = bucketSpec,
>   options = options)
>        {code}
>  
>  
> Proposed code. 
>  
>  
>  
>  






[jira] [Created] (SPARK-44166) Enable dynamicPartitionOverwrite in SaveAsHiveFile for insert overwrite

2023-06-24 Thread Pralabh Kumar (Jira)
Pralabh Kumar created SPARK-44166:
-

 Summary: Enable dynamicPartitionOverwrite in SaveAsHiveFile for 
insert overwrite
 Key: SPARK-44166
 URL: https://issues.apache.org/jira/browse/SPARK-44166
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.4.1
Reporter: Pralabh Kumar


Currently in InsertIntoHiveTable.scala there is no way to pass 
dynamicPartitionOverwrite as true when calling saveAsHiveFile. When 
dynamicPartitionOverwrite is true, Spark will use the 
built-in FileCommitProtocol instead of Hadoop's FileOutputCommitter, which is 
more performant. 






[jira] [Created] (SPARK-44159) Commands for writing (InsertIntoHadoopFsRelationCommand and InsertIntoHiveTable) should log what they are doing

2023-06-23 Thread Navin Kumar (Jira)
Navin Kumar created SPARK-44159:
---

 Summary: Commands for writing (InsertIntoHadoopFsRelationCommand 
and InsertIntoHiveTable) should log what they are doing
 Key: SPARK-44159
 URL: https://issues.apache.org/jira/browse/SPARK-44159
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Navin Kumar


Improvements from SPARK-41763 decoupled the execution of create table and data 
writing commands in a CTAS (see SPARK-41713).

This means that while the code is cleaner, with the v1 write implementation limited 
to InsertIntoHadoopFsRelationCommand and InsertIntoHiveTable, the execution of 
these operations is less visible than it was before. Previously, the command was 
present in the physical plan (see the explain output below):
 
{{== Physical Plan ==}}
{{CommandResult }}
{{+- Execute CreateHiveTableAsSelectCommand [Database: default, TableName: 
test_hive_text_table, InsertIntoHiveTable]}}
{{+- *(1) Scan ExistingRDD[...]}}

But in Spark 3.4.0, this output is:

{{== Physical Plan ==}}
{{CommandResult }}
{{+- Execute CreateHiveTableAsSelectCommand}}
{{+- CreateHiveTableAsSelectCommand [Database: default, TableName: 
test_hive_text_table]}}
{{+- Project [...]}}
{{+- SubqueryAlias hive_input_table}}
{{+- View (`hive_input_table`, [...])}}
{{+- LogicalRDD [...], false}}

And the write command is now missing. This makes sense since execution is 
decoupled, but since there is no log output from InsertIntoHiveTable, there is 
no clear way to fully know that the command actually executed. 

I would propose that these commands add a log message at the INFO 
level that indicates how many rows were written into which table, to make it easier 
for a user to know what has happened from the Spark logs. Another option may be 
to update the explain output in Spark 3.4 to handle this, but that might be 
more difficult and make less sense since the operations are now decoupled.
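A rough sketch of the kind of INFO message this proposal has in mind (illustrative 
only; the object, method, and variable names below are hypothetical, not existing 
Spark code):

{code:scala}
import org.slf4j.LoggerFactory

// stand-in for the write commands (InsertIntoHiveTable / InsertIntoHadoopFsRelationCommand)
object WriteLoggingSketch {
  private val log = LoggerFactory.getLogger(getClass)

  def reportWrite(tableName: String, rowsWritten: Long, partitions: Seq[String]): Unit = {
    log.info(s"Wrote $rowsWritten row(s) into table $tableName " +
      s"(${partitions.size} dynamic partition(s): ${partitions.mkString(", ")})")
  }
}
{code}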






[jira] [Commented] (SPARK-43235) ClientDistributedCacheManager doesn't set the LocalResourceVisibility.PRIVATE if isPublic throws exception

2023-05-09 Thread Pralabh Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17720979#comment-17720979
 ] 

Pralabh Kumar commented on SPARK-43235:
---

Can anyone please look into this? If OK, I can create a PR for it. 

> ClientDistributedCacheManager doesn't set the LocalResourceVisibility.PRIVATE 
> if isPublic throws exception
> --
>
> Key: SPARK-43235
> URL: https://issues.apache.org/jira/browse/SPARK-43235
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Pralabh Kumar
>Priority: Minor
>
> Hi Spark Team .
> Currently *ClientDistributedCacheManager* *getVisibility* methods checks 
> whether resource visibility can be set to private or public. 
> In order to set  *LocalResourceVisibility.PUBLIC* ,isPublic checks permission 
> of all the ancestors directories for the executable directory . It goes till 
> the root folder to check permission of all the parents 
> (ancestorsHaveExecutePermissions) 
> checkPermissionOfOther calls  FileStatus getFileStatus to check the 
> permission .
> If the   FileStatus getFileStatus throws exception Spark Submit fails . It 
> didn't sets the permission to Private.
> if (isPublic(conf, uri, statCache)) {
>   LocalResourceVisibility.PUBLIC
> } else {
>   LocalResourceVisibility.PRIVATE
> }
> Generally if the user doesn't have permission to check for root folder 
> (specifically in case of cloud file system(GCS)  (for the buckets)  , methods 
> throws error IOException(Error accessing Bucket).
>  
> *Ideally if there is an error in isPublic , which means Spark isn't able to 
> determine the execution permission of all the parents directory , it should 
> set the LocalResourceVisibility.PRIVATE.  However, it currently throws an 
> exception in isPublic and hence Spark Submit fails*
>  
>  






[jira] [Updated] (SPARK-43235) ClientDistributedCacheManager doesn't set the LocalResourceVisibility.PRIVATE if isPublic throws exception

2023-05-01 Thread Pralabh Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pralabh Kumar updated SPARK-43235:
--
Description: 
Hi Spark Team .

Currently the *ClientDistributedCacheManager* *getVisibility* method checks 
whether resource visibility can be set to private or public. 

In order to set *LocalResourceVisibility.PUBLIC*, isPublic checks the permissions 
of all ancestor directories of the executable directory. It goes up to 
the root folder to check the permissions of all the parents 
(ancestorsHaveExecutePermissions). 

checkPermissionOfOther calls FileStatus getFileStatus to check the permission.

If getFileStatus throws an exception, Spark Submit fails. It 
does not set the permission to Private.

if (isPublic(conf, uri, statCache)) {
  LocalResourceVisibility.PUBLIC
} else {
  LocalResourceVisibility.PRIVATE
}

Generally, if the user doesn't have permission to check the root folder 
(specifically in the case of a cloud file system (GCS), for the buckets), the 
method throws IOException(Error accessing Bucket).

 

*Ideally, if there is an error in isPublic, which means Spark isn't able to 
determine the execute permission of all the parent directories, it should set 
LocalResourceVisibility.PRIVATE. However, it currently throws an exception 
in isPublic and hence Spark Submit fails.*
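
A hedged sketch of the proposed fallback (illustrative only; the stubbed isPublic 
below stands in for the real ClientDistributedCacheManager.isPublic, and the URI 
is a placeholder):

{code:scala}
import java.io.IOException
import scala.util.Try
import org.apache.hadoop.yarn.api.records.LocalResourceVisibility

// simulates the failing permission walk, e.g. on a bucket the caller cannot inspect
def isPublic(uri: String): Boolean =
  throw new IOException(s"Error accessing Bucket for $uri")

// fall back to PRIVATE instead of failing spark-submit
val visibility =
  if (Try(isPublic("gs://some-bucket/app.jar")).getOrElse(false)) LocalResourceVisibility.PUBLIC
  else LocalResourceVisibility.PRIVATE

println(visibility)  // PRIVATE
{code}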

 

 

  was:
Hi Spark Team .

Currently *ClientDistributedCacheManager* *getVisibility* methods checks 
whether resource visibility can be set to private or public. 

In order to set  *LocalResourceVisibility.PUBLIC* ,isPublic checks permission 
of all the ancestors directories for the executable directory . It goes till 
the root folder to check permission of all the parents 
(ancestorsHaveExecutePermissions) 

checkPermissionOfOther calls  FileStatus getFileStatus to check the permission .

If the   FileStatus getFileStatus throws exception Spark Submit fails . It 
didn't sets the permission to Private.

if (isPublic(conf, uri, statCache)) {
LocalResourceVisibility.PUBLIC
} else {
LocalResourceVisibility.PRIVATE
}

Generally if the user doesn't have permission to check for root folder 
(specifically in case of cloud file system(GCS)  (for the buckets)  , methods 
throws error IOException(Error accessing Bucket).

 

*Ideally if there is an error in isPublic , which means Spark isn't able to 
determine the execution permission of all the parents directory , it should set 
the LocalResourceVisibility.PRIVATE.  However, it currently throws an exception 
in isPublic and hence Spark Submit fails*

 

 


> ClientDistributedCacheManager doesn't set the LocalResourceVisibility.PRIVATE 
> if isPublic throws exception
> --
>
> Key: SPARK-43235
> URL: https://issues.apache.org/jira/browse/SPARK-43235
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Pralabh Kumar
>Priority: Minor
>
> Hi Spark Team .
> Currently *ClientDistributedCacheManager* *getVisibility* methods checks 
> whether resource visibility can be set to private or public. 
> In order to set  *LocalResourceVisibility.PUBLIC* ,isPublic checks permission 
> of all the ancestors directories for the executable directory . It goes till 
> the root folder to check permission of all the parents 
> (ancestorsHaveExecutePermissions) 
> checkPermissionOfOther calls  FileStatus getFileStatus to check the 
> permission .
> If the   FileStatus getFileStatus throws exception Spark Submit fails . It 
> didn't sets the permission to Private.
> if (isPublic(conf, uri, statCache)) {
>   LocalResourceVisibility.PUBLIC
> } else {
>   LocalResourceVisibility.PRIVATE
> }
> Generally if the user doesn't have permission to check for root folder 
> (specifically in case of cloud file system(GCS)  (for the buckets)  , methods 
> throws error IOException(Error accessing Bucket).
>  
> *Ideally if there is an error in isPublic , which means Spark isn't able to 
> determine the execution permission of all the parents directory , it should 
> set the LocalResourceVisibility.PRIVATE.  However, it currently throws an 
> exception in isPublic and hence Spark Submit fails*
>  
>  






[jira] [Commented] (SPARK-43235) ClientDistributedCacheManager doesn't set the LocalResourceVisibility.PRIVATE if isPublic throws exception

2023-05-01 Thread Pralabh Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17718163#comment-17718163
 ] 

Pralabh Kumar commented on SPARK-43235:
---

Gentle ping to review. I can create a PR for the same. 

> ClientDistributedCacheManager doesn't set the LocalResourceVisibility.PRIVATE 
> if isPublic throws exception
> --
>
> Key: SPARK-43235
> URL: https://issues.apache.org/jira/browse/SPARK-43235
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Pralabh Kumar
>Priority: Minor
>
> Hi Spark Team .
> Currently *ClientDistributedCacheManager* *getVisibility* methods checks 
> whether resource visibility can be set to private or public. 
> In order to set  *LocalResourceVisibility.PUBLIC* ,isPublic checks permission 
> of all the ancestors directories for the executable directory . It goes till 
> the root folder to check permission of all the parents 
> (ancestorsHaveExecutePermissions) 
> checkPermissionOfOther calls  FileStatus getFileStatus to check the 
> permission .
> If the   FileStatus getFileStatus throws exception Spark Submit fails . It 
> didn't sets the permission to Private.
> if (isPublic(conf, uri, statCache)) {
> LocalResourceVisibility.PUBLIC
> } else {
> LocalResourceVisibility.PRIVATE
> }
> Generally if the user doesn't have permission to check for root folder 
> (specifically in case of cloud file system(GCS)  (for the buckets)  , methods 
> throws error IOException(Error accessing Bucket).
>  
> *Ideally if there is an error in isPublic , which means Spark isn't able to 
> determine the execution permission of all the parents directory , it should 
> set the LocalResourceVisibility.PRIVATE.  However, it currently throws an 
> exception in isPublic and hence Spark Submit fails*
>  
>  






[jira] [Commented] (SPARK-43235) ClientDistributedCacheManager doesn't set the LocalResourceVisibility.PRIVATE if isPublic throws exception

2023-04-28 Thread Pralabh Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17717566#comment-17717566
 ] 

Pralabh Kumar commented on SPARK-43235:
---

[~gurwls223] Can you please look into this? 

> ClientDistributedCacheManager doesn't set the LocalResourceVisibility.PRIVATE 
> if isPublic throws exception
> --
>
> Key: SPARK-43235
> URL: https://issues.apache.org/jira/browse/SPARK-43235
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Pralabh Kumar
>Priority: Minor
>
> Hi Spark Team .
> Currently *ClientDistributedCacheManager* *getVisibility* methods checks 
> whether resource visibility can be set to private or public. 
> In order to set  *LocalResourceVisibility.PUBLIC* ,isPublic checks permission 
> of all the ancestors directories for the executable directory . It goes till 
> the root folder to check permission of all the parents 
> (ancestorsHaveExecutePermissions) 
> checkPermissionOfOther calls  FileStatus getFileStatus to check the 
> permission .
> If the   FileStatus getFileStatus throws exception Spark Submit fails . It 
> didn't sets the permission to Private.
> if (isPublic(conf, uri, statCache)) {
> LocalResourceVisibility.PUBLIC
> } else {
> LocalResourceVisibility.PRIVATE
> }
> Generally if the user doesn't have permission to check for root folder 
> (specifically in case of cloud file system(GCS)  (for the buckets)  , methods 
> throws error IOException(Error accessing Bucket).
>  
> *Ideally if there is an error in isPublic , which means Spark isn't able to 
> determine the execution permission of all the parents directory , it should 
> set the LocalResourceVisibility.PRIVATE.  However, it currently throws an 
> exception in isPublic and hence Spark Submit fails*
>  
>  






[jira] [Created] (SPARK-43235) ClientDistributedCacheManager doesn't set the LocalResourceVisibility.PRIVATE if isPublic throws exception

2023-04-22 Thread Pralabh Kumar (Jira)
Pralabh Kumar created SPARK-43235:
-

 Summary: ClientDistributedCacheManager doesn't set the 
LocalResourceVisibility.PRIVATE if isPublic throws exception
 Key: SPARK-43235
 URL: https://issues.apache.org/jira/browse/SPARK-43235
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Pralabh Kumar


Hi Spark Team .

Currently the *ClientDistributedCacheManager* *getVisibility* method checks 
whether resource visibility can be set to private or public. 

In order to set *LocalResourceVisibility.PUBLIC*, isPublic checks the permissions 
of all ancestor directories of the executable directory. It goes up to 
the root folder to check the permissions of all the parents 
(ancestorsHaveExecutePermissions). 

checkPermissionOfOther calls FileStatus getFileStatus to check the permission.

If getFileStatus throws an exception, Spark Submit fails. It 
does not set the permission to Private.

if (isPublic(conf, uri, statCache)) {
  LocalResourceVisibility.PUBLIC
} else {
  LocalResourceVisibility.PRIVATE
}

Generally, if the user doesn't have permission to check the root folder 
(specifically in the case of a cloud file system (GCS), for the buckets), the 
method throws IOException(Error accessing Bucket).

 

*Ideally, if there is an error in isPublic, which means Spark isn't able to 
determine the execute permission of all the parent directories, it should set 
LocalResourceVisibility.PRIVATE. However, it currently throws an exception 
in isPublic and hence Spark Submit fails.*

 

 






[jira] [Commented] (SPARK-36728) Can't create datetime object from anything other then year column Pyspark - koalas

2023-01-15 Thread Pralabh Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677198#comment-17677198
 ] 

Pralabh Kumar commented on SPARK-36728:
---

[~gurwls223] I think this can be closed, as it was fixed as part of SPARK-36742.

> Can't create datetime object from anything other then year column Pyspark - 
> koalas
> --
>
> Key: SPARK-36728
> URL: https://issues.apache.org/jira/browse/SPARK-36728
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Bjørn Jørgensen
>Priority: Major
> Attachments: pyspark_date.txt, pyspark_date2.txt
>
>
> If I create a datetime object it must be from columns named year.
>  
> df = ps.DataFrame({'year': [2015, 2016],
>                    'month': [2, 3],
>                    'day': [4, 5],
>                    'hour': [2, 3],
>                    'minute': [10, 30],
>                    'second': [21, 25]})
> df.info()
> Int64Index: 2 entries, 1 to 0
> Data columns (total 6 columns):
>  #   Column  Non-Null Count  Dtype
> ---  ------  --------------  -----
>  0   year    2 non-null      int64
>  1   month   2 non-null      int64
>  2   day     2 non-null      int64
>  3   hour    2 non-null      int64
>  4   minute  2 non-null      int64
>  5   second  2 non-null      int64
> dtypes: int64(6)
> df['date'] = ps.to_datetime(df[['year', 'month', 'day']])
> df.info()
> Int64Index: 2 entries, 1 to 0
> Data columns (total 7 columns):
>  #   Column  Non-Null Count  Dtype
> ---  ------  --------------  -----
>  0   year    2 non-null      int64
>  1   month   2 non-null      int64
>  2   day     2 non-null      int64
>  3   hour    2 non-null      int64
>  4   minute  2 non-null      int64
>  5   second  2 non-null      int64
>  6   date    2 non-null      datetime64
> dtypes: datetime64(1), int64(6)
> df_test = ps.DataFrame({'testyear': [2015, 2016],
>                         'testmonth': [2, 3],
>                         'testday': [4, 5],
>                         'hour': [2, 3],
>                         'minute': [10, 30],
>                         'second': [21, 25]})
> df_test['date'] = ps.to_datetime(df[['testyear', 'testmonth', 'testday']])
> ---------------------------------------------------------------------------
> KeyError                                  Traceback (most recent call last)
> /tmp/ipykernel_73/904491906.py in 
> ----> 1 df_test['date'] = ps.to_datetime(df[['testyear', 'testmonth', 'testday']])
> /opt/spark/python/pyspark/pandas/frame.py in __getitem__(self, key)
>   11853             return self.loc[:, key]
>   11854         elif is_list_like(key):
> ---> 11855             return self.loc[:, list(key)]
>   11856         raise NotImplementedError(key)
>   11857
> /opt/spark/python/pyspark/pandas/indexing.py in __getitem__(self, key)
>     476                 returns_series,
>     477                 series_name,
> --> 478             ) = self._select_cols(cols_sel)
>     479
>     480             if cond is None and limit is None and returns_series:
> /opt/spark/python/pyspark/pandas/indexing.py in _select_cols(self, cols_sel, missing_keys)
>     322             return self._select_cols_else(cols_sel, missing_keys)
>     323         elif is_list_like(cols_sel):
> --> 324             return self._select_cols_by_iterable(cols_sel, missing_keys)
>     325         else:
>     326             return self._select_cols_else(cols_sel, missing_keys)
> /opt/spark/python/pyspark/pandas/indexing.py in _select_cols_by_iterable(self, cols_sel, missing_keys)
>    1352                 if not found:
>    1353                     if missing_keys is None:
> -> 1354                         raise KeyError("['{}'] not in index".format(name_like_string(key)))
>    1355                     else:
>    1356                         missing_keys.append(key)
> KeyError: "['testyear'] not in index"
> df_test
>   testyear testmonth testday hour minute second
> 0 2015     2         4       2    10     21
> 1 2016     3         5       3    30     25






[jira] [Commented] (SPARK-36728) Can't create datetime object from anything other then year column Pyspark - koalas

2023-01-15 Thread Pralabh Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677187#comment-17677187
 ] 

Pralabh Kumar commented on SPARK-36728:
---

I think this issue is not reproducible on Spark 3.4. Please confirm. 

> Can't create datetime object from anything other then year column Pyspark - 
> koalas
> --
>
> Key: SPARK-36728
> URL: https://issues.apache.org/jira/browse/SPARK-36728
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Bjørn Jørgensen
>Priority: Major
> Attachments: pyspark_date.txt, pyspark_date2.txt
>
>
> If I create a datetime object it must be from columns named year.
>  
> df = ps.DataFrame(\{'year': [2015, 2016],df = ps.DataFrame({'year': [2015, 
> 2016],                   'month': [2, 3],                    'day': [4, 5],   
>                  'hour': [2, 3],                    'minute': [10, 30],       
>              'second': [21,25]}) df.info()
> Int64Index: 2 entries, 1 to 0Data 
> columns (total 6 columns): #   Column  Non-Null Count  Dtype---  --  
> --  - 0   year    2 non-null      int64 1   month   2 
> non-null      int64 2   day     2 non-null      int64 3   hour    2 non-null  
>     int64 4   minute  2 non-null      int64 5   second  2 non-null      
> int64dtypes: int64(6)
> df['date'] = ps.to_datetime(df[['year', 'month', 'day']])
> df.info()
> Int64Index: 2 entries, 1 to 0Data 
> columns (total 7 columns): #   Column  Non-Null Count  Dtype     ---  --  
> --  -      0   year    2 non-null      int64      1   month   
> 2 non-null      int64      2   day     2 non-null      int64      3   hour    
> 2 non-null      int64      4   minute  2 non-null      int64      5   second  
> 2 non-null      int64      6   date    2 non-null      datetime64dtypes: 
> datetime64(1), int64(6)
> df_test = ps.DataFrame({'testyear': [2015, 2016],
>                         'testmonth': [2, 3],
>                         'testday': [4, 5],
>                         'hour': [2, 3],
>                         'minute': [10, 30],
>                         'second': [21, 25]})
> df_test['date'] = ps.to_datetime(df[['testyear', 'testmonth', 'testday']])
> ---------------------------------------------------------------------------
> KeyError                                  Traceback (most recent call last)
> /tmp/ipykernel_73/904491906.py in <module>
> ----> 1 df_test['date'] = ps.to_datetime(df[['testyear', 'testmonth', 'testday']])
> /opt/spark/python/pyspark/pandas/frame.py in __getitem__(self, key)
>   11853             return self.loc[:, key]
>   11854         elif is_list_like(key):
> > 11855             return self.loc[:, list(key)]
>   11856         raise NotImplementedError(key)
>   11857
> /opt/spark/python/pyspark/pandas/indexing.py in __getitem__(self, key)
>     476                 returns_series,
>     477                 series_name,
> --> 478             ) = self._select_cols(cols_sel)
>     479
>     480             if cond is None and limit is None and returns_series:
> /opt/spark/python/pyspark/pandas/indexing.py in _select_cols(self, cols_sel, missing_keys)
>     322             return self._select_cols_else(cols_sel, missing_keys)
>     323         elif is_list_like(cols_sel):
> --> 324             return self._select_cols_by_iterable(cols_sel, missing_keys)
>     325         else:
>     326             return self._select_cols_else(cols_sel, missing_keys)
> /opt/spark/python/pyspark/pandas/indexing.py in _select_cols_by_iterable(self, cols_sel, missing_keys)
>    1352                 if not found:
>    1353                     if missing_keys is None:
> -> 1354                         raise KeyError("['{}'] not in index".format(name_like_string(key)))
>    1355                     else:
>    1356                         missing_keys.append(key)
> KeyError: "['testyear'] not in index"
> df_test
>    testyear  testmonth  testday  hour  minute  second
> 0      2015          2        4     2      10      21
> 1      2016          3        5     3      30      25
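
For what it's worth, a minimal workaround sketch (my own example, not part of the original report): pandas' {{to_datetime}} only assembles dates from columns named year/month/day/..., and the repro above also selects from {{df}} rather than {{df_test}}, so renaming the columns (and selecting from the right frame) avoids the KeyError:

{code:python}
import pyspark.pandas as ps

df_test = ps.DataFrame({'testyear': [2015, 2016], 'testmonth': [2, 3],
                        'testday': [4, 5], 'hour': [2, 3],
                        'minute': [10, 30], 'second': [21, 25]})

# Rename to the column names to_datetime expects before assembling the date.
dates = ps.to_datetime(
    df_test[['testyear', 'testmonth', 'testday']]
        .rename(columns={'testyear': 'year', 'testmonth': 'month', 'testday': 'day'}))
print(dates)
{code}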



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41232) High-order function: array_append

2022-11-30 Thread Senthil Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17641272#comment-17641272
 ] 

Senthil Kumar commented on SPARK-41232:
---

[~podongfeng] Shall I work on this?

> High-order function: array_append
> -
>
> Key: SPARK-41232
> URL: https://issues.apache.org/jira/browse/SPARK-41232
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> refer to 
> https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_append.html
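
For context, a rough sketch (my own illustration, not from the ticket) of the intended semantics; until the new function is available, {{concat}} with a one-element array produces the same result for this simple, non-null case:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([([1, 2, 3],)], ["arr"])

# Behaves like the proposed array_append(arr, 4) for non-null inputs.
df.selectExpr("concat(arr, array(4)) AS appended").show()
# appended -> [1, 2, 3, 4]
{code}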



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23497) Sparklyr Applications doesn't disconnect spark driver in client mode

2022-11-16 Thread bharath kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635096#comment-17635096
 ] 

bharath kumar commented on SPARK-23497:
---

This is no longer a problem; I tested recently with SageMaker and Livy with
RStudio.

> Sparklyr Applications doesn't disconnect spark driver in client mode
> 
>
> Key: SPARK-23497
> URL: https://issues.apache.org/jira/browse/SPARK-23497
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 2.1.0
>Reporter: bharath kumar
>Priority: Major
>
> Hello,
> When we use Sparklyr to connect to the YARN cluster manager in client or
> cluster mode, the Spark driver will not disconnect unless we call
> spark_disconnect(sc) in the code.
> Does it make sense to add a timeout feature for the driver to exit after a
> certain amount of time, in client or cluster mode? I think it only happens
> with connections from Sparklyr to YARN. Sometimes the driver stays there for
> weeks and holds on to minimum resources.
> *More  Details:*
> Yarn -2.7.0
> Spark -2.1.0
> Rversion:
> Microsoft R Open 3.4.2
> Rstudio Version:
> rstudio-server-1.1.414-1.x86_64
> yarn application -status application_id
> 18/01/22 09:08:45 INFO client.MapRZKBasedRMFailoverProxyProvider: Updated RM 
> address to resourcemanager.com/resourcemanager:8032
>  
> Application Report : 
>     Application-Id : application_id
>     Application-Name : sparklyr
>     Application-Type : SPARK
>     User : userid
>     Queue : root.queuename
>     Start-Time : 1516245523965
>     Finish-Time : 0
>     Progress : 0%
>     State : RUNNING
>     Final-State : UNDEFINED
>     Tracking-URL : N/A
>     RPC Port : -1
>     AM Host : N/A
>     Aggregate Resource Allocation :266468 MB-seconds, 59 vcore-seconds
>     Diagnostics : N/A
>  
> [http://spark.rstudio.com/]
>  
> I can provide more details if required
>  
> Thanks,
> Bharath



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40367) Total size of serialized results of 3730 tasks (64.0 GB) is bigger than spark.driver.maxResultSize (64.0 GB)

2022-09-18 Thread Senthil Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606414#comment-17606414
 ] 

Senthil Kumar commented on SPARK-40367:
---

Hi [~jackyjfhu] 

 

Check whether you are sending more bytes/rows to the driver than
"spark.driver.maxResultSize" allows. If so, keep increasing
"spark.driver.maxResultSize" until the issue goes away, but be careful that it
does not exceed the driver memory.

 

_Note: driver-memory > spark.driver.maxResultSize > row/bytes sent to driver_
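
For illustration, a minimal sketch of where that setting goes (the values below are placeholders, not recommendations):

{code:python}
from pyspark.sql import SparkSession

# Keep the ordering from the note above:
#   driver memory > spark.driver.maxResultSize > size of results sent to the driver.
# spark.driver.memory itself is normally passed on spark-submit
# (e.g. --driver-memory 80g), since it must be known before the driver JVM starts.
spark = (SparkSession.builder
         .config("spark.driver.maxResultSize", "64g")
         .getOrCreate())
{code}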

>  Total size of serialized results of 3730 tasks (64.0 GB) is bigger than 
> spark.driver.maxResultSize (64.0 GB)
> -
>
> Key: SPARK-40367
> URL: https://issues.apache.org/jira/browse/SPARK-40367
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: jackyjfhu
>Priority: Blocker
>
> When I run this code:
> {{spark.sql("xx").selectExpr(spark.table(target).columns:_*).write.mode("overwrite").insertInto(target)}}
> I get the following error:
>  
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Total size of serialized results of 3730 tasks (64.0 GB) is bigger than 
> spark.driver.maxResultSize (64.0 GB)
>     at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1609)
>     at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1597)
>     at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1596)
>     at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>     at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1596)
>     at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>     at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>     at scala.Option.foreach(Option.scala:257)
>     at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
>     at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1830)
>     at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1779)
>     at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1768)
>     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>     at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
>     at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>     at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
>     at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
>     at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectIterator(SparkPlan.scala:304)
>     at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:76)
>     at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:73)
>     at 
> org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:97)
>     at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:72)
>     at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:72)
>     at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>     at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  
> --conf spark.driver.maxResultSize=64g
> --conf spark.sql.broadcastTimeout=36000
> -conf spark.sql.autoBroadcastJoinThreshold=204857600 
> --conf spark.memory.offHeap.enabled=true
> --conf spark.memory.offHeap.size=4g
> --num-executors 500
> 

[jira] [Created] (SPARK-40277) Use DataFrame's column for referring to DDL schema for from_csv() and from_json()

2022-08-30 Thread Jayant Kumar (Jira)
Jayant Kumar created SPARK-40277:


 Summary: Use DataFrame's column for referring to DDL schema for 
from_csv() and from_json()
 Key: SPARK-40277
 URL: https://issues.apache.org/jira/browse/SPARK-40277
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Jayant Kumar


With Spark's DataFrame API one has to explicitly pass the StructType to
functions like from_csv and from_json. This works okay in general.

In certain circumstances, when the schema depends on one of the DataFrame's
fields, it gets complicated and one has to switch to RDDs. This requires
additional libraries and additional parsing logic.

I am trying to explore a way to enable such use cases with the DataFrame API
and the functions themselves.
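
For context, a minimal sketch of the current API (the column names are my own, hypothetical): the schema has to be supplied up front as a StructType or a literal DDL string and cannot be taken from another column of the DataFrame, which is the limitation this ticket is about:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('{"a": 1, "b": "x"}',)], ["payload"])

# The DDL schema is a fixed literal known at query-construction time.
parsed = df.select(from_json(col("payload"), "a INT, b STRING").alias("parsed"))
parsed.show(truncate=False)
{code}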



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39976) NULL check in ArrayIntersect adds extraneous null from first param

2022-08-03 Thread Navin Kumar (Jira)
Navin Kumar created SPARK-39976:
---

 Summary: NULL check in ArrayIntersect adds extraneous null from 
first param
 Key: SPARK-39976
 URL: https://issues.apache.org/jira/browse/SPARK-39976
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: Navin Kumar


This is very likely a regression from SPARK-36829.

When using {{array_intersect(a, b)}}, if the first parameter contains a 
{{NULL}} value and the second one does not, an extraneous {{NULL}} is present 
in the output. This also leads to {{array_intersect(a, b) != array_intersect(b, 
a)}} which is incorrect as set intersection should be commutative.

Example using PySpark:

{code:python}
>>> a = [1, 2, 3]
>>> b = [3, None, 5]
>>> df = spark.sparkContext.parallelize([(a, b)]).toDF(["a","b"])
>>> df.show()
+-++
|a|   b|
+-++
|[1, 2, 3]|[3, null, 5]|
+-++

>>> df.selectExpr("array_intersect(a,b)").show()
+-+
|array_intersect(a, b)|
+-+
|  [3]|
+-+

>>> df.selectExpr("array_intersect(b,a)").show()
+-+
|array_intersect(b, a)|
+-+
|[3, null]|
+-+
{code}

Note that in the first case, {{a}} does not contain a {{NULL}}, and the final 
output is correct: {{[3]}}. In the second case, {{b}} does contain a {{NULL}} 
and is now the first parameter, so the extraneous {{NULL}} shows up in the output.

The same behavior occurs in Scala when writing to Parquet:


{code:scala}
scala> val a = Array[java.lang.Integer](1, 2, null, 4)
a: Array[Integer] = Array(1, 2, null, 4)

scala> val b = Array[java.lang.Integer](4, 5, 6, 7)
b: Array[Integer] = Array(4, 5, 6, 7)

scala> val df = Seq((a, b)).toDF("a","b")
df: org.apache.spark.sql.DataFrame = [a: array, b: array]

scala> df.write.parquet("/tmp/simple.parquet")

scala> val df = spark.read.parquet("/tmp/simple.parquet")
df: org.apache.spark.sql.DataFrame = [a: array, b: array]

scala> df.show()
+---++
|  a|   b|
+---++
|[1, 2, null, 4]|[4, 5, 6, 7]|
+---++


scala> df.selectExpr("array_intersect(a,b)").show()
+-+
|array_intersect(a, b)|
+-+
|[null, 4]|
+-+


scala> df.selectExpr("array_intersect(b,a)").show()
+-+
|array_intersect(b, a)|
+-+
|  [4]|
+-+
{code}
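
Until this is resolved, one possible workaround sketch (my own, not from the report) is to strip {{NULL}} values from the first argument with the {{filter}} higher-order function before intersecting:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([([3, None, 5], [1, 2, 3])],
                           "b: array<int>, a: array<int>")

# Dropping NULLs from the first argument keeps the result commutative: [3].
df.selectExpr("array_intersect(filter(b, x -> x IS NOT NULL), a) AS i").show()
{code}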





--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39845) 0.0 and -0.0 are not consistent in set operations

2022-07-27 Thread Navin Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Navin Kumar updated SPARK-39845:

Description: 
This is a continuation of the issue described in SPARK-32110.

When using Array set-based functions {{array_union}}, {{array_intersect}}, 
{{array_except}} and {{arrays_overlap}}, {{0.0}} and {{-0.0}} have inconsistent 
behavior.

When parsed, {{-0.0}} is normalized to {{0.0}}. Therefore if I use 
{{array_union}} for example with these values directly, {{array(-0.0)}} becomes 
{{array(0.0)}}. See the example below using {{array_union}}:

{code:java}
scala> val df = spark.sql("SELECT array_union(array(0.0), array(-0.0))")
df: org.apache.spark.sql.DataFrame = [array_union(array(0.0), array(0.0)): 
array]
scala> df.collect()
res2: Array[org.apache.spark.sql.Row] = Array([WrappedArray(0.0)])
{code}

In this case, {{0.0}} and {{-0.0}} are considered equal and the union of the 
arrays produces a single value: {{0.0}}.

However, if I try this operation using a constructed dataframe, these values 
are not equal, and the result is an array with both {{0.0}} and {{-0.0}}.

{code:java}
scala> val df = List((Array(0.0), Array(-0.0))).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: array, b: array]

scala> df.selectExpr("array_union(a, b)").collect()
res3: Array[org.apache.spark.sql.Row] = Array([WrappedArray(0.0, -0.0)])
{code}

For {{arrays_overlap}}, here is a similar version of that inconsistency:

{code:java}
scala> val df = spark.sql("SELECT arrays_overlap(array(0.0), array(-0.0))")
df: org.apache.spark.sql.DataFrame = [arrays_overlap(array(0.0), array(0.0)): 
boolean]

scala> df.collect
res4: Array[org.apache.spark.sql.Row] = Array([true])
{code}

{code:java}
scala> val df = List((Array(0.0), Array(-0.0))).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: array, b: array]

scala> df.selectExpr("arrays_overlap(a, b)")
res5: org.apache.spark.sql.DataFrame = [arrays_overlap(a, b): boolean]

scala> df.selectExpr("arrays_overlap(a, b)").collect
res6: Array[org.apache.spark.sql.Row] = Array([false])
{code}

It looks like this is due to the fact that in the constructed dataframe case, 
the Double value is hashed by using {{java.lang.Double.doubleToLongBits}}, 
which will treat {{0.0}} and {{-0.0}} as distinct because of the sign bit.

See here for more information: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala#L312-L321

I can also confirm that the same behavior occurs with FloatType and the use of 
{{java.lang.Float.floatToIntBits}}
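
For illustration (a standalone sketch outside Spark, my own): the two zeros compare equal but differ in the raw IEEE-754 sign bit, which is exactly what a doubleToLongBits-based hash sees:

{code:python}
import struct

print(0.0 == -0.0)                   # True: they compare equal
print(struct.pack(">d", 0.0).hex())  # 0000000000000000
print(struct.pack(">d", -0.0).hex()) # 8000000000000000 (sign bit set)
{code}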

  was:
This is a continuation of the issue described in SPARK-32110.

When using Array set-based functions {{array_union}}, {{array_intersect}}, 
{{array_except}} and {{arrays_overlap}}, {{0.0}} and {{-0.0}} have inconsistent 
behavior.

When parsed, {{-0.0}} is normalized to {{0.0}}. Therefore if I use 
{{array_union}} for example with these values directly, {{array(-0.0)}} becomes 
{{array(0.0)}}. See the example below using {{array_union}}:

{code:java}
scala> val df = spark.sql("SELECT array_union(array(0.0), array(-0.0))")
df: org.apache.spark.sql.DataFrame = [array_union(array(0.0), array(0.0)): 
array]
scala> df.collect()
res2: Array[org.apache.spark.sql.Row] = Array([WrappedArray(0.0)])
{code}

In this case, {{0.0}} and {{-0.0}} are considered equal and the union of the 
arrays produces a single value: {{0.0}}.

However, if I try this operation using a constructed dataframe, these values 
are not equal, and the result is an array with both {{0.0}} and {{-0.0}}.

{code:java}
scala> val df = List((Array(0.0), Array(-0.0))).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: array, b: array]

scala> df.selectExpr("array_union(a, b)").collect()
res3: Array[org.apache.spark.sql.Row] = Array([WrappedArray(0.0, -0.0)])
{code}

For {{arrays_overlap}}, here is a similar version of that inconsistency:

{code:java}
scala> val df = spark.sql("SELECT arrays_overlap(array(0.0), array(-0.0))")
df: org.apache.spark.sql.DataFrame = [arrays_overlap(array(0.0), array(0.0)): 
boolean]

scala> df.collect
res4: Array[org.apache.spark.sql.Row] = Array([true])
{code}

{code:java}
scala> val df = List((Array(0.0), Array(-0.0))).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: array, b: array]

scala> df.selectExpr("arrays_overlap(a, b)")
res5: org.apache.spark.sql.DataFrame = [arrays_overlap(a, b): boolean]

scala> df.selectExpr("arrays_overlap(a, b)").collect
res6: Array[org.apache.spark.sql.Row] = Array([false])
{code}

It looks like this is due to the fact that in the constructed dataframe case, 
the Double value is hashed by using {{java.lang.Double.doubleToLongBits}}, 
which will treat {{0.0}} and {{-0.0}} as distinct because of the sign bit.

See here for more information: 

[jira] [Updated] (SPARK-39845) 0.0 and -0.0 are not consistent in set operations

2022-07-27 Thread Navin Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Navin Kumar updated SPARK-39845:

Description: 
This is a continuation of the issue described in SPARK-32110.

When using Array set-based functions {{array_union}}, {{array_intersect}}, 
{{array_except}} and {{arrays_overlap}}, {{0.0}} and {{-0.0}} have inconsistent 
behavior.

When parsed, {{-0.0}} is normalized to {{0.0}}. Therefore if I use 
{{array_union}} for example with these values directly, {{array(-0.0)}} becomes 
{{array(0.0)}}. See the example below using {{array_union}}:

{code:java}
scala> val df = spark.sql("SELECT array_union(array(0.0), array(-0.0))")
df: org.apache.spark.sql.DataFrame = [array_union(array(0.0), array(0.0)): 
array]
scala> df.collect()
res2: Array[org.apache.spark.sql.Row] = Array([WrappedArray(0.0)])
{code}

In this case, {{0.0}} and {{-0.0}} are considered equal and the union of the 
arrays produces a single value: {{0.0}}.

However, if I try this operation using a constructed dataframe, these values 
are not equal, and the result is an array with both {{0.0}} and {{-0.0}}.

{code:java}
scala> val df = List((Array(0.0), Array(-0.0))).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: array, b: array]

scala> df.selectExpr("array_union(a, b)").collect()
res3: Array[org.apache.spark.sql.Row] = Array([WrappedArray(0.0, -0.0)])
{code}

For {{arrays_overlap}}, here is a similar version of that inconsistency:

{code:java}
scala> val df = spark.sql("SELECT arrays_overlap(array(0.0), array(-0.0))")
df: org.apache.spark.sql.DataFrame = [arrays_overlap(array(0.0), array(0.0)): 
boolean]

scala> df.collect
res4: Array[org.apache.spark.sql.Row] = Array([true])
{code}

{code:java}
scala> val df = List((Array(0.0), Array(-0.0))).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: array, b: array]

scala> df.selectExpr("arrays_overlap(a, b)")
res5: org.apache.spark.sql.DataFrame = [arrays_overlap(a, b): boolean]

scala> df.selectExpr("arrays_overlap(a, b)").collect
res6: Array[org.apache.spark.sql.Row] = Array([false])
{code}

It looks like this is due to the fact that in the constructed dataframe case, 
the Double value is hashed by using {{java.lang.Double.doubleToLongBits}}, 
which will treat {{0.0}} and {{-0.0}} as distinct because of the sign bit.

See here for more information: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala#L312-L321

I can also confirm that the same behavior occurs with FloatType and the use of 
{{java.lang.Float.floatToIntBits}}

  was:
This is a continuation of the issue described in SPARK-32110.

When using Array set-based functions {{array_union}}, {{array_intersect}}, 
{{array_except}} and {{arrays_overlap}}, {{0.0}} and {{-0.0}} have inconsistent 
behavior.

When parsed, {{-0.0}} is normalized to {{0.0}}. Therefore if I use 
{{array_union}} for example with these values directly, {{array(-0.0)}} becomes 
{{array(0.0)}}. See the example below using {{array_union}}:

{code:java}
scala> val df = spark.sql("SELECT array_union(array(0.0), array(-0.0))")
df: org.apache.spark.sql.DataFrame = [array_union(array(0.0), array(0.0)): 
array]
scala> df.collect()
res2: Array[org.apache.spark.sql.Row] = Array([WrappedArray(0.0)])
{code}

In this case, {{0.0}} and {{-0.0}} are considered equal and the union of the 
arrays produces a single value: {{0.0}}.

However, if I try this operation using a constructed dataframe, these values 
are not equal, and the result is an array with both {{0.0}} and {{-0.0}}.

{code:java}
scala> val df = List((Array(0.0), Array(-0.0))).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: array, b: array]

scala> df.selectExpr("array_union(a, b)").collect()
res3: Array[org.apache.spark.sql.Row] = Array([WrappedArray(0.0, -0.0)])
{code}

For {{arrays_overlap}}, here is a similar version of that inconsistency:

{code:java}
scala> val df = spark.sql("SELECT arrays_overlap(array(0.0), array(-0.0))")
df: org.apache.spark.sql.DataFrame = [arrays_overlap(array(0.0), array(0.0)): 
boolean]

scala> df.collect
res4: Array[org.apache.spark.sql.Row] = Array([true])
{code}

{code:java}
scala> val df = List((Array(0.0), Array(-0.0))).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: array, b: array]

scala> df.selectExpr("arrays_overlap(a, b)")
res5: org.apache.spark.sql.DataFrame = [arrays_overlap(a, b): boolean]

scala> df.selectExpr("arrays_overlap(a, b)").collect
res6: Array[org.apache.spark.sql.Row] = Array([false])
{code}

It looks like this is due to the fact that in the constructed dataframe case, 
the Double value is hashed by using {{java.lang.Double.doubleToLongBits}}, 
which will treat {{0.0}} and {{-0.0}} as distinct because of the sign bit.

See here for more information: 

[jira] [Created] (SPARK-39845) 0.0 and -0.0 are not consistent in set operations

2022-07-22 Thread Navin Kumar (Jira)
Navin Kumar created SPARK-39845:
---

 Summary: 0.0 and -0.0 are not consistent in set operations 
 Key: SPARK-39845
 URL: https://issues.apache.org/jira/browse/SPARK-39845
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.1
Reporter: Navin Kumar


This is a continuation of the issue described in SPARK-32110.

When using Array set-based functions {{array_union}}, {{array_intersect}}, 
{{array_except}} and {{arrays_overlap}}, {{0.0}} and {{-0.0}} have inconsistent 
behavior.

When parsed, {{-0.0}} is normalized to {{0.0}}. Therefore if I use 
{{array_union}} for example with these values directly, {{array(-0.0)}} becomes 
{{array(0.0)}}. See the example below using {{array_union}}:

{code:java}
scala> val df = spark.sql("SELECT array_union(array(0.0), array(-0.0))")
df: org.apache.spark.sql.DataFrame = [array_union(array(0.0), array(0.0)): 
array]
scala> df.collect()
res2: Array[org.apache.spark.sql.Row] = Array([WrappedArray(0.0)])
{code}

In this case, {{0.0}} and {{-0.0}} are considered equal and the union of the 
arrays produces a single value: {{0.0}}.

However, if I try this operation using a constructed dataframe, these values 
are not equal, and the result is an array with both {{0.0}} and {{-0.0}}.

{code:java}
scala> val df = List((Array(0.0), Array(-0.0))).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: array, b: array]

scala> df.selectExpr("array_union(a, b)").collect()
res3: Array[org.apache.spark.sql.Row] = Array([WrappedArray(0.0, -0.0)])
{code}

For {{arrays_overlap}}, here is a similar version of that inconsistency:

{code:java}
scala> val df = spark.sql("SELECT arrays_overlap(array(0.0), array(-0.0))")
df: org.apache.spark.sql.DataFrame = [arrays_overlap(array(0.0), array(0.0)): 
boolean]

scala> df.collect
res4: Array[org.apache.spark.sql.Row] = Array([true])
{code}

{code:java}
scala> val df = List((Array(0.0), Array(-0.0))).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: array, b: array]

scala> df.selectExpr("arrays_overlap(a, b)")
res5: org.apache.spark.sql.DataFrame = [arrays_overlap(a, b): boolean]

scala> df.selectExpr("arrays_overlap(a, b)").collect
res6: Array[org.apache.spark.sql.Row] = Array([false])
{code}

It looks like this is due to the fact that in the constructed dataframe case, 
the Double value is hashed by using {{java.lang.Double.doubleToLongBits}}, 
which will treat {{0.0}} and {{-0.0}} as distinct because of the sign bit.

See here for more information: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala#L312-L321

I can also confirm that the same behavior occurs with FloatType and the use of 
{{java.lang.Float.floatToIntBits}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38213) support Metrics information report to kafkaSink.

2022-02-14 Thread Senthil Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492391#comment-17492391
 ] 

Senthil Kumar commented on SPARK-38213:
---

Working on this

> support Metrics information report to kafkaSink.
> 
>
> Key: SPARK-38213
> URL: https://issues.apache.org/jira/browse/SPARK-38213
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: YuanGuanhu
>Priority: Major
>
> Spark currently supports ConsoleSink/CsvSink/GraphiteSink/JmxSink etc. Now we want to
> report metrics information to Kafka; we can work on supporting this.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34511) Current Security vulnerabilities in spark libraries

2022-02-09 Thread Abhinav Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489544#comment-17489544
 ] 

Abhinav Kumar commented on SPARK-34511:
---

Updating to Spark 3.2.1 does solve most of the issues. Critical ones left in 
3.2.1 are log4j 1.2.17 and htrace-core4-4.1.0-incubating (SPARK-38061).

We still have medium-severity vulnerabilities.

> Current Security vulnerabilities in spark libraries
> ---
>
> Key: SPARK-34511
> URL: https://issues.apache.org/jira/browse/SPARK-34511
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.1.1
>Reporter: eoin
>Priority: Major
>  Labels: security
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The following libraries have the following vulnerabilities that will fail 
> Nexus security scans. They are deemed as threats of level 7 and higher on the 
> Sonatype/Nexus scale. Many of them can be fixed by upgrading the dependencies 
> as they are fixed in subsequent releases.
>   
> [Update - still present]com.fasterxml.woodstox : woodstox-core : 5.0.3 * 
> [https://github.com/FasterXML/woodstox/issues/50]
>  * [https://github.com/FasterXML/woodstox/issues/51]
>  * [https://github.com/FasterXML/woodstox/issues/61]
> [Update - still present]com.nimbusds : nimbus-jose-jwt : 4.41.1 * 
> [https://bitbucket.org/connect2id/nimbus-jose-jwt/src/master/SECURITY-CHANGELOG.txt]
>  * [https://connect2id.com/blog/nimbus-jose-jwt-7-9]
> [Update - still present]Log4j : log4j : 1.2.17
>  SocketServer class that is vulnerable to deserialization of untrusted data: 
> * https://issues.apache.org/jira/browse/LOG4J2-1863
>  * 
> [https://lists.apache.org/thread.html/84cc4266238e057b95eb95dfd8b29d46a2592e7672c12c92f68b2917%40%3Cannounce.apache.org%3E]
>  * [https://bugzilla.redhat.com/show_bug.cgi?id=1785616]
>           Dynamic-link Library (DLL) Preloading:
>  * [https://bz.apache.org/bugzilla/show_bug.cgi?id=50323]
>  
> [Fixed]-apache-xerces : xercesImpl : 2.9.1 * hash table collisions -> 
> https://issues.apache.org/jira/browse/XERCESJ-1685-
>  * 
> -[https://mail-archives.apache.org/mod_mbox/xerces-j-dev/201410.mbox/%3cof3b40f5f7.e6552a8b-on85257d73.00699ed7-85257d73.006a9...@ca.ibm.com%3E]-
>  * [-https://bugzilla.redhat.com/show_bug.cgi?id=1019176-]
>  
> [Update - still present]com.fasterxml.jackson.core : jackson-databind : 
> 2.10.0 * [https://github.com/FasterXML/jackson-databind/issues/2589]
>  
> [Update - still present ]commons-beanutils : commons-beanutils : 1.9.3 * 
> [http://www.rapid7.com/db/modules/exploit/multi/http/struts_code_exec_classloader]
>  * https://issues.apache.org/jira/browse/BEANUTILS-463
>  
> [Update - still present ]commons-io : commons-io : 2.5 * 
> [https://github.com/apache/commons-io/pull/52]
>  * https://issues.apache.org/jira/browse/IO-556
>  * https://issues.apache.org/jira/browse/IO-559
>  
> [Upgraded to 4.1.51.Final still with vulnerabilities, see new below]-io.netty 
> : netty-all : 4.1.47.Final * [https://github.com/netty/netty/issues/10351]-
>  * [-https://github.com/netty/netty/pull/10560-]
>  
> [Update - still present]org.apache.commons : commons-compress : 1.18 * 
> [https://commons.apache.org/proper/commons-compress/security-reports.html#Apache_Commons_Compress_Security_Vulnerabilities]
>  
> [Update - changed to
> org.apache.hadoop : hadoop-hdfs-client : 3.2.0 see new below
> ]-org.apache.hadoop : hadoop-hdfs : 2.7.4 * 
> [https://lists.apache.org/thread.html/rca4516b00b55b347905df45e5d0432186248223f30497db87aba8710@%3Cannounce.apache.org%3E]-
>  * 
> -[https://lists.apache.org/thread.html/caacbbba2dcc1105163f76f3dfee5fbd22e0417e0783212787086378@%3Cgeneral.hadoop.apache.org%3E]-
>  * -[https://hadoop.apache.org/cve_list.html]-
>  * -[https://www.openwall.com/lists/oss-security/2019/01/24/3]-
>   --  
>  -org.apache.hadoop : hadoop-mapreduce-client-core : 2.7.4 * 
> [https://bugzilla.redhat.com/show_bug.cgi?id=1516399]-
>  * 
> -[https://lists.apache.org/thread.html/2e16689b44bdd1976b6368c143a4017fc7159d1f2d02a5d54fe9310f@%3Cgeneral.hadoop.apache.org%3E]-
>  
> [Update - still present]org.codehaus.jackson : jackson-mapper-asl : 1.9.13 * 
> [https://github.com/FasterXML/jackson-databind/issues/1599]
>  * [https://blog.sonatype.com/jackson-databind-remote-code-execution]
>  * [https://blog.sonatype.com/jackson-databind-the-end-of-the-blacklist]
>  * [https://bugzilla.redhat.com/show_bug.cgi?id=CVE-2017-7525]
>  * [https://access.redhat.com/security/cve/cve-2019-10172]
>  * [https://bugzilla.redhat.com/show_bug.cgi?id=1715075]
>  * [https://nvd.nist.gov/vuln/detail/CVE-2019-10172]
>  
> [Update - still present]org.eclipse.jetty : jetty-http : 9.3.24.v20180605: * 
> [https://bugs.eclipse.org/bugs/show_bug.cgi?id=538096]
>  
> [Update -still present]org.eclipse.jetty : 

[jira] [Commented] (SPARK-38061) security scan issue with htrace-core4-4.1.0-incubating

2022-02-09 Thread Abhinav Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489537#comment-17489537
 ] 

Abhinav Kumar commented on SPARK-38061:
---

[~hyukjin.kwon] [~sujitbiswas] Are we agreeing to track the vulnerability fix 
for htrace-core4-4.1.0-incubating (building it with Jackson 2.12.3 or later)? 
BTW, even 2.12.3 is showing up with a medium-criticality vulnerability - but 
that is a battle for another day.

Also, [~hyukjin.kwon] I was hoping to see if we can release another version of 
Spark, say 3.2.3, with vulnerability fixes. The issue is that we are using Spark 
in our company and management is getting concerned about these vulnerabilities. 
What do you think?

> security scan issue with htrace-core4-4.1.0-incubating
> --
>
> Key: SPARK-38061
> URL: https://issues.apache.org/jira/browse/SPARK-38061
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Security
>Affects Versions: 3.2.0, 3.2.1
>Reporter: Sujit Biswas
>Priority: Major
> Attachments: image-2022-02-03-08-02-29-071.png, 
> scan-security-report-spark-3.2.0-jre-11.csv, 
> scan-security-report-spark-3.2.1-jre-11.csv
>
>
> Hi,
> running into security scan issue with docker image built on 
> spark-3.2.0-bin-hadoop3.2, is there a way to resolve 
>  
> most issues related to https://issues.apache.org/jira/browse/HDFS-15333 
> attaching the CVE report
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-34511) Current Security vulnerabilities in spark libraries

2022-02-08 Thread Abhinav Kumar (Jira)


[ https://issues.apache.org/jira/browse/SPARK-34511 ]


Abhinav Kumar deleted comment on SPARK-34511:
---

was (Author: abhinavofficial):
Hi [~dongjoon] - We are still seeing json-smart 2.3 in the binaries that are 
distributed. Is it there by error? Maybe the code was not ported to 3.1.2 (which 
we are using) and also to 3.2.0 and 3.2.1. Could you please check?

> Current Security vulnerabilities in spark libraries
> ---
>
> Key: SPARK-34511
> URL: https://issues.apache.org/jira/browse/SPARK-34511
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.1.1
>Reporter: eoin
>Priority: Major
>  Labels: security
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The following libraries have the following vulnerabilities that will fail 
> Nexus security scans. They are deemed as threats of level 7 and higher on the 
> Sonatype/Nexus scale. Many of them can be fixed by upgrading the dependencies 
> as they are fixed in subsequent releases.
>   
> [Update - still present]com.fasterxml.woodstox : woodstox-core : 5.0.3 * 
> [https://github.com/FasterXML/woodstox/issues/50]
>  * [https://github.com/FasterXML/woodstox/issues/51]
>  * [https://github.com/FasterXML/woodstox/issues/61]
> [Update - still present]com.nimbusds : nimbus-jose-jwt : 4.41.1 * 
> [https://bitbucket.org/connect2id/nimbus-jose-jwt/src/master/SECURITY-CHANGELOG.txt]
>  * [https://connect2id.com/blog/nimbus-jose-jwt-7-9]
> [Update - still present]Log4j : log4j : 1.2.17
>  SocketServer class that is vulnerable to deserialization of untrusted data: 
> * https://issues.apache.org/jira/browse/LOG4J2-1863
>  * 
> [https://lists.apache.org/thread.html/84cc4266238e057b95eb95dfd8b29d46a2592e7672c12c92f68b2917%40%3Cannounce.apache.org%3E]
>  * [https://bugzilla.redhat.com/show_bug.cgi?id=1785616]
>           Dynamic-link Library (DLL) Preloading:
>  * [https://bz.apache.org/bugzilla/show_bug.cgi?id=50323]
>  
> [Fixed]-apache-xerces : xercesImpl : 2.9.1 * hash table collisions -> 
> https://issues.apache.org/jira/browse/XERCESJ-1685-
>  * 
> -[https://mail-archives.apache.org/mod_mbox/xerces-j-dev/201410.mbox/%3cof3b40f5f7.e6552a8b-on85257d73.00699ed7-85257d73.006a9...@ca.ibm.com%3E]-
>  * [-https://bugzilla.redhat.com/show_bug.cgi?id=1019176-]
>  
> [Update - still present]com.fasterxml.jackson.core : jackson-databind : 
> 2.10.0 * [https://github.com/FasterXML/jackson-databind/issues/2589]
>  
> [Update - still present ]commons-beanutils : commons-beanutils : 1.9.3 * 
> [http://www.rapid7.com/db/modules/exploit/multi/http/struts_code_exec_classloader]
>  * https://issues.apache.org/jira/browse/BEANUTILS-463
>  
> [Update - still present ]commons-io : commons-io : 2.5 * 
> [https://github.com/apache/commons-io/pull/52]
>  * https://issues.apache.org/jira/browse/IO-556
>  * https://issues.apache.org/jira/browse/IO-559
>  
> [Upgraded to 4.1.51.Final still with vulnerabilities, see new below]-io.netty 
> : netty-all : 4.1.47.Final * [https://github.com/netty/netty/issues/10351]-
>  * [-https://github.com/netty/netty/pull/10560-]
>  
> [Update - still present]org.apache.commons : commons-compress : 1.18 * 
> [https://commons.apache.org/proper/commons-compress/security-reports.html#Apache_Commons_Compress_Security_Vulnerabilities]
>  
> [Update - changed to
> org.apache.hadoop : hadoop-hdfs-client : 3.2.0 see new below
> ]-org.apache.hadoop : hadoop-hdfs : 2.7.4 * 
> [https://lists.apache.org/thread.html/rca4516b00b55b347905df45e5d0432186248223f30497db87aba8710@%3Cannounce.apache.org%3E]-
>  * 
> -[https://lists.apache.org/thread.html/caacbbba2dcc1105163f76f3dfee5fbd22e0417e0783212787086378@%3Cgeneral.hadoop.apache.org%3E]-
>  * -[https://hadoop.apache.org/cve_list.html]-
>  * -[https://www.openwall.com/lists/oss-security/2019/01/24/3]-
>   --  
>  -org.apache.hadoop : hadoop-mapreduce-client-core : 2.7.4 * 
> [https://bugzilla.redhat.com/show_bug.cgi?id=1516399]-
>  * 
> -[https://lists.apache.org/thread.html/2e16689b44bdd1976b6368c143a4017fc7159d1f2d02a5d54fe9310f@%3Cgeneral.hadoop.apache.org%3E]-
>  
> [Update - still present]org.codehaus.jackson : jackson-mapper-asl : 1.9.13 * 
> [https://github.com/FasterXML/jackson-databind/issues/1599]
>  * [https://blog.sonatype.com/jackson-databind-remote-code-execution]
>  * [https://blog.sonatype.com/jackson-databind-the-end-of-the-blacklist]
>  * [https://bugzilla.redhat.com/show_bug.cgi?id=CVE-2017-7525]
>  * [https://access.redhat.com/security/cve/cve-2019-10172]
>  * [https://bugzilla.redhat.com/show_bug.cgi?id=1715075]
>  * [https://nvd.nist.gov/vuln/detail/CVE-2019-10172]
>  
> [Update - still present]org.eclipse.jetty : jetty-http : 9.3.24.v20180605: * 
> [https://bugs.eclipse.org/bugs/show_bug.cgi?id=538096]
>  
> [Update -still present]org.eclipse.jetty : 

[jira] [Comment Edited] (SPARK-34511) Current Security vulnerabilities in spark libraries

2022-02-08 Thread Abhinav Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17488719#comment-17488719
 ] 

Abhinav Kumar edited comment on SPARK-34511 at 2/8/22, 10:07 AM:
-

Hi [~dongjoon] - We are still seeing json-smart 2.3 in the binaries that are 
distributed. Is it there by error? Maybe the code was not ported to 3.1.2 (which 
we are using) and also to 3.2.0 and 3.2.1. Could you please check?


was (Author: abhinavofficial):
Hi [~dongjoon] - We are still seeing json-smart 2.3 in the binaries that are 
distributed. Is it there by error?

> Current Security vulnerabilities in spark libraries
> ---
>
> Key: SPARK-34511
> URL: https://issues.apache.org/jira/browse/SPARK-34511
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.1.1
>Reporter: eoin
>Priority: Major
>  Labels: security
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The following libraries have the following vulnerabilities that will fail 
> Nexus security scans. They are deemed as threats of level 7 and higher on the 
> Sonatype/Nexus scale. Many of them can be fixed by upgrading the dependencies 
> as they are fixed in subsequent releases.
>   
> [Update - still present]com.fasterxml.woodstox : woodstox-core : 5.0.3 * 
> [https://github.com/FasterXML/woodstox/issues/50]
>  * [https://github.com/FasterXML/woodstox/issues/51]
>  * [https://github.com/FasterXML/woodstox/issues/61]
> [Update - still present]com.nimbusds : nimbus-jose-jwt : 4.41.1 * 
> [https://bitbucket.org/connect2id/nimbus-jose-jwt/src/master/SECURITY-CHANGELOG.txt]
>  * [https://connect2id.com/blog/nimbus-jose-jwt-7-9]
> [Update - still present]Log4j : log4j : 1.2.17
>  SocketServer class that is vulnerable to deserialization of untrusted data: 
> * https://issues.apache.org/jira/browse/LOG4J2-1863
>  * 
> [https://lists.apache.org/thread.html/84cc4266238e057b95eb95dfd8b29d46a2592e7672c12c92f68b2917%40%3Cannounce.apache.org%3E]
>  * [https://bugzilla.redhat.com/show_bug.cgi?id=1785616]
>           Dynamic-link Library (DLL) Preloading:
>  * [https://bz.apache.org/bugzilla/show_bug.cgi?id=50323]
>  
> [Fixed]-apache-xerces : xercesImpl : 2.9.1 * hash table collisions -> 
> https://issues.apache.org/jira/browse/XERCESJ-1685-
>  * 
> -[https://mail-archives.apache.org/mod_mbox/xerces-j-dev/201410.mbox/%3cof3b40f5f7.e6552a8b-on85257d73.00699ed7-85257d73.006a9...@ca.ibm.com%3E]-
>  * [-https://bugzilla.redhat.com/show_bug.cgi?id=1019176-]
>  
> [Update - still present]com.fasterxml.jackson.core : jackson-databind : 
> 2.10.0 * [https://github.com/FasterXML/jackson-databind/issues/2589]
>  
> [Update - still present ]commons-beanutils : commons-beanutils : 1.9.3 * 
> [http://www.rapid7.com/db/modules/exploit/multi/http/struts_code_exec_classloader]
>  * https://issues.apache.org/jira/browse/BEANUTILS-463
>  
> [Update - still present ]commons-io : commons-io : 2.5 * 
> [https://github.com/apache/commons-io/pull/52]
>  * https://issues.apache.org/jira/browse/IO-556
>  * https://issues.apache.org/jira/browse/IO-559
>  
> [Upgraded to 4.1.51.Final still with vulnerabilities, see new below]-io.netty 
> : netty-all : 4.1.47.Final * [https://github.com/netty/netty/issues/10351]-
>  * [-https://github.com/netty/netty/pull/10560-]
>  
> [Update - still present]org.apache.commons : commons-compress : 1.18 * 
> [https://commons.apache.org/proper/commons-compress/security-reports.html#Apache_Commons_Compress_Security_Vulnerabilities]
>  
> [Update - changed to
> org.apache.hadoop : hadoop-hdfs-client : 3.2.0 see new below
> ]-org.apache.hadoop : hadoop-hdfs : 2.7.4 * 
> [https://lists.apache.org/thread.html/rca4516b00b55b347905df45e5d0432186248223f30497db87aba8710@%3Cannounce.apache.org%3E]-
>  * 
> -[https://lists.apache.org/thread.html/caacbbba2dcc1105163f76f3dfee5fbd22e0417e0783212787086378@%3Cgeneral.hadoop.apache.org%3E]-
>  * -[https://hadoop.apache.org/cve_list.html]-
>  * -[https://www.openwall.com/lists/oss-security/2019/01/24/3]-
>   --  
>  -org.apache.hadoop : hadoop-mapreduce-client-core : 2.7.4 * 
> [https://bugzilla.redhat.com/show_bug.cgi?id=1516399]-
>  * 
> -[https://lists.apache.org/thread.html/2e16689b44bdd1976b6368c143a4017fc7159d1f2d02a5d54fe9310f@%3Cgeneral.hadoop.apache.org%3E]-
>  
> [Update - still present]org.codehaus.jackson : jackson-mapper-asl : 1.9.13 * 
> [https://github.com/FasterXML/jackson-databind/issues/1599]
>  * [https://blog.sonatype.com/jackson-databind-remote-code-execution]
>  * [https://blog.sonatype.com/jackson-databind-the-end-of-the-blacklist]
>  * [https://bugzilla.redhat.com/show_bug.cgi?id=CVE-2017-7525]
>  * [https://access.redhat.com/security/cve/cve-2019-10172]
>  * [https://bugzilla.redhat.com/show_bug.cgi?id=1715075]
>  * 

[jira] [Commented] (SPARK-34511) Current Security vulnerabilities in spark libraries

2022-02-08 Thread Abhinav Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17488719#comment-17488719
 ] 

Abhinav Kumar commented on SPARK-34511:
---

Hi [~dongjoon] - We are still seeing json-smart 2.3 in the binaries that are 
distributed. Is it there by error?

> Current Security vulnerabilities in spark libraries
> ---
>
> Key: SPARK-34511
> URL: https://issues.apache.org/jira/browse/SPARK-34511
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.1.1
>Reporter: eoin
>Priority: Major
>  Labels: security
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The following libraries have the following vulnerabilities that will fail 
> Nexus security scans. They are deemed as threats of level 7 and higher on the 
> Sonatype/Nexus scale. Many of them can be fixed by upgrading the dependencies 
> as they are fixed in subsequent releases.
>   
> [Update - still present]com.fasterxml.woodstox : woodstox-core : 5.0.3 * 
> [https://github.com/FasterXML/woodstox/issues/50]
>  * [https://github.com/FasterXML/woodstox/issues/51]
>  * [https://github.com/FasterXML/woodstox/issues/61]
> [Update - still present]com.nimbusds : nimbus-jose-jwt : 4.41.1 * 
> [https://bitbucket.org/connect2id/nimbus-jose-jwt/src/master/SECURITY-CHANGELOG.txt]
>  * [https://connect2id.com/blog/nimbus-jose-jwt-7-9]
> [Update - still present]Log4j : log4j : 1.2.17
>  SocketServer class that is vulnerable to deserialization of untrusted data: 
> * https://issues.apache.org/jira/browse/LOG4J2-1863
>  * 
> [https://lists.apache.org/thread.html/84cc4266238e057b95eb95dfd8b29d46a2592e7672c12c92f68b2917%40%3Cannounce.apache.org%3E]
>  * [https://bugzilla.redhat.com/show_bug.cgi?id=1785616]
>           Dynamic-link Library (DLL) Preloading:
>  * [https://bz.apache.org/bugzilla/show_bug.cgi?id=50323]
>  
> [Fixed]-apache-xerces : xercesImpl : 2.9.1 * hash table collisions -> 
> https://issues.apache.org/jira/browse/XERCESJ-1685-
>  * 
> -[https://mail-archives.apache.org/mod_mbox/xerces-j-dev/201410.mbox/%3cof3b40f5f7.e6552a8b-on85257d73.00699ed7-85257d73.006a9...@ca.ibm.com%3E]-
>  * [-https://bugzilla.redhat.com/show_bug.cgi?id=1019176-]
>  
> [Update - still present]com.fasterxml.jackson.core : jackson-databind : 
> 2.10.0 * [https://github.com/FasterXML/jackson-databind/issues/2589]
>  
> [Update - still present ]commons-beanutils : commons-beanutils : 1.9.3 * 
> [http://www.rapid7.com/db/modules/exploit/multi/http/struts_code_exec_classloader]
>  * https://issues.apache.org/jira/browse/BEANUTILS-463
>  
> [Update - still present ]commons-io : commons-io : 2.5 * 
> [https://github.com/apache/commons-io/pull/52]
>  * https://issues.apache.org/jira/browse/IO-556
>  * https://issues.apache.org/jira/browse/IO-559
>  
> [Upgraded to 4.1.51.Final still with vulnerabilities, see new below]-io.netty 
> : netty-all : 4.1.47.Final * [https://github.com/netty/netty/issues/10351]-
>  * [-https://github.com/netty/netty/pull/10560-]
>  
> [Update - still present]org.apache.commons : commons-compress : 1.18 * 
> [https://commons.apache.org/proper/commons-compress/security-reports.html#Apache_Commons_Compress_Security_Vulnerabilities]
>  
> [Update - changed to
> org.apache.hadoop : hadoop-hdfs-client : 3.2.0 see new below
> ]-org.apache.hadoop : hadoop-hdfs : 2.7.4 * 
> [https://lists.apache.org/thread.html/rca4516b00b55b347905df45e5d0432186248223f30497db87aba8710@%3Cannounce.apache.org%3E]-
>  * 
> -[https://lists.apache.org/thread.html/caacbbba2dcc1105163f76f3dfee5fbd22e0417e0783212787086378@%3Cgeneral.hadoop.apache.org%3E]-
>  * -[https://hadoop.apache.org/cve_list.html]-
>  * -[https://www.openwall.com/lists/oss-security/2019/01/24/3]-
>   --  
>  -org.apache.hadoop : hadoop-mapreduce-client-core : 2.7.4 * 
> [https://bugzilla.redhat.com/show_bug.cgi?id=1516399]-
>  * 
> -[https://lists.apache.org/thread.html/2e16689b44bdd1976b6368c143a4017fc7159d1f2d02a5d54fe9310f@%3Cgeneral.hadoop.apache.org%3E]-
>  
> [Update - still present]org.codehaus.jackson : jackson-mapper-asl : 1.9.13 * 
> [https://github.com/FasterXML/jackson-databind/issues/1599]
>  * [https://blog.sonatype.com/jackson-databind-remote-code-execution]
>  * [https://blog.sonatype.com/jackson-databind-the-end-of-the-blacklist]
>  * [https://bugzilla.redhat.com/show_bug.cgi?id=CVE-2017-7525]
>  * [https://access.redhat.com/security/cve/cve-2019-10172]
>  * [https://bugzilla.redhat.com/show_bug.cgi?id=1715075]
>  * [https://nvd.nist.gov/vuln/detail/CVE-2019-10172]
>  
> [Update - still present]org.eclipse.jetty : jetty-http : 9.3.24.v20180605: * 
> [https://bugs.eclipse.org/bugs/show_bug.cgi?id=538096]
>  
> [Update -still present]org.eclipse.jetty : jetty-webapp : 9.3.24.v20180605 * 
> 

[jira] [Commented] (SPARK-37936) Use error classes in the parsing errors of intervals

2022-01-21 Thread Senthil Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17480099#comment-17480099
 ] 

Senthil Kumar commented on SPARK-37936:
---

[~maxgekk], I have queries which match:

 
 * invalidIntervalFormError - "SELECT INTERVAL '1 DAY 2' HOUR"

 * fromToIntervalUnsupportedError - "SELECT extract(MONTH FROM INTERVAL 
'2021-11' YEAR TO DAY)"

It would be helpful if you could share queries for the scenarios below:
 * moreThanOneFromToUnitInIntervalLiteralError
 * invalidIntervalLiteralError

 * invalidFromToUnitValueError
 * mixedIntervalUnitsError

> Use error classes in the parsing errors of intervals
> 
>
> Key: SPARK-37936
> URL: https://issues.apache.org/jira/browse/SPARK-37936
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Modify the following methods in QueryParsingErrors:
>  * moreThanOneFromToUnitInIntervalLiteralError
>  * invalidIntervalLiteralError
>  * invalidIntervalFormError
>  * invalidFromToUnitValueError
>  * fromToIntervalUnsupportedError
>  * mixedIntervalUnitsError
> onto use error classes. Throw an implementation of SparkThrowable. Also write 
> a test per every error in QueryParsingErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37944) Use error classes in the execution errors of casting

2022-01-17 Thread Senthil Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477551#comment-17477551
 ] 

Senthil Kumar commented on SPARK-37944:
---

I will work on this

> Use error classes in the execution errors of casting
> 
>
> Key: SPARK-37944
> URL: https://issues.apache.org/jira/browse/SPARK-37944
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryExecutionErrors:
> * failedToCastValueToDataTypeForPartitionColumnError
> * invalidInputSyntaxForNumericError
> * cannotCastToDateTimeError
> * invalidInputSyntaxForBooleanError
> * nullLiteralsCannotBeCastedError
> onto use error classes. Throw an implementation of SparkThrowable. Also write 
> a test per every error in QueryExecutionErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37945) Use error classes in the execution errors of arithmetic ops

2022-01-17 Thread Senthil Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477550#comment-17477550
 ] 

Senthil Kumar commented on SPARK-37945:
---

I will work on this

> Use error classes in the execution errors of arithmetic ops
> ---
>
> Key: SPARK-37945
> URL: https://issues.apache.org/jira/browse/SPARK-37945
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryExecutionErrors:
> * overflowInSumOfDecimalError
> * overflowInIntegralDivideError
> * arithmeticOverflowError
> * unaryMinusCauseOverflowError
> * binaryArithmeticCauseOverflowError
> * unscaledValueTooLargeForPrecisionError
> * decimalPrecisionExceedsMaxPrecisionError
> * outOfDecimalTypeRangeError
> * integerOverflowError
> onto use error classes. Throw an implementation of SparkThrowable. Also write 
> a test per every error in QueryExecutionErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37940) Use error classes in the compilation errors of partitions

2022-01-17 Thread Senthil Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477549#comment-17477549
 ] 

Senthil Kumar commented on SPARK-37940:
---

I will work on this

> Use error classes in the compilation errors of partitions
> -
>
> Key: SPARK-37940
> URL: https://issues.apache.org/jira/browse/SPARK-37940
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryCompilationErrors:
> * unsupportedIfNotExistsError
> * nonPartitionColError
> * missingStaticPartitionColumn
> * alterV2TableSetLocationWithPartitionNotSupportedError
> * invalidPartitionSpecError
> * partitionNotSpecifyLocationUriError
> * describeDoesNotSupportPartitionForV2TablesError
> * tableDoesNotSupportPartitionManagementError
> * tableDoesNotSupportAtomicPartitionManagementError
> * alterTableRecoverPartitionsNotSupportedForV2TablesError
> * partitionColumnNotSpecifiedError
> * invalidPartitionColumnError
> * multiplePartitionColumnValuesSpecifiedError
> * cannotUseDataTypeForPartitionColumnError
> * cannotUseAllColumnsForPartitionColumnsError
> * partitionColumnNotFoundInSchemaError
> * mismatchedTablePartitionColumnError
> onto use error classes. Throw an implementation of SparkThrowable. Also write 
> a test per every error in QueryCompilationErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37939) Use error classes in the parsing errors of properties

2022-01-17 Thread Senthil Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477548#comment-17477548
 ] 

Senthil Kumar commented on SPARK-37939:
---

I will work on this

> Use error classes in the parsing errors of properties
> -
>
> Key: SPARK-37939
> URL: https://issues.apache.org/jira/browse/SPARK-37939
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryParsingErrors:
> * cannotCleanReservedNamespacePropertyError
> * cannotCleanReservedTablePropertyError
> * invalidPropertyKeyForSetQuotedConfigurationError
> * invalidPropertyValueForSetQuotedConfigurationError
> * propertiesAndDbPropertiesBothSpecifiedError
> onto use error classes. Throw an implementation of SparkThrowable. Also write 
> a test per every error in QueryParsingErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37936) Use error classes in the parsing errors of intervals

2022-01-17 Thread Senthil Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477544#comment-17477544
 ] 

Senthil Kumar commented on SPARK-37936:
---

Working on this

> Use error classes in the parsing errors of intervals
> 
>
> Key: SPARK-37936
> URL: https://issues.apache.org/jira/browse/SPARK-37936
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Modify the following methods in QueryParsingErrors:
>  * moreThanOneFromToUnitInIntervalLiteralError
>  * invalidIntervalLiteralError
>  * invalidIntervalFormError
>  * invalidFromToUnitValueError
>  * fromToIntervalUnsupportedError
>  * mixedIntervalUnitsError
> onto use error classes. Throw an implementation of SparkThrowable. Also write 
> a test per every error in QueryParsingErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37475) Add Scale Parameter to Floor and Ceil functions

2021-11-27 Thread Sathiya Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sathiya Kumar updated SPARK-37475:
--
Description: 
This feature is proposed in the PR : https://github.com/apache/spark/pull/34593

Currently we support the Decimal RoundingModes HALF_UP (round) and HALF_EVEN 
(bround), but we have use cases that need RoundingMode.UP and 
RoundingMode.DOWN.

[https://stackoverflow.com/questions/34888419/round-down-double-in-spark/40476117]

[https://stackoverflow.com/questions/54683066/is-there-a-rounddown-function-in-sql-as-there-is-in-excel]

[https://stackoverflow.com/questions/48279641/oracle-sql-round-half]

 

The Floor and Ceil functions help here, but they do not support specifying the 
position of the rounding. Adding a scale parameter to these functions would let 
us control the rounding position.

 

Snowflake supports a `scale` parameter for `floor`/`ceil`:
{code:java}
FLOOR( <input_expr> [, <scale_expr> ] ){code}
REF:

[https://docs.snowflake.com/en/sql-reference/functions/floor.html]
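Until such a parameter exists, a common workaround is to shift the decimal point 
manually. A small sketch, assuming a DataFrame df with a numeric column named 
value (both names are just examples):
{code:java}
import org.apache.spark.sql.functions.{ceil, col, floor}

// Emulate FLOOR(value, 2) / CEIL(value, 2) by scaling before and after rounding.
val result = df
  .withColumn("value_floor_2", floor(col("value") * 100) / 100)
  .withColumn("value_ceil_2",  ceil(col("value") * 100) / 100)
{code}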

 

 

 

 

  was:
Currently we support Decimal RoundingModes : HALF_UP (round) and HALF_EVEN 
(bround). But we have use cases that needs RoundingMode.UP and 
RoundingMode.DOWN.

[https://stackoverflow.com/questions/34888419/round-down-double-in-spark/40476117]

[https://stackoverflow.com/questions/54683066/is-there-a-rounddown-function-in-sql-as-there-is-in-excel]

[https://stackoverflow.com/questions/48279641/oracle-sql-round-half]

 

Floor and Ceil functions helps to do this but it doesn't support the position 
of the rounding. Adding scale parameter to the functions would help us control 
the rounding positions. 

 

Snowflake supports `scale` parameter to `floor`/`ceil` :
{code:java}
FLOOR(  [,  ] ){code}
REF:

[https://docs.snowflake.com/en/sql-reference/functions/floor.html]

 

 

 

 


> Add Scale Parameter to Floor and Ceil functions
> ---
>
> Key: SPARK-37475
> URL: https://issues.apache.org/jira/browse/SPARK-37475
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Sathiya Kumar
>Priority: Minor
>
> This feature is proposed in the PR : 
> https://github.com/apache/spark/pull/34593
> Currently we support Decimal RoundingModes : HALF_UP (round) and HALF_EVEN 
> (bround). But we have use cases that needs RoundingMode.UP and 
> RoundingMode.DOWN.
> [https://stackoverflow.com/questions/34888419/round-down-double-in-spark/40476117]
> [https://stackoverflow.com/questions/54683066/is-there-a-rounddown-function-in-sql-as-there-is-in-excel]
> [https://stackoverflow.com/questions/48279641/oracle-sql-round-half]
>  
> Floor and Ceil functions helps to do this but it doesn't support the position 
> of the rounding. Adding scale parameter to the functions would help us 
> control the rounding positions. 
>  
> Snowflake supports `scale` parameter to `floor`/`ceil` :
> {code:java}
> FLOOR(  [,  ] ){code}
> REF:
> [https://docs.snowflake.com/en/sql-reference/functions/floor.html]
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37475) Add Scale Parameter to Floor and Ceil functions

2021-11-27 Thread Sathiya Kumar (Jira)
Sathiya Kumar created SPARK-37475:
-

 Summary: Add Scale Parameter to Floor and Ceil functions
 Key: SPARK-37475
 URL: https://issues.apache.org/jira/browse/SPARK-37475
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.2.0
Reporter: Sathiya Kumar


Currently we support the Decimal rounding modes HALF_UP (round) and HALF_EVEN 
(bround), but we have use cases that need RoundingMode.UP and RoundingMode.DOWN.

[https://stackoverflow.com/questions/34888419/round-down-double-in-spark/40476117]

[https://stackoverflow.com/questions/54683066/is-there-a-rounddown-function-in-sql-as-there-is-in-excel]

[https://stackoverflow.com/questions/48279641/oracle-sql-round-half]

 

The Floor and Ceil functions help here, but they do not support specifying the 
position of the rounding. Adding a scale parameter to these functions would let 
us control the rounding position.

 

Snowflake supports a `scale` parameter for `floor`/`ceil`:
{code:java}
FLOOR( <input_expr> [, <scale_expr> ] ){code}
REF:

[https://docs.snowflake.com/en/sql-reference/functions/floor.html]

 

 

 

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37433) TimeZoneAwareExpression throws NoSuchElementException: None.get on expr.eval()

2021-11-21 Thread Sathiya Kumar (Jira)
Sathiya Kumar created SPARK-37433:
-

 Summary: TimeZoneAwareExpression throws NoSuchElementException: 
None.get on expr.eval()
 Key: SPARK-37433
 URL: https://issues.apache.org/jira/browse/SPARK-37433
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: Sathiya Kumar


TimeZoneAwareExpression implementations such as hour and date_format throw 
NoSuchElementException: None.get when evaluated directly via expr.eval():

*hour(current_timestamp).expr.eval()*

*date_format(current_timestamp, "dd").expr.eval()*

 
{code:java}
java.util.NoSuchElementException: None.get
  at scala.None$.get(Option.scala:529)
  at scala.None$.get(Option.scala:527)
  at 
org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId(datetimeExpressions.scala:53)
  at 
org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId$(datetimeExpressions.scala:53)
  at 
org.apache.spark.sql.catalyst.expressions.DateFormatClass.zoneId$lzycompute(datetimeExpressions.scala:772)
  at 
org.apache.spark.sql.catalyst.expressions.DateFormatClass.zoneId(datetimeExpressions.scala:772)
  at 
org.apache.spark.sql.catalyst.expressions.TimestampFormatterHelper.getFormatter(datetimeExpressions.scala:70)
  at 
org.apache.spark.sql.catalyst.expressions.TimestampFormatterHelper.getFormatter$(datetimeExpressions.scala:67)
  at 
org.apache.spark.sql.catalyst.expressions.DateFormatClass.getFormatter(datetimeExpressions.scala:772)
  at 
org.apache.spark.sql.catalyst.expressions.TimestampFormatterHelper.$anonfun$formatterOption$1(datetimeExpressions.scala:64)
{code}
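One way to avoid the error is to let the analyzer resolve the time zone instead of 
calling eval() on an unresolved expression, for example by evaluating through the 
DataFrame API. A sketch (not the only possible fix):
{code:java}
import org.apache.spark.sql.functions.{current_timestamp, date_format, hour}

// Going through the DataFrame API lets the analyzer fill in the session time zone,
// so the same expressions no longer hit None.get.
spark.range(1)
  .select(hour(current_timestamp()), date_format(current_timestamp(), "dd"))
  .show()
{code}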
 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data

2021-11-17 Thread Siddharth Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17445513#comment-17445513
 ] 

Siddharth Kumar edited comment on SPARK-18105 at 11/17/21, 10:28 PM:
-

Hi, I saw a failure similar to the one [~vladimir.prus] reported. In my experiment, 
I enabled node decommissioning along with a decommission fallback storage, and 
then terminated an executor while the shuffle blocks were being fetched. Node 
decommissioning begins for the lost executor and migrates all of its shuffle 
blocks to peer executors. Post migration, when the shuffle blocks are fetched, I 
see the "FetchFailedException: Stream is corrupted" and "Error decoding offset 
19258 of input buffer" messages seen in this thread. The error goes away when I 
do not add the fallback storage option.

These were the options I set in my experiment:
{code:java}
spark.decommission.enabled: true,
spark.storage.decommission.shuffleBlocks.enabled : true,
spark.storage.decommission.enabled: true,
spark.storage.decommission.fallbackStorage.path : s3:///# Stopped 
seeing errors after removing this{code}


was (Author: JIRAUSER280416):
Hi, I saw a similar failure just as [~vladimir.prus]. In my experiment, I 
enabled node decommissioning along with a decommission fallback storage and 
then terminated an executor while the shuffle blocks are being fetched. The 
node decommissioning begins for the lost executor and migrates all the shuffle 
blocks to peer executors. Post migration, when the shuffle blocks are being 
fetched, I see the "FetchFailedException: Stream is corrupted" and "Error 
decoding offset 19258 of input buffer" message as seen in this thread. The 
error goes away when I do not add the fallback storage option.

These were the options I set in my experiment
{code:java}
spark.decommission.enabled: true,
spark.storage.decommission.shuffleBlocks.enabled : true,
spark.storage.decommission.enabled: true,
spark.storage.decommission.fallbackStorage.path : s3:///# Stopped 
seeing errors after removing this{code}
 

 

> LZ4 failed to decompress a stream of shuffled data
> --
>
> Key: SPARK-18105
> URL: https://issues.apache.org/jira/browse/SPARK-18105
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1, 3.1.1
>Reporter: Davies Liu
>Priority: Major
> Attachments: TestWeightedGraph.java
>
>
> When lz4 is used to compress the shuffle files, it may fail to decompress it 
> as "stream is corrupt"
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 92 in stage 5.0 failed 4 times, most recent failure: Lost task 92.3 in 
> stage 5.0 (TID 16616, 10.0.27.18): java.io.IOException: Stream is corrupted
>   at 
> org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:220)
>   at 
> org.apache.spark.io.LZ4BlockInputStream.available(LZ4BlockInputStream.java:109)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:353)
>   at java.io.DataInputStream.read(DataInputStream.java:149)
>   at com.google.common.io.ByteStreams.read(ByteStreams.java:828)
>   at com.google.common.io.ByteStreams.readFully(ByteStreams.java:695)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
>   at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
>   at 
> org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:397)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> 

[jira] [Commented] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data

2021-11-17 Thread Siddharth Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17445513#comment-17445513
 ] 

Siddharth Kumar commented on SPARK-18105:
-

Hi, I saw a failure similar to the one [~vladimir.prus] reported. In my experiment, 
I enabled node decommissioning along with a decommission fallback storage, and 
then terminated an executor while the shuffle blocks were being fetched. Node 
decommissioning begins for the lost executor and migrates all of its shuffle 
blocks to peer executors. Post migration, when the shuffle blocks are fetched, I 
see the "FetchFailedException: Stream is corrupted" and "Error decoding offset 
19258 of input buffer" messages seen in this thread. The error goes away when I 
do not add the fallback storage option.

These were the options I set in my experiment:
{code:java}
spark.decommission.enabled: true,
spark.storage.decommission.shuffleBlocks.enabled : true,
spark.storage.decommission.enabled: true,
spark.storage.decommission.fallbackStorage.path : s3:///# Stopped 
seeing errors after removing this{code}
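For reference, a sketch of how the same options could be set programmatically when 
building the session; the fallback-storage path below is a placeholder, not a real 
bucket:
{code:java}
import org.apache.spark.sql.SparkSession

// Same decommission settings as above, supplied via the builder (sketch only).
val spark = SparkSession.builder()
  .config("spark.decommission.enabled", "true")
  .config("spark.storage.decommission.enabled", "true")
  .config("spark.storage.decommission.shuffleBlocks.enabled", "true")
  .config("spark.storage.decommission.fallbackStorage.path",
    "s3a://some-bucket/spark-fallback/")  // placeholder path
  .getOrCreate()
{code}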
 

 

> LZ4 failed to decompress a stream of shuffled data
> --
>
> Key: SPARK-18105
> URL: https://issues.apache.org/jira/browse/SPARK-18105
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1, 3.1.1
>Reporter: Davies Liu
>Priority: Major
> Attachments: TestWeightedGraph.java
>
>
> When lz4 is used to compress the shuffle files, it may fail to decompress it 
> as "stream is corrupt"
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 92 in stage 5.0 failed 4 times, most recent failure: Lost task 92.3 in 
> stage 5.0 (TID 16616, 10.0.27.18): java.io.IOException: Stream is corrupted
>   at 
> org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:220)
>   at 
> org.apache.spark.io.LZ4BlockInputStream.available(LZ4BlockInputStream.java:109)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:353)
>   at java.io.DataInputStream.read(DataInputStream.java:149)
>   at com.google.common.io.ByteStreams.read(ByteStreams.java:828)
>   at com.google.common.io.ByteStreams.readFully(ByteStreams.java:695)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
>   at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
>   at 
> org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:397)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> https://github.com/jpountz/lz4-java/issues/89



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37324) Support Decimal RoundingMode.UP, DOWN, HALF_DOWN

2021-11-15 Thread Sathiya Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sathiya Kumar updated SPARK-37324:
--
Description: 
Currently we support only the Decimal rounding modes HALF_UP (round) and HALF_EVEN 
(bround), but we have use cases that need RoundingMode.UP and RoundingMode.DOWN. 
In our projects we use a UDF; I also see a few people doing complex operations to 
achieve the same with Spark native methods.

[https://stackoverflow.com/questions/34888419/round-down-double-in-spark/40476117]

[https://stackoverflow.com/questions/54683066/is-there-a-rounddown-function-in-sql-as-there-is-in-excel]

[https://stackoverflow.com/questions/48279641/oracle-sql-round-half]

 

Opening support for the other rounding modes might interest a lot of use cases. 
*SAP HANA SQL's ROUND function does it:*
{code:java}
ROUND( <number> [, <position> [, <rounding_mode> ]] ){code}
REF : 
[https://help.sap.com/viewer/7c78579ce9b14a669c1f3295b0d8ca16/Cloud/en-US/20e6a27575191014bd54a07fd86c585d.html]


*Sql Server does something similar to this* :
{code:java}
ROUND ( numeric_expression , length [ ,function ] ){code}
REF : 
[https://docs.microsoft.com/en-us/sql/t-sql/functions/round-transact-sql?view=sql-server-ver15]
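As a point of comparison, the UDF workaround mentioned above might look roughly 
like this; the UDF name, column name, and scale are illustrative:
{code:java}
import java.math.{BigDecimal => JBigDecimal, RoundingMode}
import org.apache.spark.sql.functions.{col, lit, udf}

// Illustrative sketch of a RoundingMode.DOWN UDF; names are examples only.
val roundDown = udf((x: Double, scale: Int) =>
  JBigDecimal.valueOf(x).setScale(scale, RoundingMode.DOWN).doubleValue)

val result = df.withColumn("amount_down", roundDown(col("amount"), lit(2)))
{code}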
 

 

  was:
Currently we support only Decimal RoundingModes : HALF_UP (round) and HALF_EVEN 
(bround). But we have use cases that needs RoundingMode.UP and 
RoundingMode.DOWN. In our projects we use UDF, i also see few people do complex 
operations to do the same with spark native methods.

[https://stackoverflow.com/questions/34888419/round-down-double-in-spark/40476117]

[https://stackoverflow.com/questions/54683066/is-there-a-rounddown-function-in-sql-as-there-is-in-excel]

[https://stackoverflow.com/questions/48279641/oracle-sql-round-half]

 

Opening support for the other rounding modes might interest a lot of use cases. 
Sql Server does something similar to this : 
[https://docs.microsoft.com/en-us/sql/t-sql/functions/round-transact-sql?view=sql-server-ver15]
 


> Support Decimal RoundingMode.UP, DOWN, HALF_DOWN
> 
>
> Key: SPARK-37324
> URL: https://issues.apache.org/jira/browse/SPARK-37324
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Sathiya Kumar
>Priority: Minor
>
> Currently we support only Decimal RoundingModes : HALF_UP (round) and 
> HALF_EVEN (bround). But we have use cases that needs RoundingMode.UP and 
> RoundingMode.DOWN. In our projects we use UDF, i also see few people do 
> complex operations to do the same with spark native methods.
> [https://stackoverflow.com/questions/34888419/round-down-double-in-spark/40476117]
> [https://stackoverflow.com/questions/54683066/is-there-a-rounddown-function-in-sql-as-there-is-in-excel]
> [https://stackoverflow.com/questions/48279641/oracle-sql-round-half]
>  
> Opening support for the other rounding modes might interest a lot of use 
> cases. 
> *SAP Hana Sql ROUND function does it :* 
> {code:java}
> ROUND( [,  [, ]]){code}
> REF : 
> [https://help.sap.com/viewer/7c78579ce9b14a669c1f3295b0d8ca16/Cloud/en-US/20e6a27575191014bd54a07fd86c585d.html]
> *Sql Server does something similar to this* :
> {code:java}
> ROUND ( numeric_expression , length [ ,function ] ){code}
> REF : 
> [https://docs.microsoft.com/en-us/sql/t-sql/functions/round-transact-sql?view=sql-server-ver15]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37324) Support Decimal RoundingMode.UP, DOWN, HALF_DOWN

2021-11-14 Thread Sathiya Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sathiya Kumar updated SPARK-37324:
--
Description: 
Currently we support only Decimal RoundingModes : HALF_UP (round) and HALF_EVEN 
(bround). But we have use cases that needs RoundingMode.UP and 
RoundingMode.DOWN. In our projects we use UDF, i also see few people do complex 
operations to do the same with spark native methods.

[https://stackoverflow.com/questions/34888419/round-down-double-in-spark/40476117]

[https://stackoverflow.com/questions/54683066/is-there-a-rounddown-function-in-sql-as-there-is-in-excel]

[https://stackoverflow.com/questions/48279641/oracle-sql-round-half]

 

Opening support for the other rounding modes might interest a lot of use cases. 
Sql Server does something similar to this : 
[https://docs.microsoft.com/en-us/sql/t-sql/functions/round-transact-sql?view=sql-server-ver15]
 

  was:Currently we have support for 


> Support Decimal RoundingMode.UP, DOWN, HALF_DOWN
> 
>
> Key: SPARK-37324
> URL: https://issues.apache.org/jira/browse/SPARK-37324
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Sathiya Kumar
>Priority: Minor
>
> Currently we support only Decimal RoundingModes : HALF_UP (round) and 
> HALF_EVEN (bround). But we have use cases that needs RoundingMode.UP and 
> RoundingMode.DOWN. In our projects we use UDF, i also see few people do 
> complex operations to do the same with spark native methods.
> [https://stackoverflow.com/questions/34888419/round-down-double-in-spark/40476117]
> [https://stackoverflow.com/questions/54683066/is-there-a-rounddown-function-in-sql-as-there-is-in-excel]
> [https://stackoverflow.com/questions/48279641/oracle-sql-round-half]
>  
> Opening support for the other rounding modes might interest a lot of use 
> cases. Sql Server does something similar to this : 
> [https://docs.microsoft.com/en-us/sql/t-sql/functions/round-transact-sql?view=sql-server-ver15]
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37324) Support Decimal RoundingMode.UP, DOWN, HALF_DOWN

2021-11-14 Thread Sathiya Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sathiya Kumar updated SPARK-37324:
--
Description: Currently we have support for 

> Support Decimal RoundingMode.UP, DOWN, HALF_DOWN
> 
>
> Key: SPARK-37324
> URL: https://issues.apache.org/jira/browse/SPARK-37324
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Sathiya Kumar
>Priority: Minor
>
> Currently we have support for 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37324) Support Decimal RoundingMode.UP, DOWN, HALF_DOWN

2021-11-14 Thread Sathiya Kumar (Jira)
Sathiya Kumar created SPARK-37324:
-

 Summary: Support Decimal RoundingMode.UP, DOWN, HALF_DOWN
 Key: SPARK-37324
 URL: https://issues.apache.org/jira/browse/SPARK-37324
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.2.0
Reporter: Sathiya Kumar






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36996) fixing "SQL column nullable setting not retained as part of spark read" issue

2021-10-13 Thread Senthil Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17428141#comment-17428141
 ] 

Senthil Kumar commented on SPARK-36996:
---

Sample output after these changes:

SQL :

mysql> CREATE TABLE Persons(Id int NOT NULL, FirstName varchar(255), LastName 
varchar(255), Age int);

 

mysql> desc Persons;
+-----------+--------------+------+-----+---------+-------+
| Field     | Type         | Null | Key | Default | Extra |
+-----------+--------------+------+-----+---------+-------+
| Id        | int          | NO   |     | NULL    |       |
| FirstName | varchar(255) | YES  |     | NULL    |       |
| LastName  | varchar(255) | YES  |     | NULL    |       |
| Age       | int          | YES  |     | NULL    |       |
+-----------+--------------+------+-----+---------+-------+


Spark:

scala> val df = 
spark.read.format("jdbc").option("database","Test_DB").option("user", 
"root").option("password", "").option("driver", 
"com.mysql.cj.jdbc.Driver").option("url", 
"jdbc:mysql://localhost:3306/Test_DB").option("dbtable", "Persons").load()
 df: org.apache.spark.sql.DataFrame = [Id: int, FirstName: string ... 2 more 
fields]

scala> df.printSchema()
 root
 |-- Id: integer (nullable = false)
 |-- FirstName: string (nullable = true)
 |-- LastName: string (nullable = true)
 |-- Age: integer (nullable = true)

 

 

And for TIMESTAMP columns

 

SQL:
create table timestamp_test(id int(11), time_stamp timestamp not null default 
current_timestamp);

SPARK:

scala> val df = 
spark.read.format("jdbc").option("database","Test_DB").option("user", 
"root").option("password", "").option("driver", 
"com.mysql.cj.jdbc.Driver").option("url", 
"jdbc:mysql://localhost:3306/Test_DB").option("dbtable", 
"timestamp_test").load()
df: org.apache.spark.sql.DataFrame = [id: int, time_stamp: timestamp]

scala> df.printSchema()
root
|-- id: integer (nullable = true)
|-- time_stamp: timestamp (nullable = true)

> fixing "SQL column nullable setting not retained as part of spark read" issue
> -
>
> Key: SPARK-36996
> URL: https://issues.apache.org/jira/browse/SPARK-36996
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.1.1, 3.1.2
>Reporter: Senthil Kumar
>Priority: Major
>
> Sql 'nullable' columns are not retaining 'nullable' type as it is while 
> reading from Spark read using jdbc format.
>  
> SQL :
> 
>  
> mysql> CREATE TABLE Persons(Id int NOT NULL, FirstName varchar(255), LastName 
> varchar(255), Age int);
>  
> mysql> desc Persons;
> +---+--+--+-+-+---+
> | Field | Type | Null | Key | Default | Extra |
> +---+--+--+-+-+---+
> | Id | int | NO | | NULL | |
> | FirstName | varchar(255) | YES | | NULL | |
> | LastName | varchar(255) | YES | | NULL | |
> | Age | int | YES | | NULL | |
> +---+--+--+-+-+---+
>  
> But in Spark  we get all the columns as "Nullable":
> =
> scala> val df = 
> spark.read.format("jdbc").option("database","Test_DB").option("user", 
> "root").option("password", "").option("driver", 
> "com.mysql.cj.jdbc.Driver").option("url", 
> "jdbc:mysql://localhost:3306/Test_DB").option("dbtable", "Persons").load()
> df: org.apache.spark.sql.DataFrame = [Id: int, FirstName: string ... 2 more 
> fields]
> scala> df.printSchema()
> root
>  |-- Id: integer (nullable = true)
>  |-- FirstName: string (nullable = true)
>  |-- LastName: string (nullable = true)
>  |-- Age: integer (nullable = true)
> =
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36996) fixing "SQL column nullable setting not retained as part of spark read" issue

2021-10-13 Thread Senthil Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17428140#comment-17428140
 ] 

Senthil Kumar commented on SPARK-36996:
---

We need to consider 2 scenarios:

 # maintain the NULLABLE value as per the SQL metadata for non-timestamp columns
 # set NULLABLE to true (always) for timestamp columns

 

 

> fixing "SQL column nullable setting not retained as part of spark read" issue
> -
>
> Key: SPARK-36996
> URL: https://issues.apache.org/jira/browse/SPARK-36996
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.1.1, 3.1.2
>Reporter: Senthil Kumar
>Priority: Major
>
> Sql 'nullable' columns are not retaining 'nullable' type as it is while 
> reading from Spark read using jdbc format.
>  
> SQL :
> 
>  
> mysql> CREATE TABLE Persons(Id int NOT NULL, FirstName varchar(255), LastName 
> varchar(255), Age int);
>  
> mysql> desc Persons;
> +---+--+--+-+-+---+
> | Field | Type | Null | Key | Default | Extra |
> +---+--+--+-+-+---+
> | Id | int | NO | | NULL | |
> | FirstName | varchar(255) | YES | | NULL | |
> | LastName | varchar(255) | YES | | NULL | |
> | Age | int | YES | | NULL | |
> +---+--+--+-+-+---+
>  
> But in Spark  we get all the columns as "Nullable":
> =
> scala> val df = 
> spark.read.format("jdbc").option("database","Test_DB").option("user", 
> "root").option("password", "").option("driver", 
> "com.mysql.cj.jdbc.Driver").option("url", 
> "jdbc:mysql://localhost:3306/Test_DB").option("dbtable", "Persons").load()
> df: org.apache.spark.sql.DataFrame = [Id: int, FirstName: string ... 2 more 
> fields]
> scala> df.printSchema()
> root
>  |-- Id: integer (nullable = true)
>  |-- FirstName: string (nullable = true)
>  |-- LastName: string (nullable = true)
>  |-- Age: integer (nullable = true)
> =
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36996) fixing "SQL column nullable setting not retained as part of spark read" issue

2021-10-13 Thread Senthil Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17428104#comment-17428104
 ] 

Senthil Kumar commented on SPARK-36996:
---

Based on further analysis, Spark always hard-codes "nullable" as "true". This 
change was included due to https://issues.apache.org/jira/browse/SPARK-19726.

 

 

> fixing "SQL column nullable setting not retained as part of spark read" issue
> -
>
> Key: SPARK-36996
> URL: https://issues.apache.org/jira/browse/SPARK-36996
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.1.1, 3.1.2
>Reporter: Senthil Kumar
>Priority: Major
>
> Sql 'nullable' columns are not retaining 'nullable' type as it is while 
> reading from Spark read using jdbc format.
>  
> SQL :
> 
>  
> mysql> CREATE TABLE Persons(Id int NOT NULL, FirstName varchar(255), LastName 
> varchar(255), Age int);
>  
> mysql> desc Persons;
> +---+--+--+-+-+---+
> | Field | Type | Null | Key | Default | Extra |
> +---+--+--+-+-+---+
> | Id | int | NO | | NULL | |
> | FirstName | varchar(255) | YES | | NULL | |
> | LastName | varchar(255) | YES | | NULL | |
> | Age | int | YES | | NULL | |
> +---+--+--+-+-+---+
>  
> But in Spark  we get all the columns as "Nullable":
> =
> scala> val df = 
> spark.read.format("jdbc").option("database","Test_DB").option("user", 
> "root").option("password", "").option("driver", 
> "com.mysql.cj.jdbc.Driver").option("url", 
> "jdbc:mysql://localhost:3306/Test_DB").option("dbtable", "Persons").load()
> df: org.apache.spark.sql.DataFrame = [Id: int, FirstName: string ... 2 more 
> fields]
> scala> df.printSchema()
> root
>  |-- Id: integer (nullable = true)
>  |-- FirstName: string (nullable = true)
>  |-- LastName: string (nullable = true)
>  |-- Age: integer (nullable = true)
> =
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36996) fixing "SQL column nullable setting not retained as part of spark read" issue

2021-10-13 Thread Senthil Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17428105#comment-17428105
 ] 

Senthil Kumar commented on SPARK-36996:
---

I m working on this

> fixing "SQL column nullable setting not retained as part of spark read" issue
> -
>
> Key: SPARK-36996
> URL: https://issues.apache.org/jira/browse/SPARK-36996
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.1.1, 3.1.2
>Reporter: Senthil Kumar
>Priority: Major
>
> Sql 'nullable' columns are not retaining 'nullable' type as it is while 
> reading from Spark read using jdbc format.
>  
> SQL :
> 
>  
> mysql> CREATE TABLE Persons(Id int NOT NULL, FirstName varchar(255), LastName 
> varchar(255), Age int);
>  
> mysql> desc Persons;
> +---+--+--+-+-+---+
> | Field | Type | Null | Key | Default | Extra |
> +---+--+--+-+-+---+
> | Id | int | NO | | NULL | |
> | FirstName | varchar(255) | YES | | NULL | |
> | LastName | varchar(255) | YES | | NULL | |
> | Age | int | YES | | NULL | |
> +---+--+--+-+-+---+
>  
> But in Spark  we get all the columns as "Nullable":
> =
> scala> val df = 
> spark.read.format("jdbc").option("database","Test_DB").option("user", 
> "root").option("password", "").option("driver", 
> "com.mysql.cj.jdbc.Driver").option("url", 
> "jdbc:mysql://localhost:3306/Test_DB").option("dbtable", "Persons").load()
> df: org.apache.spark.sql.DataFrame = [Id: int, FirstName: string ... 2 more 
> fields]
> scala> df.printSchema()
> root
>  |-- Id: integer (nullable = true)
>  |-- FirstName: string (nullable = true)
>  |-- LastName: string (nullable = true)
>  |-- Age: integer (nullable = true)
> =
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36996) fixing "SQL column nullable setting not retained as part of spark read" issue

2021-10-13 Thread Senthil Kumar (Jira)
Senthil Kumar created SPARK-36996:
-

 Summary: fixing "SQL column nullable setting not retained as part 
of spark read" issue
 Key: SPARK-36996
 URL: https://issues.apache.org/jira/browse/SPARK-36996
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.2, 3.1.1, 3.1.0, 3.0.0
Reporter: Senthil Kumar


SQL 'nullable' column settings are not retained as-is when reading with Spark 
using the jdbc format.

 

SQL :



 

mysql> CREATE TABLE Persons(Id int NOT NULL, FirstName varchar(255), LastName 
varchar(255), Age int);

 

mysql> desc Persons;
+-----------+--------------+------+-----+---------+-------+
| Field     | Type         | Null | Key | Default | Extra |
+-----------+--------------+------+-----+---------+-------+
| Id        | int          | NO   |     | NULL    |       |
| FirstName | varchar(255) | YES  |     | NULL    |       |
| LastName  | varchar(255) | YES  |     | NULL    |       |
| Age       | int          | YES  |     | NULL    |       |
+-----------+--------------+------+-----+---------+-------+

 

But in Spark  we get all the columns as "Nullable":

=

scala> val df = 
spark.read.format("jdbc").option("database","Test_DB").option("user", 
"root").option("password", "").option("driver", 
"com.mysql.cj.jdbc.Driver").option("url", 
"jdbc:mysql://localhost:3306/Test_DB").option("dbtable", "Persons").load()
df: org.apache.spark.sql.DataFrame = [Id: int, FirstName: string ... 2 more 
fields]

scala> df.printSchema()
root
 |-- Id: integer (nullable = true)
 |-- FirstName: string (nullable = true)
 |-- LastName: string (nullable = true)
 |-- Age: integer (nullable = true)

=
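Until the reader preserves the JDBC nullability metadata, one hedged workaround is 
to re-apply the intended schema after loading; a sketch using the column names from 
the example above:
{code:java}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Rebuild the DataFrame with the nullability known from the database (sketch only).
val desiredSchema = StructType(Seq(
  StructField("Id", IntegerType, nullable = false),
  StructField("FirstName", StringType, nullable = true),
  StructField("LastName", StringType, nullable = true),
  StructField("Age", IntegerType, nullable = true)))

val dfWithNullability = spark.createDataFrame(df.rdd, desiredSchema)
dfWithNullability.printSchema()  // Id should now show nullable = false
{code}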

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36925) Overriding parameters

2021-10-04 Thread Abhinav Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhinav Kumar updated SPARK-36925:
--
Description: 
Hello team, One quick question.

I read somewhere in the doc - Order of priority

Value set up in spark-submit overrides value set up in code overrides value set 
up in .conf

Is this the correct understanding? If so, we may have a problem.

I was trying to set spark.kubernetes.namespace. I had set a default in code but 
wanted to override it in spark-submit. The application kept failing; finally I 
had to remove the setting from the code to make it work.

The same applies to some other Kubernetes settings, such as the PersistentVolume 
setup.

 

  was:
Hello team, One quick question.

I read somewhere in the doc - Order of priority

Value set up in spark-submit overrides value set up in code overrides value set 
up in .conf

If this correct understanding. If so, we may have problem.

I was trying to set spark.kubernetes.namespace. I had used default in code but 
wanted to override in submit-submit. Application kept failing. Finally I had to 
remove from code to make it work.

Same for some other kubernetes setups like PersistentVolume set up.

 


> Overriding parameters
> -
>
> Key: SPARK-36925
> URL: https://issues.apache.org/jira/browse/SPARK-36925
> Project: Spark
>  Issue Type: Question
>  Components: Input/Output
>Affects Versions: 3.1.1, 3.1.2
>Reporter: Abhinav Kumar
>Priority: Minor
>
> Hello team, One quick question.
> I read somewhere in the doc - Order of priority
> Value set up in spark-submit overrides value set up in code overrides value 
> set up in .conf
> Is this correct understanding? If so, we may have problem.
> I was trying to set spark.kubernetes.namespace. I had used default in code 
> but wanted to override in submit-submit. Application kept failing. Finally I 
> had to remove from code to make it work.
> Same for some other kubernetes setups like PersistentVolume set up.
>  
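For what it's worth, the configuration guide describes the opposite order (values 
set on SparkConf in code take precedence over spark-submit flags, which take 
precedence over spark-defaults.conf), which would explain the behaviour seen here. 
A sketch of the practice that follows from that, with the namespace as the example:
{code:java}
import org.apache.spark.sql.SparkSession

// Keep deployment-specific settings out of code so spark-submit can control them.
val spark = SparkSession.builder()
  .appName("my-app")  // example application name
  // .config("spark.kubernetes.namespace", "default")  // avoid: a code-level value wins over spark-submit
  .getOrCreate()
{code}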



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36925) Overriding parameters

2021-10-04 Thread Abhinav Kumar (Jira)
Abhinav Kumar created SPARK-36925:
-

 Summary: Overriding parameters
 Key: SPARK-36925
 URL: https://issues.apache.org/jira/browse/SPARK-36925
 Project: Spark
  Issue Type: Question
  Components: Input/Output
Affects Versions: 3.1.2, 3.1.1
Reporter: Abhinav Kumar


Hello team, One quick question.

I read somewhere in the doc - Order of priority

Value set up in spark-submit overrides value set up in code overrides value set 
up in .conf

Is this the correct understanding? If so, we may have a problem.

I was trying to set spark.kubernetes.namespace. I had set a default in code but 
wanted to override it in spark-submit. The application kept failing; finally I 
had to remove the setting from the code to make it work.

The same applies to some other Kubernetes settings, such as the PersistentVolume 
setup.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36238) Spark UI load event timeline too slow for huge stage

2021-10-01 Thread Senthil Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423419#comment-17423419
 ] 

Senthil Kumar commented on SPARK-36238:
---

[~angerszhuuu] Did you try increasing heap memory for Spark History Server?

> Spark UI  load event timeline too slow for huge stage
> -
>
> Key: SPARK-36238
> URL: https://issues.apache.org/jira/browse/SPARK-36238
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36901) ERROR exchange.BroadcastExchangeExec: Could not execute broadcast in 300 secs

2021-10-01 Thread Senthil Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423409#comment-17423409
 ] 

Senthil Kumar commented on SPARK-36901:
---

[~rangareddy.av...@gmail.com]

This looks like normal Spark behaviour. Due to Spark's lazy evaluation, it tries 
to execute the BroadcastExchangeExec, finds that the cluster lacks resources, 
logs WARN messages, waits for 300s, and then logs an ERROR stating that the 
BroadcastExchangeExec timed out.
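For completeness, the two settings usually adjusted in this situation, as a sketch; 
whether either is appropriate depends on why the cluster has no free resources:
{code:java}
// Common mitigations (sketch): wait longer than the default 300s, or avoid broadcasting.
spark.conf.set("spark.sql.broadcastTimeout", "600")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")  // disables broadcast joins
{code}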

> ERROR exchange.BroadcastExchangeExec: Could not execute broadcast in 300 secs
> -
>
> Key: SPARK-36901
> URL: https://issues.apache.org/jira/browse/SPARK-36901
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Ranga Reddy
>Priority: Major
>
> While running Spark application, if there are no further resources to launch 
> executors, Spark application is failed after 5 mins with below exception.
> {code:java}
> 21/09/24 06:12:45 WARN cluster.YarnScheduler: Initial job has not accepted 
> any resources; check your cluster UI to ensure that workers are registered 
> and have sufficient resources
> ...
> 21/09/24 06:17:29 ERROR exchange.BroadcastExchangeExec: Could not execute 
> broadcast in 300 secs.
> java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
> ...
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after 
> [300 seconds]
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:146)
>   ... 71 more
> 21/09/24 06:17:30 INFO spark.SparkContext: Invoking stop() from shutdown hook
> {code}
> *Expectation*: it should either throw a proper exception saying *"there are no 
> further resources to run the application"* or it should *"wait till it gets 
> resources"*.
> To reproduce the issue we have used following sample code.
> *PySpark Code (test_broadcast_timeout.py):*
> {code:java}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName("Test Broadcast Timeout").getOrCreate()
> t1 = spark.range(5)
> t2 = spark.range(5)
> q = t1.join(t2,t1.id == t2.id)
> q.explain
> q.show(){code}
> *Spark Submit Command:*
> {code:java}
> spark-submit --executor-memory 512M test_broadcast_timeout.py{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-36861) Partition columns are overly eagerly parsed as dates

2021-10-01 Thread Senthil Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423402#comment-17423402
 ] 

Senthil Kumar edited comment on SPARK-36861 at 10/1/21, 7:24 PM:
-

Yes, in Spark 3.3 the hour column is inferred as "DateType", but I can still see 
the hour part in the subdirs created:

===

Spark session available as 'spark'.
 Welcome to
  __
 / __/__ ___ _/ /__
 _\ \/ _ \/ _ `/ __/ '_/
 /___/ .__/_,_/_/ /_/_\ version 3.3.0-SNAPSHOT
 /_/

Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 1.8.0_292)
 Type in expressions to have them evaluated.
 Type :help for more information.

scala> val df = Seq(("2021-01-01T00", 0), ("2021-01-01T01", 1), 
("2021-01-01T02", 2)).toDF("hour", "i")
 df: org.apache.spark.sql.DataFrame = [hour: string, i: int]

scala> df.write.partitionBy("hour").parquet("/tmp/t1")

scala> spark.read.parquet("/tmp/t1").schema
 res1: org.apache.spark.sql.types.StructType = 
StructType(StructField(i,IntegerType,true), StructField(hour,DateType,true))

scala>

===

 

and subdirs created are

===

ls -l
 total 0
 -rw-r--r-- 1 senthilkumar wheel 0 Oct 2 00:44 _SUCCESS
 drwxr-xr-x 4 senthilkumar wheel 128 Oct 2 00:44 hour=2021-01-01T00
 drwxr-xr-x 4 senthilkumar wheel 128 Oct 2 00:44 hour=2021-01-01T01
 drwxr-xr-x 4 senthilkumar wheel 128 Oct 2 00:44 hour=2021-01-01T02

===

 

It will be helpful if you share the list of sub-dirs created in your case.
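If keeping the hour value as a string is the goal, a hedged workaround is to 
disable partition-column type inference (or to pass an explicit schema when 
reading):
{code:java}
// Keep partition values as strings instead of letting Spark infer DateType (sketch).
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
spark.read.parquet("/tmp/t1").printSchema()  // hour should now come back as string
{code}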


was (Author: senthh):
Yes in Spark 3.3 hour column is created as "DateType" but I could see hour part 
in subdirs created

===

Spark session available as 'spark'.
Welcome to
  __
 / __/__ ___ _/ /__
 _\ \/ _ \/ _ `/ __/ '_/
 /___/ .__/\_,_/_/ /_/\_\ version 3.3.0-SNAPSHOT
 /_/
 
Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 1.8.0_292)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val df = Seq(("2021-01-01T00", 0), ("2021-01-01T01", 1), 
("2021-01-01T02", 2)).toDF("hour", "i")
df: org.apache.spark.sql.DataFrame = [hour: string, i: int]

scala> df.write.partitionBy("hour").parquet("/tmp/t1")
 
scala> spark.read.parquet("/tmp/t1").schema
res1: org.apache.spark.sql.types.StructType = 
StructType(StructField(i,IntegerType,true), StructField(hour,DateType,true))

scala>

===

 

and subdirs created are

===

ls -l
total 0
-rw-r--r-- 1 senthilkumar wheel 0 Oct 2 00:44 _SUCCESS
drwxr-xr-x 4 senthilkumar wheel 128 Oct 2 00:44 hour=2021-01-01T00
drwxr-xr-x 4 senthilkumar wheel 128 Oct 2 00:44 hour=2021-01-01T01
drwxr-xr-x 4 senthilkumar wheel 128 Oct 2 00:44 hour=2021-01-01T02

===

 

It will be helpful if you share the list of sub-dirs created in your case.

> Partition columns are overly eagerly parsed as dates
> 
>
> Key: SPARK-36861
> URL: https://issues.apache.org/jira/browse/SPARK-36861
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Tanel Kiis
>Priority: Blocker
>
> I have an input directory with subdirs:
> * hour=2021-01-01T00
> * hour=2021-01-01T01
> * hour=2021-01-01T02
> * ...
> in spark 3.1 the 'hour' column is parsed as a string type, but in 3.2 RC it 
> is parsed as date type and the hour part is lost.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36861) Partition columns are overly eagerly parsed as dates

2021-10-01 Thread Senthil Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423402#comment-17423402
 ] 

Senthil Kumar commented on SPARK-36861:
---

Yes, in Spark 3.3 the hour column is inferred as "DateType", but I can still see 
the hour part in the subdirs created:

===

Spark session available as 'spark'.
Welcome to
  __
 / __/__ ___ _/ /__
 _\ \/ _ \/ _ `/ __/ '_/
 /___/ .__/\_,_/_/ /_/\_\ version 3.3.0-SNAPSHOT
 /_/
 
Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 1.8.0_292)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val df = Seq(("2021-01-01T00", 0), ("2021-01-01T01", 1), 
("2021-01-01T02", 2)).toDF("hour", "i")
df: org.apache.spark.sql.DataFrame = [hour: string, i: int]

scala> df.write.partitionBy("hour").parquet("/tmp/t1")
 
scala> spark.read.parquet("/tmp/t1").schema
res1: org.apache.spark.sql.types.StructType = 
StructType(StructField(i,IntegerType,true), StructField(hour,DateType,true))

scala>

===

 

and subdirs created are

===

ls -l
total 0
-rw-r--r-- 1 senthilkumar wheel 0 Oct 2 00:44 _SUCCESS
drwxr-xr-x 4 senthilkumar wheel 128 Oct 2 00:44 hour=2021-01-01T00
drwxr-xr-x 4 senthilkumar wheel 128 Oct 2 00:44 hour=2021-01-01T01
drwxr-xr-x 4 senthilkumar wheel 128 Oct 2 00:44 hour=2021-01-01T02

===

 

It will be helpful if you share the list of sub-dirs created in your case.

> Partition columns are overly eagerly parsed as dates
> 
>
> Key: SPARK-36861
> URL: https://issues.apache.org/jira/browse/SPARK-36861
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Tanel Kiis
>Priority: Blocker
>
> I have an input directory with subdirs:
> * hour=2021-01-01T00
> * hour=2021-01-01T01
> * hour=2021-01-01T02
> * ...
> in spark 3.1 the 'hour' column is parsed as a string type, but in 3.2 RC it 
> is parsed as date type and the hour part is lost.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36861) Partition columns are overly eagerly parsed as dates

2021-09-28 Thread Senthil Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421328#comment-17421328
 ] 

Senthil Kumar commented on SPARK-36861:
---

[~tanelk] This issue is not reproducible even in 3.1.2

 

root
 |-- i: integer (nullable = true)
 |-- hour: string (nullable = true)

> Partition columns are overly eagerly parsed as dates
> 
>
> Key: SPARK-36861
> URL: https://issues.apache.org/jira/browse/SPARK-36861
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Tanel Kiis
>Priority: Blocker
>
> I have an input directory with subdirs:
> * hour=2021-01-01T00
> * hour=2021-01-01T01
> * hour=2021-01-01T02
> * ...
> in spark 3.1 the 'hour' column is parsed as a string type, but in 3.2 RC it 
> is parsed as date type and the hour part is lost.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36781) The log could not get the correct line number

2021-09-28 Thread Senthil Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421281#comment-17421281
 ] 

Senthil Kumar commented on SPARK-36781:
---

[~chenxusheng] Could you please share the sample code to simulate this issue?

> The log could not get the correct line number
> -
>
> Key: SPARK-36781
> URL: https://issues.apache.org/jira/browse/SPARK-36781
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.6, 3.0.3, 3.1.2
>Reporter: chenxusheng
>Priority: Major
>
> INFO 18:16:46 [Thread-1] 
> org.apache.spark.internal.Logging$class.logInfo({color:#FF}Logging.scala:54{color})
>  MemoryStore cleared
>  INFO 18:16:46 [Thread-1] 
> org.apache.spark.internal.Logging$class.logInfo({color:#FF}Logging.scala:54{color})
>  BlockManager stopped
>  INFO 18:16:46 [Thread-1] 
> org.apache.spark.internal.Logging$class.logInfo({color:#FF}Logging.scala:54{color})
>  BlockManagerMaster stopped
>  INFO 18:16:46 [dispatcher-event-loop-0] 
> org.apache.spark.internal.Logging$class.logInfo({color:#FF}Logging.scala:54{color})
>  OutputCommitCoordinator stopped!
>  INFO 18:16:46 [Thread-1] 
> org.apache.spark.internal.Logging$class.logInfo({color:#FF}Logging.scala:54{color})
>  Successfully stopped SparkContext
>  INFO 18:16:46 [Thread-1] 
> org.apache.spark.internal.Logging$class.logInfo({color:#FF}Logging.scala:54{color})
>  Shutdown hook called
> all are : {color:#FF}Logging.scala:54{color}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36853) Code failing on checkstyle

2021-09-26 Thread Abhinav Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420258#comment-17420258
 ] 

Abhinav Kumar commented on SPARK-36853:
---

This error is thrown on Windows during the Maven install phase. The build 
succeeds, but with these errors.

> Code failing on checkstyle
> --
>
> Key: SPARK-36853
> URL: https://issues.apache.org/jira/browse/SPARK-36853
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Abhinav Kumar
>Priority: Trivial
>
> There are more; just pasting a sample:
>  
> [INFO] There are 32 errors reported by Checkstyle 8.43 with 
> dev/checkstyle.xml ruleset.
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF11.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 107).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF12.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 116).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF13.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 104).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF13.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 125).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF14.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 109).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF14.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 134).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF15.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 114).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF15.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 143).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF16.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 119).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF16.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 152).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF17.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 124).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF17.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 161).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF18.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 129).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF18.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 170).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF19.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 134).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF19.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 179).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF20.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 139).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF20.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 188).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF21.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 144).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF21.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 197).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF22.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 149).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF22.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 206).
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[44,25] 
> (naming) MethodName: Method name 'ProcessingTime' must match pattern 
> '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[60,25] 
> (naming) MethodName: Method name 'ProcessingTime' must match pattern 
> '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[75,25] 
> (naming) MethodName: Method name 'ProcessingTime' must match pattern 
> '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[88,25] 
> (naming) MethodName: Method name 'ProcessingTime' must match pattern 
> '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[100,25] 
> (naming) MethodName: Method name 'Once' must match pattern 
> '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[110,25] 
> (naming) MethodName: Method name 'AvailableNow' must match pattern 
> 

[jira] [Created] (SPARK-36853) Code failing on checkstyle

2021-09-25 Thread Abhinav Kumar (Jira)
Abhinav Kumar created SPARK-36853:
-

 Summary: Code failing on checkstyle
 Key: SPARK-36853
 URL: https://issues.apache.org/jira/browse/SPARK-36853
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.3.0
Reporter: Abhinav Kumar
 Fix For: 3.3.0


There are more; just pasting a sample:

 

[INFO] There are 32 errors reported by Checkstyle 8.43 with dev/checkstyle.xml 
ruleset.
[ERROR] src\main\java\org\apache\spark\sql\api\java\UDF11.java:[29] (sizes) 
LineLength: Line is longer than 100 characters (found 107).
[ERROR] src\main\java\org\apache\spark\sql\api\java\UDF12.java:[29] (sizes) 
LineLength: Line is longer than 100 characters (found 116).
[ERROR] src\main\java\org\apache\spark\sql\api\java\UDF13.java:[28] (sizes) 
LineLength: Line is longer than 100 characters (found 104).
[ERROR] src\main\java\org\apache\spark\sql\api\java\UDF13.java:[29] (sizes) 
LineLength: Line is longer than 100 characters (found 125).
[ERROR] src\main\java\org\apache\spark\sql\api\java\UDF14.java:[28] (sizes) 
LineLength: Line is longer than 100 characters (found 109).
[ERROR] src\main\java\org\apache\spark\sql\api\java\UDF14.java:[29] (sizes) 
LineLength: Line is longer than 100 characters (found 134).
[ERROR] src\main\java\org\apache\spark\sql\api\java\UDF15.java:[28] (sizes) 
LineLength: Line is longer than 100 characters (found 114).
[ERROR] src\main\java\org\apache\spark\sql\api\java\UDF15.java:[29] (sizes) 
LineLength: Line is longer than 100 characters (found 143).
[ERROR] src\main\java\org\apache\spark\sql\api\java\UDF16.java:[28] (sizes) 
LineLength: Line is longer than 100 characters (found 119).
[ERROR] src\main\java\org\apache\spark\sql\api\java\UDF16.java:[29] (sizes) 
LineLength: Line is longer than 100 characters (found 152).
[ERROR] src\main\java\org\apache\spark\sql\api\java\UDF17.java:[28] (sizes) 
LineLength: Line is longer than 100 characters (found 124).
[ERROR] src\main\java\org\apache\spark\sql\api\java\UDF17.java:[29] (sizes) 
LineLength: Line is longer than 100 characters (found 161).
[ERROR] src\main\java\org\apache\spark\sql\api\java\UDF18.java:[28] (sizes) 
LineLength: Line is longer than 100 characters (found 129).
[ERROR] src\main\java\org\apache\spark\sql\api\java\UDF18.java:[29] (sizes) 
LineLength: Line is longer than 100 characters (found 170).
[ERROR] src\main\java\org\apache\spark\sql\api\java\UDF19.java:[28] (sizes) 
LineLength: Line is longer than 100 characters (found 134).
[ERROR] src\main\java\org\apache\spark\sql\api\java\UDF19.java:[29] (sizes) 
LineLength: Line is longer than 100 characters (found 179).
[ERROR] src\main\java\org\apache\spark\sql\api\java\UDF20.java:[28] (sizes) 
LineLength: Line is longer than 100 characters (found 139).
[ERROR] src\main\java\org\apache\spark\sql\api\java\UDF20.java:[29] (sizes) 
LineLength: Line is longer than 100 characters (found 188).
[ERROR] src\main\java\org\apache\spark\sql\api\java\UDF21.java:[28] (sizes) 
LineLength: Line is longer than 100 characters (found 144).
[ERROR] src\main\java\org\apache\spark\sql\api\java\UDF21.java:[29] (sizes) 
LineLength: Line is longer than 100 characters (found 197).
[ERROR] src\main\java\org\apache\spark\sql\api\java\UDF22.java:[28] (sizes) 
LineLength: Line is longer than 100 characters (found 149).
[ERROR] src\main\java\org\apache\spark\sql\api\java\UDF22.java:[29] (sizes) 
LineLength: Line is longer than 100 characters (found 206).
[ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[44,25] 
(naming) MethodName: Method name 'ProcessingTime' must match pattern 
'^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[60,25] 
(naming) MethodName: Method name 'ProcessingTime' must match pattern 
'^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[75,25] 
(naming) MethodName: Method name 'ProcessingTime' must match pattern 
'^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[88,25] 
(naming) MethodName: Method name 'ProcessingTime' must match pattern 
'^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[100,25] 
(naming) MethodName: Method name 'Once' must match pattern 
'^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[110,25] 
(naming) MethodName: Method name 'AvailableNow' must match pattern 
'^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[120,25] 
(naming) MethodName: Method name 'Continuous' must match pattern 
'^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[135,25] 
(naming) MethodName: Method name 'Continuous' must match pattern 
'^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] 

[jira] [Updated] (SPARK-36801) Document change for Spark sql jdbc

2021-09-19 Thread Senthil Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Senthil Kumar updated SPARK-36801:
--
Description: 
Reading via the Spark SQL JDBC DataSource does not preserve nullability: "non nullable" columns are changed to "nullable".

 

For example:

mysql> CREATE TABLE Persons(Id int NOT NULL, FirstName varchar(255), LastName 
varchar(255), Age int);
Query OK, 0 rows affected (0.04 sec)

mysql> show tables;
+-------------------+
| Tables_in_test_db |
+-------------------+
| Persons           |
+-------------------+
1 row in set (0.00 sec)

mysql> desc Persons;
+-----------+--------------+------+-----+---------+-------+
| Field     | Type         | Null | Key | Default | Extra |
+-----------+--------------+------+-----+---------+-------+
| Id        | int          | NO   |     | NULL    |       |
| FirstName | varchar(255) | YES  |     | NULL    |       |
| LastName  | varchar(255) | YES  |     | NULL    |       |
| Age       | int          | YES  |     | NULL    |       |
+-----------+--------------+------+-----+---------+-------+

 

 

{code:scala}
val df = spark.read.format("jdbc")
  .option("database", "Test_DB")
  .option("user", "root")
  .option("password", "")
  .option("driver", "com.mysql.cj.jdbc.Driver")
  .option("url", "jdbc:mysql://localhost:3306/Test_DB")
  .option("query", "(select * from Persons)")
  .load()
df.printSchema()
{code}

 

*output:*

 

root
 |-- Id: integer (nullable = true)
 |-- FirstName: string (nullable = true)
 |-- LastName: string (nullable = true)
 |-- Age: integer (nullable = true)

 

 

So we need to add a note in the documentation [1]: "All columns are automatically converted to be nullable for compatibility reasons."

 Ref:

[1] https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html#jdbc-to-other-databases
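Not from the ticket itself, but as an illustration: if the NOT NULL columns are known out of band, the nullability flags can be reinstated on the loaded DataFrame by rebuilding its schema. The helper name and the column set below are invented for this sketch.

{code:scala}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

// Hypothetical helper: mark the given columns as non-nullable again after a JDBC read.
// Re-applying the schema only changes metadata; it does not verify that the data is
// actually free of nulls.
def withNonNullable(df: DataFrame, nonNullableCols: Set[String]): DataFrame = {
  val fixedSchema = StructType(df.schema.map {
    case f if nonNullableCols.contains(f.name) => f.copy(nullable = false)
    case other => other
  })
  df.sparkSession.createDataFrame(df.rdd, fixedSchema)
}

// e.g. restore the NOT NULL flag on Id, mirroring the MySQL definition above:
// val fixed = withNonNullable(df, Set("Id"))
// fixed.printSchema()
{code}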

 

 

  was:
Reading using Spark SQL jdbc DataSource  does not maintain nullable type and 
changes "non nullable" columns to "nullable".

So we need to add a note, in Documentation[1], "All columns are automatically 
converted to be nullable for compatibility reasons."

 

[1 
]https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html#jdbc-to-other-databases


> Document change for Spark sql jdbc
> --
>
> Key: SPARK-36801
> URL: https://issues.apache.org/jira/browse/SPARK-36801
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.0.0, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2
>Reporter: Senthil Kumar
>Priority: Trivial
>
> Reading using Spark SQL jdbc DataSource  does not maintain nullable type and 
> changes "non nullable" columns to "nullable".
>  
> For example:
> mysql> CREATE TABLE Persons(Id int NOT NULL, FirstName varchar(255), LastName 
> varchar(255), Age int);
> Query OK, 0 rows affected (0.04 sec)
> mysql> show tables;
> +-------------------+
> | Tables_in_test_db |
> +-------------------+
> | Persons           |
> +-------------------+
> 1 row in set (0.00 sec)
> mysql> desc Persons;
> +-----------+--------------+------+-----+---------+-------+
> | Field     | Type         | Null | Key | Default | Extra |
> +-----------+--------------+------+-----+---------+-------+
> | Id        | int          | NO   |     | NULL    |       |
> | FirstName | varchar(255) | YES  |     | NULL    |       |
> | LastName  | varchar(255) | YES  |     | NULL    |       |
> | Age       | int          | YES  |     | NULL    |       |
> +-----------+--------------+------+-----+---------+-------+
>  
>  
> {code:scala}
> val df = spark.read.format("jdbc")
>   .option("database", "Test_DB")
>   .option("user", "root")
>   .option("password", "")
>   .option("driver", "com.mysql.cj.jdbc.Driver")
>   .option("url", "jdbc:mysql://localhost:3306/Test_DB")
>   .option("query", "(select * from Persons)")
>   .load()
> df.printSchema()
> {code}
>  
> *output:*
>  
> root
>  |-- Id: integer (nullable = true)
>  |-- FirstName: string (nullable = true)
>  |-- LastName: string (nullable = true)
>  |-- Age: integer (nullable = true)
>  
>  
> So we need to add a note, in Documentation[1], "All columns are automatically 
> converted to be nullable for compatibility reasons."
>  Ref:
> [1 
> 

[jira] [Updated] (SPARK-36801) Document change for Spark sql jdbc

2021-09-19 Thread Senthil Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Senthil Kumar updated SPARK-36801:
--
Description: 
Reading using Spark SQL jdbc DataSource  does not maintain nullable type and 
changes "non nullable" columns to "nullable".

So we need to add a note, in Documentation[1], "All columns are automatically 
converted to be nullable for compatibility reasons."

 

[1 
]https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html#jdbc-to-other-databases

  was:
Reading using Spark SQL jdbc DataSource  does not maintain nullable type and 
changes "non nullable" columns to "nullable".

So we need to add a note, in Documentation[1], "All columns are automatically converted to be nullable for compatibility reasons."

 

[1 
]https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html#jdbc-to-other-databases


> Document change for Spark sql jdbc
> --
>
> Key: SPARK-36801
> URL: https://issues.apache.org/jira/browse/SPARK-36801
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.0.0, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2
>Reporter: Senthil Kumar
>Priority: Trivial
>
> Reading using Spark SQL jdbc DataSource  does not maintain nullable type and 
> changes "non nullable" columns to "nullable".
> So we need to add a note, in Documentation[1], "All columns are automatically 
> converted to be nullable for compatibility reasons."
>  
> [1 
> ]https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html#jdbc-to-other-databases



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36801) Document change for Spark sql jdbc

2021-09-19 Thread Senthil Kumar (Jira)
Senthil Kumar created SPARK-36801:
-

 Summary: Document change for Spark sql jdbc
 Key: SPARK-36801
 URL: https://issues.apache.org/jira/browse/SPARK-36801
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 3.1.2, 3.1.1, 3.1.0, 3.0.3, 3.0.2, 3.0.0
Reporter: Senthil Kumar


Reading via the Spark SQL JDBC DataSource does not preserve nullability: "non nullable" columns are changed to "nullable".

So we need to add a note in the documentation [1]: "All columns are automatically converted to be nullable for compatibility reasons."

 

[1] https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html#jdbc-to-other-databases



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36743) Backporting SPARK-36327 changes into Spark 2.4 version

2021-09-14 Thread Senthil Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17414746#comment-17414746
 ] 

Senthil Kumar commented on SPARK-36743:
---

[~hyukjin.kwon], [~dongjoon]. Thanks for the kind and immediate response on 
this.

> Backporting SPARK-36327 changes into Spark 2.4 version
> --
>
> Key: SPARK-36743
> URL: https://issues.apache.org/jira/browse/SPARK-36743
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Senthil Kumar
>Priority: Minor
>
> Could we back port changes merged by PR 
> [https://github.com/apache/spark/pull/33577]  into Spark 2.4 too?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36743) Backporting SPARK-36327 changes into Spark 2.4 version

2021-09-13 Thread Senthil Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17414242#comment-17414242
 ] 

Senthil Kumar commented on SPARK-36743:
---

[~hyukjin.kwon], [~dongjoon]

> Backporting SPARK-36327 changes into Spark 2.4 version
> --
>
> Key: SPARK-36743
> URL: https://issues.apache.org/jira/browse/SPARK-36743
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Senthil Kumar
>Priority: Minor
> Fix For: 3.3.0
>
>
> Could we back port changes merged by PR 
> [https://github.com/apache/spark/pull/33577]  into Spark 2.4 too?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36743) Backporting SPARK-36327 changes into Spark 2.4 version

2021-09-13 Thread Senthil Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Senthil Kumar updated SPARK-36743:
--
Summary: Backporting SPARK-36327 changes into Spark 2.4 version  (was: 
Backporting changes into Spark 2.4 version)

> Backporting SPARK-36327 changes into Spark 2.4 version
> --
>
> Key: SPARK-36743
> URL: https://issues.apache.org/jira/browse/SPARK-36743
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Senthil Kumar
>Priority: Minor
> Fix For: 3.3.0
>
>
> Could we back port changes merged by PR 
> [https://github.com/apache/spark/pull/33577]  into Spark 2.4 too?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36743) Backporting changes into Spark 2.4 version

2021-09-13 Thread Senthil Kumar (Jira)
Senthil Kumar created SPARK-36743:
-

 Summary: Backporting changes into Spark 2.4 version
 Key: SPARK-36743
 URL: https://issues.apache.org/jira/browse/SPARK-36743
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Senthil Kumar
 Fix For: 3.3.0


Could we backport the changes merged by PR 
[https://github.com/apache/spark/pull/33577] into Spark 2.4 too?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-35623) Volcano resource manager for Spark on Kubernetes

2021-09-02 Thread Senthil Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17408750#comment-17408750
 ] 

Senthil Kumar edited comment on SPARK-35623 at 9/2/21, 11:56 AM:
-

[~dipanjanK] Include me too pls.

mail id: senthissen...@gmail.com


was (Author: senthh):
[~dipanjanK] Include me too pls

> Volcano resource manager for Spark on Kubernetes
> 
>
> Key: SPARK-35623
> URL: https://issues.apache.org/jira/browse/SPARK-35623
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Kubernetes
>Affects Versions: 3.1.1, 3.1.2
>Reporter: Dipanjan Kailthya
>Priority: Minor
>  Labels: kubernetes, resourcemanager
>
> Dear Spark Developers, 
>   
>  Hello from the Netherlands! Posting this here as I still haven't gotten 
> accepted to post in the spark dev mailing list.
>   
>  My team is planning to use spark with Kubernetes support on our shared 
> (multi-tenant) on premise Kubernetes cluster. However we would like to have 
> certain scheduling features like fair-share and preemption which as we 
> understand are not built into the current spark-kubernetes resource manager 
> yet. We have been working on and are close to a first successful prototype 
> integration with Volcano ([https://volcano.sh/en/docs/]). Briefly this means 
> a new resource manager component with lots in common with existing 
> spark-kubernetes resource manager, but instead of pods it launches Volcano 
> jobs which delegate the driver and executor pod creation and lifecycle 
> management to Volcano. We are interested in contributing this to open source, 
> either directly in spark or as a separate project.
>   
>  So, two questions: 
>   
>  1. Do the spark maintainers see this as a valuable contribution to the 
> mainline spark codebase? If so, can we have some guidance on how to publish 
> the changes? 
>   
>  2. Are any other developers / organizations interested to contribute to this 
> effort? If so, please get in touch.
>   
>  Best,
>  Dipanjan



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35623) Volcano resource manager for Spark on Kubernetes

2021-09-02 Thread Senthil Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17408750#comment-17408750
 ] 

Senthil Kumar commented on SPARK-35623:
---

[~dipanjanK] Include me too pls

> Volcano resource manager for Spark on Kubernetes
> 
>
> Key: SPARK-35623
> URL: https://issues.apache.org/jira/browse/SPARK-35623
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Kubernetes
>Affects Versions: 3.1.1, 3.1.2
>Reporter: Dipanjan Kailthya
>Priority: Minor
>  Labels: kubernetes, resourcemanager
>
> Dear Spark Developers, 
>   
>  Hello from the Netherlands! Posting this here as I still haven't gotten 
> accepted to post in the spark dev mailing list.
>   
>  My team is planning to use spark with Kubernetes support on our shared 
> (multi-tenant) on premise Kubernetes cluster. However we would like to have 
> certain scheduling features like fair-share and preemption which as we 
> understand are not built into the current spark-kubernetes resource manager 
> yet. We have been working on and are close to a first successful prototype 
> integration with Volcano ([https://volcano.sh/en/docs/]). Briefly this means 
> a new resource manager component with lots in common with existing 
> spark-kubernetes resource manager, but instead of pods it launches Volcano 
> jobs which delegate the driver and executor pod creation and lifecycle 
> management to Volcano. We are interested in contributing this to open source, 
> either directly in spark or as a separate project.
>   
>  So, two questions: 
>   
>  1. Do the spark maintainers see this as a valuable contribution to the 
> mainline spark codebase? If so, can we have some guidance on how to publish 
> the changes? 
>   
>  2. Are any other developers / organizations interested to contribute to this 
> effort? If so, please get in touch.
>   
>  Best,
>  Dipanjan



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36643) Add more information in ERROR log while SparkConf is modified when spark.sql.legacy.setCommandRejectsSparkCoreConfs is set

2021-09-01 Thread Senthil Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Senthil Kumar updated SPARK-36643:
--
Component/s: SQL

> Add more information in ERROR log while SparkConf is modified when 
> spark.sql.legacy.setCommandRejectsSparkCoreConfs is set
> --
>
> Key: SPARK-36643
> URL: https://issues.apache.org/jira/browse/SPARK-36643
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.1.2
>Reporter: Senthil Kumar
>Priority: Minor
>
> Right now, by default sql.legacy.setCommandRejectsSparkCoreConfs is set as 
> true in Spark 3.* versions int order to avoid changing Spark Confs. But from 
> the error message we get confused if we can not modify/change Spark conf in 
> Spark 3.* or not.
> Current Error Message :
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot 
> modify the value of a Spark config: spark.driver.host
>  at 
> org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:156)
>  at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:40){code}
>  
> So adding little more information( how to modify Spark Conf), in ERROR log 
> while SparkConf is modified when 
> spark.sql.legacy.setCommandRejectsSparkCoreConfs is 'true', will be helpful 
> to avoid confusions.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-36643) Add more information in ERROR log while SparkConf is modified when spark.sql.legacy.setCommandRejectsSparkCoreConfs is set

2021-09-01 Thread Senthil Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17408282#comment-17408282
 ] 

Senthil Kumar edited comment on SPARK-36643 at 9/1/21, 5:20 PM:


New ERROR message will be as below,

 
{code:java}
scala> spark.conf.set("spark.driver.host", "localhost") 
org.apache.spark.sql.AnalysisException: Cannot modify the value of a Spark 
config: spark.driver.host, please set 
spark.sql.legacy.setCommandRejectsSparkCoreConfs as 'false' in order to make 
change value of Spark config: spark.driver.host .
 at 
org.apache.spark.sql.errors.QueryCompilationErrors$.cannotModifyValueOfSparkConfigError(QueryCompilationErrors.scala:2336)
 at 
org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:157)
 at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:41)
... 47 elided{code}
 


was (Author: senthh):
New ERROR message will be as below,

 
{code:java}
scala> spark.conf.set("spark.driver.host", "localhost") 
org.apache.spark.sql.AnalysisException: Cannot modify the value of a Spark 
config: spark.driver.host, please set 
spark.sql.legacy.setCommandRejectsSparkCoreConfs as 'false' in order to make 
change value of Spark config: spark.driver.host .
 at 
org.apache.spark.sql.errors.QueryCompilationErrors$.cannotModifyValueOfSparkConfigError(QueryCompilationErrors.scala:2336)
 at 
org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:157)
 at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:41){code}

 ... 47 elided

> Add more information in ERROR log while SparkConf is modified when 
> spark.sql.legacy.setCommandRejectsSparkCoreConfs is set
> --
>
> Key: SPARK-36643
> URL: https://issues.apache.org/jira/browse/SPARK-36643
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Senthil Kumar
>Priority: Minor
>
> Right now, by default sql.legacy.setCommandRejectsSparkCoreConfs is set as 
> true in Spark 3.* versions int order to avoid changing Spark Confs. But from 
> the error message we get confused if we can not modify/change Spark conf in 
> Spark 3.* or not.
> Current Error Message :
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot 
> modify the value of a Spark config: spark.driver.host
>  at 
> org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:156)
>  at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:40){code}
>  
> So adding little more information( how to modify Spark Conf), in ERROR log 
> while SparkConf is modified when 
> spark.sql.legacy.setCommandRejectsSparkCoreConfs is 'true', will be helpful 
> to avoid confusions.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36643) Add more information in ERROR log while SparkConf is modified when spark.sql.legacy.setCommandRejectsSparkCoreConfs is set

2021-09-01 Thread Senthil Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17408282#comment-17408282
 ] 

Senthil Kumar commented on SPARK-36643:
---

New ERROR message will be as below,

 
{code:java}
scala> spark.conf.set("spark.driver.host", "localhost") 
org.apache.spark.sql.AnalysisException: Cannot modify the value of a Spark 
config: spark.driver.host, please set 
spark.sql.legacy.setCommandRejectsSparkCoreConfs as 'false' in order to make 
change value of Spark config: spark.driver.host .
 at 
org.apache.spark.sql.errors.QueryCompilationErrors$.cannotModifyValueOfSparkConfigError(QueryCompilationErrors.scala:2336)
 at 
org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:157)
 at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:41){code}

 ... 47 elided

> Add more information in ERROR log while SparkConf is modified when 
> spark.sql.legacy.setCommandRejectsSparkCoreConfs is set
> --
>
> Key: SPARK-36643
> URL: https://issues.apache.org/jira/browse/SPARK-36643
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Senthil Kumar
>Priority: Minor
>
> Right now, by default sql.legacy.setCommandRejectsSparkCoreConfs is set as 
> true in Spark 3.* versions int order to avoid changing Spark Confs. But from 
> the error message we get confused if we can not modify/change Spark conf in 
> Spark 3.* or not.
> Current Error Message :
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot 
> modify the value of a Spark config: spark.driver.host
>  at 
> org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:156)
>  at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:40){code}
>  
> So adding little more information( how to modify Spark Conf), in ERROR log 
> while SparkConf is modified when 
> spark.sql.legacy.setCommandRejectsSparkCoreConfs is 'true', will be helpful 
> to avoid confusions.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36643) Add more information in ERROR log while SparkConf is modified when spark.sql.legacy.setCommandRejectsSparkCoreConfs is set

2021-09-01 Thread Senthil Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17408274#comment-17408274
 ] 

Senthil Kumar commented on SPARK-36643:
---

Creating PR for this

> Add more information in ERROR log while SparkConf is modified when 
> spark.sql.legacy.setCommandRejectsSparkCoreConfs is set
> --
>
> Key: SPARK-36643
> URL: https://issues.apache.org/jira/browse/SPARK-36643
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Senthil Kumar
>Priority: Minor
>
> Right now, by default sql.legacy.setCommandRejectsSparkCoreConfs is set as 
> true in Spark 3.* versions int order to avoid changing Spark Confs. But from 
> the error message we get confused if we can not modify/change Spark conf in 
> Spark 3.* or not.
> Current Error Message :
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot 
> modify the value of a Spark config: spark.driver.host
>  at 
> org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:156)
>  at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:40){code}
>  
> So adding little more information( how to modify Spark Conf), in ERROR log 
> while SparkConf is modified when 
> spark.sql.legacy.setCommandRejectsSparkCoreConfs is 'true', will be helpful 
> to avoid confusions.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36643) Add more information in ERROR log while SparkConf is modified when spark.sql.legacy.setCommandRejectsSparkCoreConfs is set

2021-09-01 Thread Senthil Kumar (Jira)
Senthil Kumar created SPARK-36643:
-

 Summary: Add more information in ERROR log while SparkConf is 
modified when spark.sql.legacy.setCommandRejectsSparkCoreConfs is set
 Key: SPARK-36643
 URL: https://issues.apache.org/jira/browse/SPARK-36643
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.1.2
Reporter: Senthil Kumar


Right now, spark.sql.legacy.setCommandRejectsSparkCoreConfs is set to true by default 
in Spark 3.x in order to prevent changing Spark core confs. But from the error message 
it is unclear whether a Spark conf can be modified/changed in Spark 3.x at all.

Current Error Message :
{code:java}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot 
modify the value of a Spark config: spark.driver.host
 at 
org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:156)
 at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:40){code}
 

So adding a little more information (how to modify the Spark conf) to the ERROR log 
emitted when a SparkConf entry is modified while 
spark.sql.legacy.setCommandRejectsSparkCoreConfs is 'true' will be helpful to 
avoid confusion.
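For illustration only (not part of the proposed patch): a minimal sketch of the workaround the suggested message points to, assuming the legacy flag is honoured when supplied at session construction.

{code:scala}
// Sketch: relax the legacy guard so RuntimeConfig / SET no longer rejects Spark core confs.
// Whether changing a conf such as spark.driver.host at runtime has any practical effect
// is a separate question; this only illustrates the guard itself.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.legacy.setCommandRejectsSparkCoreConfs", "false")
  .getOrCreate()

// With the guard relaxed this call is expected not to throw AnalysisException.
spark.conf.set("spark.driver.host", "localhost")
{code}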

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36604) timestamp type column analyze result is wrong

2021-08-30 Thread Senthil Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17407068#comment-17407068
 ] 

Senthil Kumar commented on SPARK-36604:
---

[~yghu] I tested this scenario in Spark 2.4, but I don't see this issue occurring. 
Are you seeing it only in Spark 3.1.1?

 

 
{panel}
scala> spark.sql("create table c(a timestamp)")
res16: org.apache.spark.sql.DataFrame = []

scala> spark.sql("insert into c select '2021-08-15 15:30:01'")
res17: org.apache.spark.sql.DataFrame = []

scala> spark.sql("analyze table c compute statistics for columns a")
res18: org.apache.spark.sql.DataFrame = []

scala> spark.sql("desc formatted c a").show(true)
+--------------+--------------------+
|     info_name|          info_value|
+--------------+--------------------+
|      col_name|                   a|
|     data_type|           timestamp|
|       comment|                NULL|
|           min|2021-08-15 15:30:...|
|           max|2021-08-15 15:30:...|
|     num_nulls|                   0|
|distinct_count|                   1|
|   avg_col_len|                   8|
|   max_col_len|                   8|
|     histogram|                NULL|
+--------------+--------------------+
{panel}
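One variable not established in this thread, but worth recording when comparing results across environments, is the session time zone, since the rendering of timestamp min/max could plausibly differ with it. A runnable form of the same repro:

{code:scala}
// Repro from the ticket, plus the session time zone for context (an assumption,
// not a confirmed root cause).
spark.sql("create table c(a timestamp)")
spark.sql("insert into c select '2021-08-15 15:30:01'")
spark.sql("analyze table c compute statistics for columns a")
println(spark.conf.get("spark.sql.session.timeZone"))
spark.sql("desc formatted c a").show(truncate = false)
{code}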
 

> timestamp type column analyze result is wrong
> -
>
> Key: SPARK-36604
> URL: https://issues.apache.org/jira/browse/SPARK-36604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.1.2
> Environment: Spark 3.1.1
>Reporter: YuanGuanhu
>Priority: Major
>
> when we create table with timestamp column type, the min and max data of the 
> analyze result for the timestamp column is wrong
> eg:
> {code}
> > select * from a;
> {code}
> {code}
> 2021-08-15 15:30:01
> Time taken: 2.789 seconds, Fetched 1 row(s)
> spark-sql> desc formatted a a;
> col_name a
> data_type timestamp
> comment NULL
> min 2021-08-15 07:30:01.00
> max 2021-08-15 07:30:01.00
> num_nulls 0
> distinct_count 1
> avg_col_len 8
> max_col_len 8
> histogram NULL
> Time taken: 0.278 seconds, Fetched 10 row(s)
> spark-sql> desc a;
> a timestamp NULL
> Time taken: 1.432 seconds, Fetched 1 row(s)
> {code}
>  
> reproduce step:
> {code}
> create table a(a timestamp);
> insert into a select '2021-08-15 15:30:01';
> analyze table a compute statistics for columns a;
> desc formatted a a;
> select * from a;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36412) Add Test Coverage to meet viewFs(Hadoop federation) scenario

2021-08-04 Thread Senthil Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Senthil Kumar updated SPARK-36412:
--
Summary: Add Test  Coverage to meet viewFs(Hadoop federation) scenario  
(was: Create coverage Test to meet viewFs(Hadoop federation) scenario)

> Add Test  Coverage to meet viewFs(Hadoop federation) scenario
> -
>
> Key: SPARK-36412
> URL: https://issues.apache.org/jira/browse/SPARK-36412
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Senthil Kumar
>Priority: Major
>
> Create coverage Test to meet viewFs(Hadoop federation) scenario.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36412) Create coverage Test to meet viewFs(Hadoop federation) scenario

2021-08-04 Thread Senthil Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17392976#comment-17392976
 ] 

Senthil Kumar commented on SPARK-36412:
---

I am working on this

> Create coverage Test to meet viewFs(Hadoop federation) scenario
> ---
>
> Key: SPARK-36412
> URL: https://issues.apache.org/jira/browse/SPARK-36412
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Senthil Kumar
>Priority: Major
>
> Create coverage Test to meet viewFs(Hadoop federation) scenario.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36412) Create coverage Test to meet viewFs(Hadoop federation) scenario

2021-08-04 Thread Senthil Kumar (Jira)
Senthil Kumar created SPARK-36412:
-

 Summary: Create coverage Test to meet viewFs(Hadoop federation) 
scenario
 Key: SPARK-36412
 URL: https://issues.apache.org/jira/browse/SPARK-36412
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.2
Reporter: Senthil Kumar


Create coverage Test to meet viewFs(Hadoop federation) scenario.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36327) Spark sql creates staging dir inside database directory rather than creating inside table directory

2021-07-30 Thread Senthil Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390851#comment-17390851
 ] 

Senthil Kumar commented on SPARK-36327:
---

Hi [~sunchao]

Hive creates .staging directories inside the "/db/table/" location, but Spark SQL creates 
.staging directories inside the "/db/" location when we use Hadoop federation (viewFs). 
It works as expected (creating .staging inside the "/db/table/" location) for other 
filesystems like HDFS.

HIVE:
{{
# beeline
> use dicedb;
> insert into table part_test partition (j=1) values (1);
...
INFO : Loading data to table dicedb.part_test partition (j=1) from 
**viewfs://cloudera/user/daisuke/dicedb/part_test/j=1/.hive-staging_hive_2021-07-19_13-04-44_989_6775328876605030677-1/-ext-1**

}}

but spark's behaviour,

{{
spark-sql> use dicedb;
spark-sql> insert into table part_test partition (j=2) values (2);
21/07/19 13:07:37 INFO FileUtils: Creating directory if it doesn't exist: 
**viewfs://cloudera/user/daisuke/dicedb/.hive-staging_hive_2021-07-19_13-07-37_317_5083528872437596950-1**
... 
}}


The reason we require this change is that if we allow Spark SQL to create the .staging 
directory inside the "/db/" location, we end up with security issues: we would need to 
grant permissions on the "viewfs:///db/" location to every user who submits Spark jobs.

After this change is applied, Spark SQL creates .staging inside "/db/table/", similar to 
Hive, as below:

{{
spark-sql> use dicedb;
21/07/28 00:22:47 INFO SparkSQLCLIDriver: Time taken: 0.929 seconds
spark-sql> insert into table part_test partition (j=8) values (8);
21/07/28 00:23:25 INFO HiveMetaStoreClient: Closed a connection to metastore, 
current connections: 1
21/07/28 00:23:26 INFO FileUtils: Creating directory if it doesn't exist: 
**viewfs://cloudera/user/daisuke/dicedb/part_test/.hive-staging_hive_2021-07-28_00-23-26_109_4548714524589026450-1**
 
}}

The reason this issue doesn't appear in Hive but only in Spark SQL: in Hive, a 
"/db/table/tmp" directory structure is passed as the path, so path.getParent returns 
"/db/table/". But Spark passes just "/db/table", so using "path.getParent" is not 
required for Hadoop federation (viewFs); see the self-contained illustration below.
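
A self-contained illustration of the path resolution difference described above (the paths are copied from the example; this is not the actual patch):

{code:scala}
// Only hadoop-common's Path is needed to see the difference: resolving the staging
// directory against the table path keeps it under the table, while resolving against
// the parent puts it under the database directory.
import org.apache.hadoop.fs.Path

val tablePath  = new Path("viewfs://cloudera/user/daisuke/dicedb/part_test")
val stagingDir = ".hive-staging"

val underTable = new Path(tablePath, stagingDir)            // Hive-like behaviour
val underDb    = new Path(tablePath.getParent, stagingDir)  // current viewfs branch

println(underTable) // viewfs://cloudera/user/daisuke/dicedb/part_test/.hive-staging
println(underDb)    // viewfs://cloudera/user/daisuke/dicedb/.hive-staging
{code}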

 

> Spark sql creates staging dir inside database directory rather than creating 
> inside table directory
> ---
>
> Key: SPARK-36327
> URL: https://issues.apache.org/jira/browse/SPARK-36327
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.1.2
>Reporter: Senthil Kumar
>Priority: Minor
>
> Spark sql creates staging dir inside database directory rather than creating 
> inside table directory.
>  
> This arises only when viewfs:// is configured. When the location is hdfs://, 
> it doesn't occur.
>  
> Based on further investigation in file *SaveAsHiveFile.scala*, I could see 
> that the directory hierarchy has been not properly handled for viewFS 
> condition.
> Parent path(db path) is passed rather than passing the actual directory(table 
> location).
> {{
> // Mostly copied from Context.java#getExternalTmpPath of Hive 1.2
> private def newVersionExternalTempPath(
> path: Path,
> hadoopConf: Configuration,
> stagingDir: String): Path = {
> val extURI: URI = path.toUri
> if (extURI.getScheme == "viewfs")
> { getExtTmpPathRelTo(path.getParent, hadoopConf, stagingDir) }
> else
> { new Path(getExternalScratchDir(extURI, hadoopConf, stagingDir), 
> "-ext-1") }
> }
> }}
> Please refer below lines
> ===
> if (extURI.getScheme == "viewfs") {
> getExtTmpPathRelTo(path.getParent, hadoopConf, stagingDir)
> ===



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36327) Spark sql creates staging dir inside database directory rather than creating inside table directory

2021-07-29 Thread Senthil Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17389827#comment-17389827
 ] 

Senthil Kumar commented on SPARK-36327:
---

Hi [~dongjoon],  [~hyukjin.kwon]

 

Could you please review these minor changes? 

> Spark sql creates staging dir inside database directory rather than creating 
> inside table directory
> ---
>
> Key: SPARK-36327
> URL: https://issues.apache.org/jira/browse/SPARK-36327
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.1.2
>Reporter: Senthil Kumar
>Priority: Minor
>
> Spark sql creates staging dir inside database directory rather than creating 
> inside table directory.
>  
> This arises only when viewfs:// is configured. When the location is hdfs://, 
> it doesn't occur.
>  
> Based on further investigation in file *SaveAsHiveFile.scala*, I could see 
> that the directory hierarchy has been not properly handled for viewFS 
> condition.
> Parent path(db path) is passed rather than passing the actual directory(table 
> location).
> {{
> // Mostly copied from Context.java#getExternalTmpPath of Hive 1.2
> private def newVersionExternalTempPath(
> path: Path,
> hadoopConf: Configuration,
> stagingDir: String): Path = {
> val extURI: URI = path.toUri
> if (extURI.getScheme == "viewfs")
> { getExtTmpPathRelTo(path.getParent, hadoopConf, stagingDir) }
> else
> { new Path(getExternalScratchDir(extURI, hadoopConf, stagingDir), 
> "-ext-1") }
> }
> }}
> Please refer below lines
> ===
> if (extURI.getScheme == "viewfs") {
> getExtTmpPathRelTo(path.getParent, hadoopConf, stagingDir)
> ===



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36327) Spark sql creates staging dir inside database directory rather than creating inside table directory

2021-07-29 Thread Senthil Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Senthil Kumar updated SPARK-36327:
--
Component/s: SQL

> Spark sql creates staging dir inside database directory rather than creating 
> inside table directory
> ---
>
> Key: SPARK-36327
> URL: https://issues.apache.org/jira/browse/SPARK-36327
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.1.2
>Reporter: Senthil Kumar
>Priority: Minor
>
> Spark sql creates staging dir inside database directory rather than creating 
> inside table directory.
>  
> This arises only when viewfs:// is configured. When the location is hdfs://, 
> it doesn't occur.
>  
> Based on further investigation in file *SaveAsHiveFile.scala*, I could see 
> that the directory hierarchy has been not properly handled for viewFS 
> condition.
> Parent path(db path) is passed rather than passing the actual directory(table 
> location).
> {{
> // Mostly copied from Context.java#getExternalTmpPath of Hive 1.2
> private def newVersionExternalTempPath(
> path: Path,
> hadoopConf: Configuration,
> stagingDir: String): Path = {
> val extURI: URI = path.toUri
> if (extURI.getScheme == "viewfs")
> { getExtTmpPathRelTo(path.getParent, hadoopConf, stagingDir) }
> else
> { new Path(getExternalScratchDir(extURI, hadoopConf, stagingDir), 
> "-ext-1") }
> }
> }}
> Please refer below lines
> ===
> if (extURI.getScheme == "viewfs") {
> getExtTmpPathRelTo(path.getParent, hadoopConf, stagingDir)
> ===



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36327) Spark sql creates staging dir inside database directory rather than creating inside table directory

2021-07-29 Thread Senthil Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17389814#comment-17389814
 ] 

Senthil Kumar commented on SPARK-36327:
---

Created PR https://github.com/apache/spark/pull/33577

> Spark sql creates staging dir inside database directory rather than creating 
> inside table directory
> ---
>
> Key: SPARK-36327
> URL: https://issues.apache.org/jira/browse/SPARK-36327
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Senthil Kumar
>Priority: Minor
>
> Spark sql creates staging dir inside database directory rather than creating 
> inside table directory.
>  
> This arises only when viewfs:// is configured. When the location is hdfs://, 
> it doesn't occur.
>  
> Based on further investigation in file *SaveAsHiveFile.scala*, I could see 
> that the directory hierarchy has been not properly handled for viewFS 
> condition.
> Parent path(db path) is passed rather than passing the actual directory(table 
> location).
> {{
> // Mostly copied from Context.java#getExternalTmpPath of Hive 1.2
> private def newVersionExternalTempPath(
> path: Path,
> hadoopConf: Configuration,
> stagingDir: String): Path = {
> val extURI: URI = path.toUri
> if (extURI.getScheme == "viewfs")
> { getExtTmpPathRelTo(path.getParent, hadoopConf, stagingDir) }
> else
> { new Path(getExternalScratchDir(extURI, hadoopConf, stagingDir), 
> "-ext-1") }
> }
> }}
> Please refer below lines
> ===
> if (extURI.getScheme == "viewfs") {
> getExtTmpPathRelTo(path.getParent, hadoopConf, stagingDir)
> ===



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36327) Spark sql creates staging dir inside database directory rather than creating inside table directory

2021-07-28 Thread Senthil Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17388597#comment-17388597
 ] 

Senthil Kumar commented on SPARK-36327:
---

Shall I work on this Jira to fix this issue?

> Spark sql creates staging dir inside database directory rather than creating 
> inside table directory
> ---
>
> Key: SPARK-36327
> URL: https://issues.apache.org/jira/browse/SPARK-36327
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Senthil Kumar
>Priority: Minor
>
> Spark sql creates staging dir inside database directory rather than creating 
> inside table directory.
>  
> This arises only when viewfs:// is configured. When the location is hdfs://, 
> it doesn't occur.
>  
> Based on further investigation in file *SaveAsHiveFile.scala*, I could see 
> that the directory hierarchy has been not properly handled for viewFS 
> condition.
> Parent path(db path) is passed rather than passing the actual directory(table 
> location).
> {{
> // Mostly copied from Context.java#getExternalTmpPath of Hive 1.2
> private def newVersionExternalTempPath(
> path: Path,
> hadoopConf: Configuration,
> stagingDir: String): Path = {
> val extURI: URI = path.toUri
> if (extURI.getScheme == "viewfs")
> { getExtTmpPathRelTo(path.getParent, hadoopConf, stagingDir) }
> else
> { new Path(getExternalScratchDir(extURI, hadoopConf, stagingDir), 
> "-ext-1") }
> }
> }}
> Please refer below lines
> ===
> if (extURI.getScheme == "viewfs") {
> getExtTmpPathRelTo(path.getParent, hadoopConf, stagingDir)
> ===



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36327) Spark sql creates staging dir inside database directory rather than creating inside table directory

2021-07-28 Thread Senthil Kumar (Jira)
Senthil Kumar created SPARK-36327:
-

 Summary: Spark sql creates staging dir inside database directory 
rather than creating inside table directory
 Key: SPARK-36327
 URL: https://issues.apache.org/jira/browse/SPARK-36327
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.1.2
Reporter: Senthil Kumar


Spark sql creates staging dir inside database directory rather than creating 
inside table directory.

 

This arises only when viewfs:// is configured. When the location is hdfs://, it 
doesn't occur.

 

Based on further investigation in *SaveAsHiveFile.scala*, I could see that the directory 
hierarchy is not properly handled for the viewFS condition: the parent path (the database 
path) is passed rather than the actual directory (the table location).

{{
// Mostly copied from Context.java#getExternalTmpPath of Hive 1.2
private def newVersionExternalTempPath(
    path: Path,
    hadoopConf: Configuration,
    stagingDir: String): Path = {
  val extURI: URI = path.toUri
  if (extURI.getScheme == "viewfs") {
    getExtTmpPathRelTo(path.getParent, hadoopConf, stagingDir)
  } else {
    new Path(getExternalScratchDir(extURI, hadoopConf, stagingDir), "-ext-1")
  }
}
}}

Please refer to the lines below:

===
if (extURI.getScheme == "viewfs") {
getExtTmpPathRelTo(path.getParent, hadoopConf, stagingDir)
===



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36008) POM errors

2021-07-03 Thread Abhinav Kumar (Jira)
Abhinav Kumar created SPARK-36008:
-

 Summary: POM errors
 Key: SPARK-36008
 URL: https://issues.apache.org/jira/browse/SPARK-36008
 Project: Spark
  Issue Type: Question
  Components: Build
Affects Versions: 3.1.2, 3.1.1
Reporter: Abhinav Kumar


Usage of ZincServer within scala-maven-plugin is not allowed.

true -> results in error

 

Similar issue within mvn_scalafmt_2.12

${scalafmt.parameters} 
${scalafmt.skip} 

 

Should these be removed?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35740) ChromeUIHistoryServerSuite and ChromeUISeleniumSuite failed on Mac 10.15

2021-07-01 Thread Abhinav Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhinav Kumar resolved SPARK-35740.
---
Resolution: Not A Problem

> ChromeUIHistoryServerSuite and ChromeUISeleniumSuite failed on Mac 10.15
> 
>
> Key: SPARK-35740
> URL: https://issues.apache.org/jira/browse/SPARK-35740
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.1.1
>Reporter: Abhinav Kumar
>Priority: Minor
>
> import org.openqa.selenium.chrome.\{ChromeDriver, ChromeOptions} 
> ChromeDriver is not getting resolved.
>  
> Also - Class org.openqa.selenium.remote.RemoteWebDriver not found - 
> continuing with a stub.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-35740) ChromeUIHistoryServerSuite and ChromeUISeleniumSuite failed on Mac 10.15

2021-07-01 Thread Abhinav Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17372513#comment-17372513
 ] 

Abhinav Kumar edited comment on SPARK-35740 at 7/1/21, 8:16 AM:


This can be resolved. It works as suggested.


was (Author: abhinavofficial):
This

> ChromeUIHistoryServerSuite and ChromeUISeleniumSuite failed on Mac 10.15
> 
>
> Key: SPARK-35740
> URL: https://issues.apache.org/jira/browse/SPARK-35740
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.1.1
>Reporter: Abhinav Kumar
>Priority: Minor
>
> import org.openqa.selenium.chrome.\{ChromeDriver, ChromeOptions} 
> ChromeDriver is not getting resolved.
>  
> Also - Class org.openqa.selenium.remote.RemoteWebDriver not found - 
> continuing with a stub.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35740) ChromeUIHistoryServerSuite and ChromeUISeleniumSuite failed on Mac 10.15

2021-07-01 Thread Abhinav Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17372513#comment-17372513
 ] 

Abhinav Kumar commented on SPARK-35740:
---

This

> ChromeUIHistoryServerSuite and ChromeUISeleniumSuite failed on Mac 10.15
> 
>
> Key: SPARK-35740
> URL: https://issues.apache.org/jira/browse/SPARK-35740
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.1.1
>Reporter: Abhinav Kumar
>Priority: Minor
>
> import org.openqa.selenium.chrome.\{ChromeDriver, ChromeOptions} 
> ChromeDriver is not getting resolved.
>  
> Also - Class org.openqa.selenium.remote.RemoteWebDriver not found - 
> continuing with a stub.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19256) Hive bucketing write support

2021-06-23 Thread Pushkar Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368523#comment-17368523
 ] 

Pushkar Kumar commented on SPARK-19256:
---

Thank you [~chengsu]!!

> Hive bucketing write support
> 
>
> Key: SPARK-19256
> URL: https://issues.apache.org/jira/browse/SPARK-19256
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0, 3.0.0, 3.1.0
>Reporter: Tejas Patil
>Priority: Minor
>
> Update (2020 by Cheng Su):
> We use this JIRA to track progress for Hive bucketing write support in Spark. 
> The goal is for Spark to write Hive bucketed table, to be compatible with 
> other compute engines (Hive and Presto).
>  
> Current status for Hive bucketed table in Spark:
> Not support for reading Hive bucketed table: read bucketed table as 
> non-bucketed table.
> Wrong behavior for writing Hive ORC and Parquet bucketed table: write 
> orc/parquet bucketed table as non-bucketed table (code path: 
> InsertIntoHadoopFsRelationCommand -> FileFormatWriter).
> Do not allow for writing Hive non-ORC/Parquet bucketed table: throw exception 
> by default if writing non-orc/parquet bucketed table (code path: 
> InsertIntoHiveTable), and exception can be disabled by setting config 
> `hive.enforce.bucketing`=false and `hive.enforce.sorting`=false, which will 
> write as non-bucketed table.
>  
> Current status for Hive bucketed table in Hive:
> Hive 3.0.0 and after: support writing bucketed table with Hive murmur3hash 
> (https://issues.apache.org/jira/browse/HIVE-18910).
> Hive 1.x.y and 2.x.y: support writing bucketed table with Hive hivehash.
> Hive on Tez: support zero and multiple files per bucket 
> (https://issues.apache.org/jira/browse/HIVE-14014). And more code pointer on 
> read path - 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/metainfo/annotation/OpTraitsRulesProcFactory.java#L183-L212]
>  .
>  
> Current status for Hive bucketed table in Presto (take presto-sql here):
> Support writing bucketed table with Hive murmur3hash and hivehash 
> ([https://github.com/prestosql/presto/pull/1697]).
> Support zero and multiple files per bucket 
> ([https://github.com/prestosql/presto/pull/822]).
>  
> TLDR is to achieve Hive bucketed table compatibility across Spark, Presto and 
> Hive. Here with this JIRA, we need to add support writing Hive bucketed table 
> with Hive murmur3hash (for Hive 3.x.y) and hivehash (for Hive 1.x.y and 
> 2.x.y).
>  
> To allow Spark efficiently read Hive bucketed table, this needs more radical 
> change and we decide to wait until data source v2 supports bucketing, and do 
> the read path on data source v2. Read path will not covered by this JIRA.
>  
> Original description (2017 by Tejas Patil):
> JIRA to track design discussions and tasks related to Hive bucketing support 
> in Spark.
> Proposal : 
> [https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19256) Hive bucketing write support

2021-06-23 Thread Pushkar Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368089#comment-17368089
 ] 

Pushkar Kumar commented on SPARK-19256:
---

Hi [~chengsu] , Could you please update us here.

> Hive bucketing write support
> 
>
> Key: SPARK-19256
> URL: https://issues.apache.org/jira/browse/SPARK-19256
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0, 3.0.0, 3.1.0
>Reporter: Tejas Patil
>Priority: Minor
>
> Update (2020 by Cheng Su):
> We use this JIRA to track progress for Hive bucketing write support in Spark. 
> The goal is for Spark to write Hive bucketed table, to be compatible with 
> other compute engines (Hive and Presto).
>  
> Current status for Hive bucketed table in Spark:
> No support for reading Hive bucketed tables: a bucketed table is read as a 
> non-bucketed table.
> Incorrect behavior when writing Hive ORC and Parquet bucketed tables: the 
> ORC/Parquet bucketed table is written as a non-bucketed table (code path: 
> InsertIntoHadoopFsRelationCommand -> FileFormatWriter).
> Writing a Hive non-ORC/Parquet bucketed table is not allowed: an exception is 
> thrown by default (code path: InsertIntoHiveTable); the exception can be 
> disabled by setting `hive.enforce.bucketing`=false and 
> `hive.enforce.sorting`=false, which writes the table as non-bucketed.
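>  
> As a minimal illustration (the table names `bucketed_tbl` and `src_tbl` are 
> hypothetical, and how SET values propagate to the Hive client can vary by 
> Spark version), relaxing the enforcement configs mentioned above could look 
> roughly like this, at the cost of the data being written as non-bucketed:
> {code}
> // Hedged sketch: relax Hive's bucketing enforcement so the insert no longer
> // throws; the output is then written as a non-bucketed table.
> val spark = org.apache.spark.sql.SparkSession.builder()
>   .enableHiveSupport()
>   .getOrCreate()
> spark.sql("SET hive.enforce.bucketing=false")
> spark.sql("SET hive.enforce.sorting=false")
> // Hypothetical table names, for illustration only.
> spark.sql("INSERT OVERWRITE TABLE bucketed_tbl SELECT * FROM src_tbl")
> {code}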
>  
> Current status for Hive bucketed table in Hive:
> Hive 3.0.0 and later: supports writing bucketed tables with Hive murmur3hash 
> (https://issues.apache.org/jira/browse/HIVE-18910).
> Hive 1.x.y and 2.x.y: support writing bucketed tables with Hive hivehash.
> Hive on Tez: supports zero or multiple files per bucket 
> (https://issues.apache.org/jira/browse/HIVE-14014). A further code pointer for 
> the read path: 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/metainfo/annotation/OpTraitsRulesProcFactory.java#L183-L212].
>  
> Current status for Hive bucketed table in Presto (presto-sql here):
> Supports writing bucketed tables with Hive murmur3hash and hivehash 
> ([https://github.com/prestosql/presto/pull/1697]).
> Supports zero or multiple files per bucket 
> ([https://github.com/prestosql/presto/pull/822]).
>  
> TL;DR: the goal is to achieve Hive bucketed table compatibility across Spark, 
> Presto and Hive. With this JIRA, we need to add support for writing Hive 
> bucketed tables with Hive murmur3hash (for Hive 3.x.y) and hivehash (for Hive 
> 1.x.y and 2.x.y).
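>  
> For context, a minimal sketch of the bucket-id convention both hash families 
> share (assuming Hive's usual non-negative-modulo mapping; what differs 
> between murmur3hash and hivehash is the hash function and the serialization 
> fed into it, which must match between writer and reader):
> {code}
> // Hedged sketch: map a row's bucketing-column hash into [0, numBuckets).
> def bucketId(columnHash: Int, numBuckets: Int): Int =
>   (columnHash & Integer.MAX_VALUE) % numBuckets
>
> bucketId(-1568394893, 8)   // deterministic bucket in 0..7 for this hash value
> {code}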
>  
> Allowing Spark to efficiently read Hive bucketed tables requires a more 
> radical change; we decided to wait until Data Source V2 supports bucketing 
> and implement the read path there. The read path is not covered by this JIRA.
>  
> Original description (2017 by Tejas Patil):
> JIRA to track design discussions and tasks related to Hive bucketing support 
> in Spark.
> Proposal: 
> [https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35757) Add bitwise AND operation to BitArray and add intersect AND operation for bloom filters

2021-06-14 Thread Dhruv Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dhruv Kumar updated SPARK-35757:

Issue Type: Improvement  (was: New Feature)

> Add bitwise AND operation to BitArray and add intersect AND operation for 
> bloom filters
> ---
>
> Key: SPARK-35757
> URL: https://issues.apache.org/jira/browse/SPARK-35757
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Dhruv Kumar
>Priority: Trivial
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> This issue will add
>  # a bitwise AND operation to BitArray (similar to the existing `putAll` 
> method)
>  # an intersect operation for combining bloom filters using a bitwise AND 
> operation (similar to the existing `mergeInPlace` method).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35757) Add bitwise AND operation to BitArray and add intersect AND operation for bloom filters

2021-06-14 Thread Dhruv Kumar (Jira)
Dhruv Kumar created SPARK-35757:
---

 Summary: Add bitwise AND operation to BitArray and add intersect 
AND operation for bloom filters
 Key: SPARK-35757
 URL: https://issues.apache.org/jira/browse/SPARK-35757
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 3.1.2
Reporter: Dhruv Kumar


This issue will add
 # a bitwise AND operation to BitArray (similar to the existing `putAll` method)
 # an intersect operation for combining bloom filters using a bitwise AND 
operation (similar to the existing `mergeInPlace` method); see the sketch below.
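
A minimal sketch of the intended semantics (the class and method names below 
are illustrative, not the final API; the real BitArray lives in 
org.apache.spark.util.sketch): ANDing the underlying long words of two 
equally sized bit arrays yields a filter that never reports an item that 
either input definitely did not contain.
{code}
// Hedged sketch only: a simplified stand-in for BitArray, just enough to
// show the proposed AND / intersect semantics.
object BloomFilterIntersectSketch {
  final class SimpleBitArray(val words: Array[Long]) {
    // Proposed counterpart to putAll: keep only the bits set in both arrays.
    def and(other: SimpleBitArray): Unit = {
      require(words.length == other.words.length,
        "bit arrays must have the same number of words")
      var i = 0
      while (i < words.length) {
        words(i) &= other.words(i)
        i += 1
      }
    }
  }

  // Proposed counterpart to mergeInPlace: the intersected filter may still
  // report false positives, but never reports an item that either input
  // definitely did not contain.
  def intersectInPlace(left: SimpleBitArray, right: SimpleBitArray): SimpleBitArray = {
    left.and(right)
    left
  }
}
{code}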



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35740) ChromeUIHistoryServerSuite and ChromeUISeleniumSuite failed on Mac 10.15

2021-06-11 Thread Abhinav Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhinav Kumar updated SPARK-35740:
--
Description: 
import org.openqa.selenium.chrome.\{ChromeDriver, ChromeOptions} 

ChromeDriver is not getting resolved.

 

Also - Class org.openqa.selenium.remote.RemoteWebDriver not found - continuing 
with a stub.

  was:
import org.openqa.selenium.chrome.\{ChromeDriver, ChromeOptions}

 

ChromeDriver is not getting resolved.


> ChromeUIHistoryServerSuite and ChromeUISeleniumSuite failed on Mac 10.15
> 
>
> Key: SPARK-35740
> URL: https://issues.apache.org/jira/browse/SPARK-35740
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.1.1
>Reporter: Abhinav Kumar
>Priority: Minor
>
> import org.openqa.selenium.chrome.\{ChromeDriver, ChromeOptions} 
> ChromeDriver is not getting resolved.
>  
> Also - Class org.openqa.selenium.remote.RemoteWebDriver not found - 
> continuing with a stub.
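>  
> A guess at the likely cause (the artifact coordinates below are real, but 
> the version is illustrative and would need to match whatever Spark's build 
> pins): the Selenium artifacts are not on the test classpath, so ChromeDriver 
> and RemoteWebDriver cannot be resolved. In an sbt-style build that would 
> look roughly like:
> {code}
> // Hedged sketch: put the Selenium classes on the test classpath.
> libraryDependencies += "org.seleniumhq.selenium" % "selenium-java" % "3.141.59" % Test
> {code}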



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35740) ChromeUIHistoryServerSuite and ChromeUISeleniumSuite failed on Mac 10.15

2021-06-11 Thread Abhinav Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhinav Kumar updated SPARK-35740:
--
Summary: ChromeUIHistoryServerSuite and ChromeUISeleniumSuite failed on Mac 
10.15  (was: ChromeUIHistoryServerSuite failed on Mac 10.15)

> ChromeUIHistoryServerSuite and ChromeUISeleniumSuite failed on Mac 10.15
> 
>
> Key: SPARK-35740
> URL: https://issues.apache.org/jira/browse/SPARK-35740
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.1.1
>Reporter: Abhinav Kumar
>Priority: Minor
>
> import org.openqa.selenium.chrome.\{ChromeDriver, ChromeOptions}
>  
> ChromeDriver is not getting resolved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35740) ChromeUIHistoryServerSuite failed on Mac 10.15

2021-06-11 Thread Abhinav Kumar (Jira)
Abhinav Kumar created SPARK-35740:
-

 Summary: ChromeUIHistoryServerSuite failed on Mac 10.15
 Key: SPARK-35740
 URL: https://issues.apache.org/jira/browse/SPARK-35740
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.1.1
Reporter: Abhinav Kumar


import org.openqa.selenium.chrome.\{ChromeDriver, ChromeOptions}

 

ChromeDriver is not getting resolved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-10388) Public dataset loader interface

2021-05-25 Thread Gaurav Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-10388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaurav Kumar updated SPARK-10388:
-
Comment: was deleted

(was: I want to work on this issue [~mengxr], yet I am new to opensource. I 
would love to hear from you.)

> Public dataset loader interface
> ---
>
> Key: SPARK-10388
> URL: https://issues.apache.org/jira/browse/SPARK-10388
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Priority: Major
> Attachments: SPARK-10388PublicDataSetLoaderInterface.pdf
>
>
> It is very useful to have a public dataset loader to fetch ML datasets from 
> popular repos, e.g., libsvm and UCI. This JIRA is to discuss the design, 
> requirements, and initial implementation.
> {code}
> val loader = new DatasetLoader(sqlContext)
> val df = loader.get("libsvm", "rcv1_train.binary")
> {code}
> Users should be able to list (or preview) datasets, e.g.
> {code}
> val datasets = loader.ls("libsvm") // returns a local DataFrame
> datasets.show() // list all datasets under libsvm repo
> {code}
> It would be nice to allow 3rd-party packages to register new repos. Both the 
> API and implementation are pending discussion. Note that this requires http 
> and https support.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10388) Public dataset loader interface

2021-05-25 Thread Gaurav Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-10388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351097#comment-17351097
 ] 

Gaurav Kumar commented on SPARK-10388:
--

I want to work on this issue [~mengxr], yet I am new to opensource. I would 
love to hear from you.

> Public dataset loader interface
> ---
>
> Key: SPARK-10388
> URL: https://issues.apache.org/jira/browse/SPARK-10388
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Priority: Major
> Attachments: SPARK-10388PublicDataSetLoaderInterface.pdf
>
>
> It is very useful to have a public dataset loader to fetch ML datasets from 
> popular repos, e.g., libsvm and UCI. This JIRA is to discuss the design, 
> requirements, and initial implementation.
> {code}
> val loader = new DatasetLoader(sqlContext)
> val df = loader.get("libsvm", "rcv1_train.binary")
> {code}
> Users should be able to list (or preview) datasets, e.g.
> {code}
> val datasets = loader.ls("libsvm") // returns a local DataFrame
> datasets.show() // list all datasets under libsvm repo
> {code}
> It would be nice to allow 3rd-party packages to register new repos. Both the 
> API and implementation are pending discussion. Note that this requires http 
> and https support.
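>  
> One possible shape for that extension point (entirely hypothetical, since 
> both the API and implementation are pending discussion; every name below is 
> illustrative): a small repository trait that third-party packages register 
> by name and that the loader dispatches to.
> {code}
> // Hypothetical sketch of a pluggable repo API; not an existing Spark interface.
> trait DatasetRepository {
>   def name: String                    // e.g. "libsvm", "uci"
>   def list(): Seq[String]             // dataset names available in this repo
>   def fetch(dataset: String): String  // download and return a local path
> }
>
> object DatasetRepositories {
>   private val repos = scala.collection.mutable.Map.empty[String, DatasetRepository]
>   def register(repo: DatasetRepository): Unit = repos(repo.name) = repo
>   def lookup(name: String): Option[DatasetRepository] = repos.get(name)
> }
> {code}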



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


