[jira] [Updated] (SPARK-29815) Missing persist in ml.tuning.CrossValidator.fit()
[ https://issues.apache.org/jira/browse/SPARK-29815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dong Wang updated SPARK-29815: -- Description: dataset.toDF.rdd in ml.tuning.CrossValidator.fit(dataset: Dataset[_]) generates two RDDs: training and validation. Several actions operate on these two RDDs, but dataset.toDF.rdd is not persisted, which causes recomputation.
{code:scala}
// Compute metrics for each model over each split
val splits = MLUtils.kFold(dataset.toDF.rdd, $(numFolds), $(seed)) // dataset.toDF.rdd should be persisted
val metrics = splits.zipWithIndex.map { case ((training, validation), splitIndex) =>
  val trainingDataset = sparkSession.createDataFrame(training, schema).cache()
  val validationDataset = sparkSession.createDataFrame(validation, schema).cache()
{code}
This issue is reported by our tool CacheCheck, which dynamically detects persist()/unpersist() API misuses.
> Missing persist in ml.tuning.CrossValidator.fit()
> -
>
> Key: SPARK-29815
> URL: https://issues.apache.org/jira/browse/SPARK-29815
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Affects Versions: 2.4.3
> Reporter: Dong Wang
> Priority: Major
>
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
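A minimal sketch of the fix the report suggests (hypothetical, not the actual Spark patch; it assumes the surrounding CrossValidator.fit() context and reuses the names from the snippet above): persist the source RDD before kFold splits it, and release it once all folds are done.

```scala
// Hypothetical sketch, not the actual Spark patch: cache the source RDD once,
// since every fold's training/validation pair re-reads its lineage.
import org.apache.spark.storage.StorageLevel

val inputRDD = dataset.toDF.rdd.persist(StorageLevel.MEMORY_AND_DISK)
val splits = MLUtils.kFold(inputRDD, $(numFolds), $(seed))
val metrics = splits.zipWithIndex.map { case ((training, validation), splitIndex) =>
  val trainingDataset = sparkSession.createDataFrame(training, schema).cache()
  val validationDataset = sparkSession.createDataFrame(validation, schema).cache()
  // ... train and evaluate models for this split, as in the original code ...
}
inputRDD.unpersist() // release after the last action over the folds
```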
[jira] [Updated] (SPARK-29816) Missing persist in mllib.evaluation.BinaryClassificationMetrics.recallByThreshold()
[ https://issues.apache.org/jira/browse/SPARK-29816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dong Wang updated SPARK-29816: -- Description: The RDD scoreAndLabels.combineByKey is used by two actions (sortByKey and count()), so it needs to be persisted.
{code:scala}
val counts = scoreAndLabels.combineByKey(
  createCombiner = (label: Double) => new BinaryLabelCounter(0L, 0L) += label,
  mergeValue = (c: BinaryLabelCounter, label: Double) => c += label,
  mergeCombiners = (c1: BinaryLabelCounter, c2: BinaryLabelCounter) => c1 += c2
).sortByKey(ascending = false) // first use
val binnedCounts =
  // Only down-sample if bins is > 0
  if (numBins == 0) {
    // Use original directly
    counts
  } else {
    val countsSize = counts.count() // second use
{code}
This issue is reported by our tool CacheCheck, which dynamically detects persist()/unpersist() API misuses.
> Missing persist in mllib.evaluation.BinaryClassificationMetrics.recallByThreshold()
> ---
>
> Key: SPARK-29816
> URL: https://issues.apache.org/jira/browse/SPARK-29816
> Project: Spark
> Issue Type: Sub-task
> Components: MLlib
> Affects Versions: 2.4.3
> Reporter: Dong Wang
> Priority: Minor
>
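A minimal sketch of the suggested fix (hypothetical, not the actual Spark patch; it reuses the names from the snippet above): persist `counts` right after sortByKey, so the later count() does not recompute the combineByKey/sortByKey lineage.

```scala
// Hypothetical sketch: persist the sorted counts RDD, which is consumed by
// at least two jobs (the down-sampling count() and the later metric computation).
import org.apache.spark.storage.StorageLevel

val counts = scoreAndLabels.combineByKey(
    createCombiner = (label: Double) => new BinaryLabelCounter(0L, 0L) += label,
    mergeValue = (c: BinaryLabelCounter, label: Double) => c += label,
    mergeCombiners = (c1: BinaryLabelCounter, c2: BinaryLabelCounter) => c1 += c2
  ).sortByKey(ascending = false)
  .persist(StorageLevel.MEMORY_AND_DISK)
// ... counts is now computed once and reused by every action, then:
// counts.unpersist()
```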
[jira] [Created] (SPARK-29856) Conditional unnecessary persist on RDDs in ML algorithms
Dong Wang created SPARK-29856: - Summary: Conditional unnecessary persist on RDDs in ML algorithms Key: SPARK-29856 URL: https://issues.apache.org/jira/browse/SPARK-29856 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 3.0.0 Reporter: Dong Wang
When I run example.ml.GradientBoostedTreeRegressorExample, I find that the RDD _{color:#DE350B}baggedInput{color}_ in _ml.tree.impl.RandomForest.run()_ is persisted, but it is only used once, so the persist operation is unnecessary.
{code:scala}
val baggedInput = BaggedPoint
  .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees,
    withReplacement, (tp: TreePoint) => tp.weight, seed = seed)
  .persist(StorageLevel.MEMORY_AND_DISK)
...
while (nodeStack.nonEmpty) {
  ...
  timer.start("findBestSplits")
  RandomForest.findBestSplits(baggedInput, metadata, topNodesForGroup,
    nodesForGroup, treeToNodeToIndexInfo, splits, nodeStack, timer, nodeIdCache)
  timer.stop("findBestSplits")
}
baggedInput.unpersist()
{code}
However, the action on {color:#DE350B}_baggedInput_{color} is inside a while loop. In GradientBoostedTreeRegressorExample this loop executes only once, so only one action uses {color:#DE350B}_baggedInput_{color}. In most ML applications the loop executes many times, which means {color:#DE350B}_baggedInput_{color} is used by many actions, and the persist is then necessary. That is why the persist operation is "conditionally" unnecessary. The same situation exists in many other ML algorithms, e.g., the RDD {color:#DE350B}_instances_{color} in ml.clustering.KMeans.fit() and the RDD {color:#DE350B}_indices_{color} in mllib.clustering.BisectingKMeans.run(). This issue is reported by our tool CacheCheck, which dynamically detects persist()/unpersist() API misuses.
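One way the "conditional" persist could be expressed (a hypothetical sketch only; the expectedPasses value is an assumption, not an existing Spark field): pay the caching cost only when more than one action over the RDD is expected.

```scala
// Hypothetical sketch: persist baggedInput only if more than one pass
// (i.e., more than one action in the training loop) over it is expected.
import org.apache.spark.storage.StorageLevel

val expectedPasses = numIterations // assumed to be known before the loop starts
val baggedInput = BaggedPoint
  .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees,
    withReplacement, (tp: TreePoint) => tp.weight, seed = seed)
if (expectedPasses > 1) baggedInput.persist(StorageLevel.MEMORY_AND_DISK)
// ... training loop running one action per iteration ...
if (expectedPasses > 1) baggedInput.unpersist()
```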
[jira] [Resolved] (SPARK-29001) Print better log when process of events becomes slow
[ https://issues.apache.org/jira/browse/SPARK-29001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29001. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25702 [https://github.com/apache/spark/pull/25702]
> Print better log when process of events becomes slow
>
> Key: SPARK-29001
> URL: https://issues.apache.org/jira/browse/SPARK-29001
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Xingbo Jiang
> Assignee: Xingbo Jiang
> Priority: Minor
> Fix For: 3.0.0
>
> We shall print a better log when processing of events becomes slow, to help find out which type of event is slow.
[jira] [Assigned] (SPARK-29519) SHOW TBLPROPERTIES should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29519: --- Assignee: Pablo Langa Blanco
> SHOW TBLPROPERTIES should look up catalog/table like v2 commands
>
> Key: SPARK-29519
> URL: https://issues.apache.org/jira/browse/SPARK-29519
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Pablo Langa Blanco
> Assignee: Pablo Langa Blanco
> Priority: Major
>
> SHOW TBLPROPERTIES should look up catalog/table like v2 commands
[jira] [Resolved] (SPARK-29519) SHOW TBLPROPERTIES should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29519. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26176 [https://github.com/apache/spark/pull/26176]
> SHOW TBLPROPERTIES should look up catalog/table like v2 commands
>
> Key: SPARK-29519
> URL: https://issues.apache.org/jira/browse/SPARK-29519
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Pablo Langa Blanco
> Assignee: Pablo Langa Blanco
> Priority: Major
> Fix For: 3.0.0
>
> SHOW TBLPROPERTIES should look up catalog/table like v2 commands
[jira] [Created] (SPARK-29855) typed literals with negative sign with proper result or exception
Kent Yao created SPARK-29855: Summary: typed literals with negative sign with proper result or exception Key: SPARK-29855 URL: https://issues.apache.org/jira/browse/SPARK-29855 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Kent Yao
{code:java}
-- !query 83
select -integer '7'
-- !query 83 schema
struct<7:int>
-- !query 83 output
7

-- !query 86
select -date '1999-01-01'
-- !query 86 schema
struct
-- !query 86 output
1999-01-01

-- !query 87
select -timestamp '1999-01-01'
-- !query 87 schema
struct
-- !query 87 output
1999-01-01 00:00:00
{code}
The integer result should be -7, and the date and timestamp results are confusing; those cases should throw an exception instead.
[jira] [Commented] (SPARK-29853) lpad returning empty instead of NULL for empty pad value
[ https://issues.apache.org/jira/browse/SPARK-29853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972079#comment-16972079 ] Ankit Raj Boudh commented on SPARK-29853: - [~hyukjin.kwon], please review the PR.
> lpad returning empty instead of NULL for empty pad value
>
> Key: SPARK-29853
> URL: https://issues.apache.org/jira/browse/SPARK-29853
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: ABHISHEK KUMAR GUPTA
> Priority: Minor
>
> Spark:
> 0: jdbc:hive2://10.18.18.214:23040/default> SELECT lpad('hi', 5, '');
> ++--+
> | lpad(hi, 5, ) |
> ++--+
> | hi |
> ++--+
> 1 row selected (0.186 seconds)
> Hive:
> INFO : Concurrency mode is disabled, not creating a lock manager
> +---+
> | _c0 |
> +---+
> | NULL |
> +---+
[jira] [Updated] (SPARK-29854) lpad and rpad built in function not throw Exception for invalid len value
[ https://issues.apache.org/jira/browse/SPARK-29854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ABHISHEK KUMAR GUPTA updated SPARK-29854: - Description:
Spark (Returns Empty String):
0: jdbc:hive2://10.18.19.208:23040/default> SELECT lpad('hihhh', 5000, '');
++
|lpad(hihhh, CAST(5000 AS INT), )|
++
++
Hive:
SELECT lpad('hihhh', 5000, '');
Error: Error while compiling statement: FAILED: SemanticException [Error 10016]: Line 1:67 Argument type mismatch '''': lpad only takes INT/SHORT/BYTE types as 2-ths argument, got DECIMAL (state=42000,code=10016)
PostgreSQL: function lpad(unknown, numeric, unknown) does not exist
Expected output: Spark should also throw an exception, like Hive.
> lpad and rpad built in function not throw Exception for invalid len value
> -
>
> Key: SPARK-29854
> URL: https://issues.apache.org/jira/browse/SPARK-29854
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: ABHISHEK KUMAR GUPTA
> Priority: Minor
>
[jira] [Updated] (SPARK-29854) lpad and rpad built in function not throw Exception for invalid len value
[ https://issues.apache.org/jira/browse/SPARK-29854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ABHISHEK KUMAR GUPTA updated SPARK-29854: - Description:
Spark (Returns Empty String):
0: jdbc:hive2://10.18.19.208:23040/default> SELECT lpad('hihhh', 5000, '');
++
|lpad(hihhh, CAST(5000 AS INT), )|
++
| |
++
Hive:
SELECT lpad('hihhh', 5000, '');
Error: Error while compiling statement: FAILED: SemanticException [Error 10016]: Line 1:67 Argument type mismatch '''': lpad only takes INT/SHORT/BYTE types as 2-ths argument, got DECIMAL (state=42000,code=10016)
PostgreSQL: function lpad(unknown, numeric, unknown) does not exist
Expected output: Spark should also throw an exception, like Hive.
> lpad and rpad built in function not throw Exception for invalid len value
> -
>
> Key: SPARK-29854
> URL: https://issues.apache.org/jira/browse/SPARK-29854
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: ABHISHEK KUMAR GUPTA
> Priority: Minor
>
[jira] [Commented] (SPARK-29854) lpad and rpad built in function not throw Exception for invalid len value
[ https://issues.apache.org/jira/browse/SPARK-29854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972076#comment-16972076 ] Ankit Raj Boudh commented on SPARK-29854: - I will raise a PR for this.
> lpad and rpad built in function not throw Exception for invalid len value
> -
>
> Key: SPARK-29854
> URL: https://issues.apache.org/jira/browse/SPARK-29854
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: ABHISHEK KUMAR GUPTA
> Priority: Minor
>
> Spark (Returns Empty String):
> 0: jdbc:hive2://10.18.19.208:23040/default> SELECT lpad('hihhh', 5000, '');
> ++
> | lpad(hihhh, CAST(5000 AS INT), ) |
> ++
> | |
> ++
> Hive:
> SELECT lpad('hihhh', 5000, '');
> Error: Error while compiling statement: FAILED: SemanticException [Error 10016]: Line 1:67 Argument type mismatch '''': lpad only takes INT/SHORT/BYTE types as 2-ths argument, got DECIMAL (state=42000,code=10016)
> PostgreSQL: function lpad(unknown, numeric, unknown) does not exist
>
> Expected output: Spark should also throw an exception, like Hive.
>
[jira] [Created] (SPARK-29854) lpad and rpad built in function not throw Exception for invalid len value
ABHISHEK KUMAR GUPTA created SPARK-29854: Summary: lpad and rpad built in function not throw Exception for invalid len value Key: SPARK-29854 URL: https://issues.apache.org/jira/browse/SPARK-29854 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: ABHISHEK KUMAR GUPTA
Spark (Returns Empty String):
0: jdbc:hive2://10.18.19.208:23040/default> SELECT lpad('hihhh', 5000, '');
++
| lpad(hihhh, CAST(5000 AS INT), ) |
++
| |
++
Hive:
SELECT lpad('hihhh', 5000, '');
Error: Error while compiling statement: FAILED: SemanticException [Error 10016]: Line 1:67 Argument type mismatch '''': lpad only takes INT/SHORT/BYTE types as 2-ths argument, got DECIMAL (state=42000,code=10016)
PostgreSQL: function lpad(unknown, numeric, unknown) does not exist
Expected output: Spark should also throw an exception, like Hive.
[jira] [Commented] (SPARK-29853) lpad returning empty instead of NULL for empty pad value
[ https://issues.apache.org/jira/browse/SPARK-29853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972073#comment-16972073 ] Ankit Raj Boudh commented on SPARK-29853: - I will raise a PR for this.
> lpad returning empty instead of NULL for empty pad value
>
> Key: SPARK-29853
> URL: https://issues.apache.org/jira/browse/SPARK-29853
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: ABHISHEK KUMAR GUPTA
> Priority: Minor
>
> Spark:
> 0: jdbc:hive2://10.18.18.214:23040/default> SELECT lpad('hi', 5, '');
> ++--+
> | lpad(hi, 5, ) |
> ++--+
> | hi |
> ++--+
> 1 row selected (0.186 seconds)
> Hive:
> INFO : Concurrency mode is disabled, not creating a lock manager
> +---+
> | _c0 |
> +---+
> | NULL |
> +---+
[jira] [Created] (SPARK-29853) lpad returning empty instead of NULL for empty pad value
ABHISHEK KUMAR GUPTA created SPARK-29853: Summary: lpad returning empty instead of NULL for empty pad value Key: SPARK-29853 URL: https://issues.apache.org/jira/browse/SPARK-29853 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: ABHISHEK KUMAR GUPTA
Spark:
0: jdbc:hive2://10.18.18.214:23040/default> SELECT lpad('hi', 5, '');
++--+
| lpad(hi, 5, ) |
++--+
| hi |
++--+
1 row selected (0.186 seconds)
Hive:
INFO : Concurrency mode is disabled, not creating a lock manager
+---+
| _c0 |
+---+
| NULL |
+---+
[jira] [Comment Edited] (SPARK-29776) rpad returning invalid value when parameter is empty
[ https://issues.apache.org/jira/browse/SPARK-29776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972035#comment-16972035 ] Ankit Raj Boudh edited comment on SPARK-29776 at 11/12/19 5:32 AM: --- [https://github.com/apache/spark/pull/26477], [~hyukjin.kwon] please review this PR. was (Author: ankitraj): yes, today i will submit PR for this.
> rpad returning invalid value when parameter is empty
>
> Key: SPARK-29776
> URL: https://issues.apache.org/jira/browse/SPARK-29776
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: ABHISHEK KUMAR GUPTA
> Priority: Major
>
> As per the rpad definition:
> rpad(str, len, pad) - Returns str, right-padded with pad to a length of len. If str is longer than len, the return value is shortened to len characters. *In case of an empty pad string, the return value is null.*
> Example in Spark:
> {code}
> 0: jdbc:hive2://10.18.19.208:23040/default> SELECT rpad('hi', 5, '');
> ++
> | rpad(hi, 5, ) |
> ++
> | hi |
> ++
> {code}
> It should return NULL as per the definition.
> Hive behavior is correct: as per the definition it returns NULL when pad is an empty string.
> INFO : Concurrency mode is disabled, not creating a lock manager
> {code}
> +---+
> | _c0 |
> +---+
> | NULL |
> +---+
> {code}
[jira] [Commented] (SPARK-29778) saveAsTable append mode is not passing writer options
[ https://issues.apache.org/jira/browse/SPARK-29778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972048#comment-16972048 ] Wesley Hoffman commented on SPARK-29778: - Looks like my PR linked right up. How handy! The test works by creating a custom query listener so that I can gather the {{LogicalPlan}} and assert the proper {{writeOptions}}.
> saveAsTable append mode is not passing writer options
> -
>
> Key: SPARK-29778
> URL: https://issues.apache.org/jira/browse/SPARK-29778
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Burak Yavuz
> Priority: Critical
>
> There was an oversight where AppendData is not getting the WriterOptions in saveAsTable.
> [https://github.com/apache/spark/blob/782992c7ed652400e33bc4b1da04c8155b7b3866/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L530]
[jira] [Updated] (SPARK-29792) SQL metrics cannot be updated to subqueries in AQE
[ https://issues.apache.org/jira/browse/SPARK-29792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ke Jia updated SPARK-29792: --- Description: After [SPARK-28583|https://issues.apache.org/jira/browse/SPARK-28583] was merged, the subquery metrics cannot be updated in AQE. This Jira will fix it.
> SQL metrics cannot be updated to subqueries in AQE
> --
>
> Key: SPARK-29792
> URL: https://issues.apache.org/jira/browse/SPARK-29792
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Wei Xue
> Assignee: Ke Jia
> Priority: Major
>
[jira] [Commented] (SPARK-29776) rpad returning invalid value when parameter is empty
[ https://issues.apache.org/jira/browse/SPARK-29776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972035#comment-16972035 ] Ankit Raj Boudh commented on SPARK-29776: - Yes, I will submit a PR for this today.
> rpad returning invalid value when parameter is empty
>
> Key: SPARK-29776
> URL: https://issues.apache.org/jira/browse/SPARK-29776
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: ABHISHEK KUMAR GUPTA
> Priority: Major
>
> As per the rpad definition:
> rpad(str, len, pad) - Returns str, right-padded with pad to a length of len. If str is longer than len, the return value is shortened to len characters. *In case of an empty pad string, the return value is null.*
> Example in Spark:
> {code}
> 0: jdbc:hive2://10.18.19.208:23040/default> SELECT rpad('hi', 5, '');
> ++
> | rpad(hi, 5, ) |
> ++
> | hi |
> ++
> {code}
> It should return NULL as per the definition.
> Hive behavior is correct: as per the definition it returns NULL when pad is an empty string.
> INFO : Concurrency mode is disabled, not creating a lock manager
> {code}
> +---+
> | _c0 |
> +---+
> | NULL |
> +---+
> {code}
[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972019#comment-16972019 ] zhao bo commented on SPARK-29106: - The ARM worker is back. Sorry for the delay.
> Add jenkins arm test for spark
> --
>
> Key: SPARK-29106
> URL: https://issues.apache.org/jira/browse/SPARK-29106
> Project: Spark
> Issue Type: Test
> Components: Tests
> Affects Versions: 3.0.0
> Reporter: huangtianhua
> Priority: Minor
> Attachments: R-ansible.yml, R-libs.txt, arm-python36.txt
>
> Add ARM test jobs to AMPLab Jenkins for Spark.
> So far we have set up two periodic ARM test jobs for Spark in OpenLab: one is based on master with Hadoop 2.7 (similar to the QA test on AMPLab Jenkins), and the other is based on a new branch we created on 09-09; see [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64] and [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64]. We only have to care about the first one when integrating the ARM test with AMPLab Jenkins.
> About the k8s test on ARM: we have tested it, see [https://github.com/theopenlab/spark/pull/17]; maybe we can integrate it later.
> We also plan to test other stable branches, and we can integrate them into AMPLab when they are ready.
> We have offered an ARM instance and sent the info to Shane Knapp; thanks Shane for adding the first ARM job to AMPLab Jenkins :)
> The other important thing is about leveldbjni [https://github.com/fusesource/leveldbjni]. Spark depends on leveldbjni-all-1.8 [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8], which has no arm64 support. So we built an arm64-supporting release of leveldbjni, see [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8]. However, we can't modify the Spark pom.xml directly with something like 'property'/'profile' to choose the correct jar on ARM or x86, because Spark depends on some Hadoop packages like hadoop-hdfs which also depend on leveldbjni-all-1.8, unless Hadoop releases a new ARM-supporting leveldbjni jar. For now we download the leveldbjni-all-1.8 from openlabtesting and 'mvn install' it when testing Spark on ARM.
> PS: Issues found and fixed:
> SPARK-28770 [https://github.com/apache/spark/pull/25673]
> SPARK-28519 [https://github.com/apache/spark/pull/25279]
> SPARK-28433 [https://github.com/apache/spark/pull/25186]
> SPARK-28467 [https://github.com/apache/spark/pull/25864]
> SPARK-29286 [https://github.com/apache/spark/pull/26021]
[jira] [Assigned] (SPARK-29808) StopWordsRemover should support multi-cols
[ https://issues.apache.org/jira/browse/SPARK-29808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reassigned SPARK-29808: Assignee: Huaxin Gao
> StopWordsRemover should support multi-cols
> --
>
> Key: SPARK-29808
> URL: https://issues.apache.org/jira/browse/SPARK-29808
> Project: Spark
> Issue Type: Improvement
> Components: ML, PySpark
> Affects Versions: 3.0.0
> Reporter: zhengruifeng
> Assignee: Huaxin Gao
> Priority: Minor
>
> As a basic Transformer, StopWordsRemover should support multi-cols.
> Param {color:#93a6f5}stopWords{color} can be applied across all columns.
[jira] [Updated] (SPARK-29851) V2 Catalog: Default behavior of dropping namespace is cascading
[ https://issues.apache.org/jira/browse/SPARK-29851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Terry Kim updated SPARK-29851: -- Summary: V2 Catalog: Default behavior of dropping namespace is cascading (was: DataSourceV2: Default behavior of dropping namespace is cascading)
> V2 Catalog: Default behavior of dropping namespace is cascading
> ---
>
> Key: SPARK-29851
> URL: https://issues.apache.org/jira/browse/SPARK-29851
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Terry Kim
> Priority: Major
>
> Instead of introducing an additional 'cascade' option to dropNamespace(), the default behavior of dropping a namespace will be cascading. Now, to implement the cascade option, the Spark side needs to ensure a namespace is empty before calling dropNamespace().
[jira] [Updated] (SPARK-29852) Implement parallel preemptive RDD.toLocalIterator and Dataset.toLocalIterator
[ https://issues.apache.org/jira/browse/SPARK-29852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Cheng updated SPARK-29852: --- Issue Type: Improvement (was: New Feature) > Implement parallel preemptive RDD.toLocalIterator and Dataset.toLocalIterator > - > > Key: SPARK-29852 > URL: https://issues.apache.org/jira/browse/SPARK-29852 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: Peng Cheng >Priority: Major > Original Estimate: 0h > Remaining Estimate: 0h > > Both RDD and Dataset APIs have 2 methods of collecting data from executors to > the driver: > > # .collect() sets up multiple threads in a job and dumps all data from the executors > into the driver's memory. This is great if data on the driver needs to be accessible > ASAP, but not as efficient if access to partitions can only happen > sequentially, and outright risky if the driver doesn't have enough memory to hold > all the data. > - the solution for issue SPARK-25224 partially alleviates this by delaying > deserialisation of data in InternalRow format, such that only the much > smaller serialised data needs to be held entirely in driver memory. This > solution does not achieve O(1) memory consumption, and thus does not scale to > arbitrarily large datasets. > # .toLocalIterator() fetches one partition per job at a time, and fetching of > the next partition does not start until sequential access to the previous > partition has concluded. This achieves O(1) memory consumption and is > great if access to the data is sequential and significantly slower than the speed > at which partitions can be shipped from a single executor with one thread. It > becomes inefficient when the sequential access to the data has to wait > relatively long for the shipping of the next partition. > The proposed solution is a crossover between the two existing implementations: a > concurrent subroutine that is both CPU- and memory-bounded. 
The solution > allocates a fixed-size resource pool (by default = the number of available CPU > cores) that serves the shipping of partitions concurrently, and blocks > sequential access to a partition's data until its shipping is finished (which > usually happens without blocking for partitionID >= 2, due to the fact that > shipping starts much earlier and preemptively). Tenants of the resource pool > can be GC'ed and evicted once sequential access to their data has finished, > which allows more partitions to be fetched much earlier than they are > accessed. The maximum memory consumption is O(m * n), where m is the > predefined concurrency and n is the size of the largest partition. > The following Scala code snippet demonstrates a simple implementation: > > (requires Scala 2.11+ and ScalaTest) > > {code:java} > package org.apache.spark.spike > import java.util.concurrent.ArrayBlockingQueue > import org.apache.spark.rdd.RDD > import org.apache.spark.sql.SparkSession > import org.apache.spark.{FutureAction, SparkContext} > import org.scalatest.FunSpec > import scala.concurrent.Future > import scala.language.implicitConversions > import scala.reflect.ClassTag > import scala.util.{Failure, Success, Try} > class ToLocalIteratorPreemptivelySpike extends FunSpec { > import ToLocalIteratorPreemptivelySpike._ > lazy val sc: SparkContext = > SparkSession.builder().master("local[*]").getOrCreate().sparkContext > it("can be much faster than toLocalIterator") { > val max = 80 > val delay = 100 > val slowRDD = sc.parallelize(1 to max, 8).map { v => > Thread.sleep(delay) > v > } > val (r1, t1) = timed { > slowRDD.toLocalIterator.toList > } > val capacity = 4 > val (r2, t2) = timed { > slowRDD.toLocalIteratorPreemptively(capacity).toList > } > assert(r1 == r2) > println(s"linear: $t1, preemptive: $t2") > assert(t1 > t2 * 2) > assert(t2 > max * delay / capacity) > } > } > object ToLocalIteratorPreemptivelySpike { > case class PartitionExecution[T: ClassTag]( > @transient self: RDD[T], > id: 
Int > ) { > def eager: this.type = { > AsArray.future > this > } > case object AsArray { > @transient lazy val future: FutureAction[Array[T]] = { > var result: Array[T] = null > val future = self.context.submitJob[T, Array[T], Array[T]]( > self, > _.toArray, > Seq(id), { (_, data) => > result = data > }, > result > ) > future > } > @transient lazy val now: Array[T] = future.get() > } > } > implicit class RDDFunctions[T: ClassTag](self: RDD[T]) { > import scala.concurrent.ExecutionContext.Implicits.global > def _toLocalIteratorPreemptively(cap
[jira] [Created] (SPARK-29852) Implement parallel preemptive RDD.toLocalIterator and Dataset.toLocalIterator
Peng Cheng created SPARK-29852: -- Summary: Implement parallel preemptive RDD.toLocalIterator and Dataset.toLocalIterator Key: SPARK-29852 URL: https://issues.apache.org/jira/browse/SPARK-29852 Project: Spark Issue Type: New Feature Components: Spark Core, SQL Affects Versions: 2.4.4, 3.0.0 Reporter: Peng Cheng Both RDD and Dataset APIs have 2 methods of collecting data from executors to the driver: # .collect() sets up multiple threads in a job and dumps all data from the executors into the driver's memory. This is great if data on the driver needs to be accessible ASAP, but not as efficient if access to partitions can only happen sequentially, and outright risky if the driver doesn't have enough memory to hold all the data. - the solution for issue SPARK-25224 partially alleviates this by delaying deserialisation of data in InternalRow format, such that only the much smaller serialised data needs to be held entirely in driver memory. This solution does not achieve O(1) memory consumption, and thus does not scale to arbitrarily large datasets. # .toLocalIterator() fetches one partition per job at a time, and fetching of the next partition does not start until sequential access to the previous partition has concluded. This achieves O(1) memory consumption and is great if access to the data is sequential and significantly slower than the speed at which partitions can be shipped from a single executor with one thread. It becomes inefficient when the sequential access to the data has to wait relatively long for the shipping of the next partition. The proposed solution is a crossover between the two existing implementations: a concurrent subroutine that is both CPU- and memory-bounded. 
The solution allocates a fixed-size resource pool (by default = the number of available CPU cores) that serves the shipping of partitions concurrently, and blocks sequential access to a partition's data until its shipping is finished (which usually happens without blocking for partitionID >= 2, due to the fact that shipping starts much earlier and preemptively). Tenants of the resource pool can be GC'ed and evicted once sequential access to their data has finished, which allows more partitions to be fetched much earlier than they are accessed. The maximum memory consumption is O(m * n), where m is the predefined concurrency and n is the size of the largest partition. The following Scala code snippet demonstrates a simple implementation: (requires Scala 2.11+ and ScalaTest) {code:java} package org.apache.spark.spike import java.util.concurrent.ArrayBlockingQueue import org.apache.spark.rdd.RDD import org.apache.spark.sql.SparkSession import org.apache.spark.{FutureAction, SparkContext} import org.scalatest.FunSpec import scala.concurrent.Future import scala.language.implicitConversions import scala.reflect.ClassTag import scala.util.{Failure, Success, Try} class ToLocalIteratorPreemptivelySpike extends FunSpec { import ToLocalIteratorPreemptivelySpike._ lazy val sc: SparkContext = SparkSession.builder().master("local[*]").getOrCreate().sparkContext it("can be much faster than toLocalIterator") { val max = 80 val delay = 100 val slowRDD = sc.parallelize(1 to max, 8).map { v => Thread.sleep(delay) v } val (r1, t1) = timed { slowRDD.toLocalIterator.toList } val capacity = 4 val (r2, t2) = timed { slowRDD.toLocalIteratorPreemptively(capacity).toList } assert(r1 == r2) println(s"linear: $t1, preemptive: $t2") assert(t1 > t2 * 2) assert(t2 > max * delay / capacity) } } object ToLocalIteratorPreemptivelySpike { case class PartitionExecution[T: ClassTag]( @transient self: RDD[T], id: Int ) { def eager: this.type = { AsArray.future this } case object AsArray { @transient lazy val future: 
FutureAction[Array[T]] = { var result: Array[T] = null val future = self.context.submitJob[T, Array[T], Array[T]]( self, _.toArray, Seq(id), { (_, data) => result = data }, result ) future } @transient lazy val now: Array[T] = future.get() } } implicit class RDDFunctions[T: ClassTag](self: RDD[T]) { import scala.concurrent.ExecutionContext.Implicits.global def _toLocalIteratorPreemptively(capacity: Int): Iterator[Array[T]] = { val executions = self.partitions.indices.map { ii => PartitionExecution(self, ii) } val buffer = new ArrayBlockingQueue[Try[PartitionExecution[T]]](capacity) Future { executions.foreach { exe => buffer.put(Success(exe)) // may be blocking due to capacity exe.eager // non-blocking } }.onFailure { case e: Throwable => buffer.put(Failure(e)) } self
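The bounded-prefetch pattern proposed in SPARK-29852 can be sketched in plain Python, without Spark: a background thread submits partition fetches ahead of the consumer, a bounded queue caps the number of in-flight fetches at `capacity` (bounding memory to O(capacity * partition size)), and the consumer blocks only until its next partition is ready. The function and the `fetch_partition` callback are illustrative names, not Spark APIs.

```python
# Sketch of preemptive, capacity-bounded partition prefetching (assumed
# names; fetch_partition is any callable that ships one partition's data).
import threading
import queue
from concurrent.futures import ThreadPoolExecutor

def to_local_iterator_preemptively(partition_ids, fetch_partition, capacity=4):
    """Yield partitions in order while prefetching up to `capacity` ahead."""
    buffer = queue.Queue(maxsize=capacity)  # bounds outstanding fetches
    pool = ThreadPoolExecutor(max_workers=capacity)

    def producer():
        for pid in partition_ids:
            # put() blocks once `capacity` fetches are in flight;
            # this back-pressure is what keeps memory bounded.
            buffer.put(pool.submit(fetch_partition, pid))
        buffer.put(None)  # sentinel: no more partitions

    threading.Thread(target=producer, daemon=True).start()
    while (future := buffer.get()) is not None:
        yield future.result()  # blocks only until this partition is shipped
    pool.shutdown()
```

As in the Scala spike, fetching of partition k+1 through k+capacity proceeds while the caller is still iterating over partition k, so sequential access rarely waits after the first partition.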
[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971971#comment-16971971 ] zhao bo commented on SPARK-29106: - Er, Linaro seems to be in trouble. The VM cannot come online now. We will try to contact the maintainer ASAP. Sorry for that. > Add jenkins arm test for spark > -- > > Key: SPARK-29106 > URL: https://issues.apache.org/jira/browse/SPARK-29106 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: huangtianhua >Priority: Minor > Attachments: R-ansible.yml, R-libs.txt, arm-python36.txt > > > Add ARM test jobs to amplab jenkins for spark. > Till now we have made two periodic ARM test jobs for spark in OpenLab: one is > based on master with hadoop 2.7 (similar to the QA test of amplab jenkins), the > other is based on a new branch which we made on 09-09, see > [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64] > and > [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64.|http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64] > We only have to care about the first one when integrating the ARM test with amplab > jenkins. > About the k8s test on ARM, we have tested it, see > [https://github.com/theopenlab/spark/pull/17], maybe we can integrate it > later. > And we plan to test other stable branches too, and we can integrate them into > amplab when they are ready. > We have offered an ARM instance and sent the info to shane knapp; thanks > shane for adding the first ARM job to amplab jenkins :) > The other important thing is about leveldbjni > [https://github.com/fusesource/leveldbjni,|https://github.com/fusesource/leveldbjni/issues/80] > spark depends on leveldbjni-all-1.8 > [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8], > and we can see there is no arm64 support. 
So we built an arm64-supporting > release of leveldbjni, see > [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8], > but we can't modify the spark pom.xml directly with something like a > 'property'/'profile' to choose the correct jar package on the ARM or x86 platform, > because spark depends on some hadoop packages like hadoop-hdfs, and those packages > depend on leveldbjni-all-1.8 too, unless hadoop releases a new ARM-supporting > leveldbjni jar. For now we download the leveldbjni-all-1.8 of > openlabtesting and 'mvn install' it when ARM testing spark. > PS: The issues found and fixed: > SPARK-28770 > [https://github.com/apache/spark/pull/25673] > > SPARK-28519 > [https://github.com/apache/spark/pull/25279] > > SPARK-28433 > [https://github.com/apache/spark/pull/25186] > > SPARK-28467 > [https://github.com/apache/spark/pull/25864] > > SPARK-29286 > [https://github.com/apache/spark/pull/26021] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971968#comment-16971968 ] zhao bo commented on SPARK-29106: - Thanks very much, [~shaneknapp]. Apologies that the VM was down last night, but it's back now. Yeah, I also posted an issue [1] to get the status of ARM support in the apache/arrow community. There are very few resources to support it, even from the community and from us. So I think pyarrow support for ARM is difficult to finish for now. Also note: we got some powerful ARM resources which could replace the current test VM, and we have tested them and they are good to go. What do you think? ;) [1] https://issues.apache.org/jira/browse/ARROW-7042 > Add jenkins arm test for spark > -- > > Key: SPARK-29106 > URL: https://issues.apache.org/jira/browse/SPARK-29106 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: huangtianhua >Priority: Minor > Attachments: R-ansible.yml, R-libs.txt, arm-python36.txt > > > Add ARM test jobs to amplab jenkins for spark. > Till now we have made two periodic ARM test jobs for spark in OpenLab: one is > based on master with hadoop 2.7 (similar to the QA test of amplab jenkins), the > other is based on a new branch which we made on 09-09, see > [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64] > and > [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64.|http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64] > We only have to care about the first one when integrating the ARM test with amplab > jenkins. > About the k8s test on ARM, we have tested it, see > [https://github.com/theopenlab/spark/pull/17], maybe we can integrate it > later. > And we plan to test other stable branches too, and we can integrate them into > amplab when they are ready. 
> We have offered an ARM instance and sent the info to shane knapp; thanks > shane for adding the first ARM job to amplab jenkins :) > The other important thing is about leveldbjni > [https://github.com/fusesource/leveldbjni,|https://github.com/fusesource/leveldbjni/issues/80] > spark depends on leveldbjni-all-1.8 > [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8], > and we can see there is no arm64 support. So we built an arm64-supporting > release of leveldbjni, see > [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8], > but we can't modify the spark pom.xml directly with something like a > 'property'/'profile' to choose the correct jar package on the ARM or x86 platform, > because spark depends on some hadoop packages like hadoop-hdfs, and those packages > depend on leveldbjni-all-1.8 too, unless hadoop releases a new ARM-supporting > leveldbjni jar. For now we download the leveldbjni-all-1.8 of > openlabtesting and 'mvn install' it when ARM testing spark. > PS: The issues found and fixed: > SPARK-28770 > [https://github.com/apache/spark/pull/25673] > > SPARK-28519 > [https://github.com/apache/spark/pull/25279] > > SPARK-28433 > [https://github.com/apache/spark/pull/25186] > > SPARK-28467 > [https://github.com/apache/spark/pull/25864] > > SPARK-29286 > [https://github.com/apache/spark/pull/26021] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29851) DataSourceV2: Default behavior of dropping namespace is cascading
Terry Kim created SPARK-29851: - Summary: DataSourceV2: Default behavior of dropping namespace is cascading Key: SPARK-29851 URL: https://issues.apache.org/jira/browse/SPARK-29851 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Terry Kim Instead of introducing an additional 'cascade' option to dropNamespace(), the default behavior of dropping a namespace will be cascading. To implement non-cascading semantics, the Spark side then needs to ensure a namespace is empty before calling dropNamespace(). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29755) ClassCastException occurs when reading events from SHS
[ https://issues.apache.org/jira/browse/SPARK-29755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin reassigned SPARK-29755: -- Assignee: Jungtaek Lim > ClassCastException occurs when reading events from SHS > -- > > Key: SPARK-29755 > URL: https://issues.apache.org/jira/browse/SPARK-29755 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > > Looks like SPARK-28869 triggered a technical issue on jackson-scala: > https://github.com/FasterXML/jackson-module-scala/wiki/FAQ#deserializing-optionint-and-other-primitive-challenges > {noformat} > 19/11/05 17:59:23 INFO FsHistoryProvider: Leasing disk manager space for app > app-20191105152223- / None... > 19/11/05 17:59:23 INFO FsHistoryProvider: Parsing > /apps/spark/eventlogs/app-20191105152223- to re-build UI... > 19/11/05 17:59:24 ERROR FsHistoryProvider: Exception in checking for event > log updates > java.lang.ClassCastException: java.lang.Integer cannot be cast to > java.lang.Long > at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:107) > at > org.apache.spark.deploy.history.FsHistoryProvider.shouldReloadLog(FsHistoryProvider.scala:585) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$6(FsHistoryProvider.scala:458) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$6$adapted(FsHistoryProvider.scala:444) > at > scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:256) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at > scala.collection.TraversableLike.filterImpl(TraversableLike.scala:255) > at > scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:249) > at > 
scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108) > at scala.collection.TraversableLike.filter(TraversableLike.scala:347) > at scala.collection.TraversableLike.filter$(TraversableLike.scala:347) > at scala.collection.AbstractTraversable.filter(Traversable.scala:108) > at > org.apache.spark.deploy.history.FsHistoryProvider.checkForLogs(FsHistoryProvider.scala:444) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$startPolling$3(FsHistoryProvider.scala:267) > at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1302) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$getRunner$1(FsHistoryProvider.scala:190) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29755) ClassCastException occurs when reading events from SHS
[ https://issues.apache.org/jira/browse/SPARK-29755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-29755. Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26397 [https://github.com/apache/spark/pull/26397] > ClassCastException occurs when reading events from SHS > -- > > Key: SPARK-29755 > URL: https://issues.apache.org/jira/browse/SPARK-29755 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.0.0 > > > Looks like SPARK-28869 triggered a technical issue on jackson-scala: > https://github.com/FasterXML/jackson-module-scala/wiki/FAQ#deserializing-optionint-and-other-primitive-challenges > {noformat} > 19/11/05 17:59:23 INFO FsHistoryProvider: Leasing disk manager space for app > app-20191105152223- / None... > 19/11/05 17:59:23 INFO FsHistoryProvider: Parsing > /apps/spark/eventlogs/app-20191105152223- to re-build UI... 
> 19/11/05 17:59:24 ERROR FsHistoryProvider: Exception in checking for event > log updates > java.lang.ClassCastException: java.lang.Integer cannot be cast to > java.lang.Long > at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:107) > at > org.apache.spark.deploy.history.FsHistoryProvider.shouldReloadLog(FsHistoryProvider.scala:585) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$6(FsHistoryProvider.scala:458) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$6$adapted(FsHistoryProvider.scala:444) > at > scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:256) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at > scala.collection.TraversableLike.filterImpl(TraversableLike.scala:255) > at > scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:249) > at > scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108) > at scala.collection.TraversableLike.filter(TraversableLike.scala:347) > at scala.collection.TraversableLike.filter$(TraversableLike.scala:347) > at scala.collection.AbstractTraversable.filter(Traversable.scala:108) > at > org.apache.spark.deploy.history.FsHistoryProvider.checkForLogs(FsHistoryProvider.scala:444) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$startPolling$3(FsHistoryProvider.scala:267) > at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1302) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$getRunner$1(FsHistoryProvider.scala:190) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) 
> at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26154) Stream-stream joins - left outer join gives inconsistent output
[ https://issues.apache.org/jira/browse/SPARK-26154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-26154. Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26108 [https://github.com/apache/spark/pull/26108] > Stream-stream joins - left outer join gives inconsistent output > --- > > Key: SPARK-26154 > URL: https://issues.apache.org/jira/browse/SPARK-26154 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.2, 3.0.0 > Environment: Spark version - Spark 2.3.2 > OS- Suse 11 >Reporter: Haripriya >Assignee: Jungtaek Lim >Priority: Blocker > Labels: correctness > Fix For: 3.0.0 > > > Stream-stream joins using left outer join gives inconsistent output > The data processed once, is being processed again and gives null value. In > Batch 2, the input data "3" is processed. But again in batch 6, null value > is provided for same data > Steps > In spark-shell > {code:java} > scala> import org.apache.spark.sql.functions.{col, expr} > import org.apache.spark.sql.functions.{col, expr} > scala> import org.apache.spark.sql.streaming.Trigger > import org.apache.spark.sql.streaming.Trigger > scala> val lines_stream1 = spark.readStream. > | format("kafka"). > | option("kafka.bootstrap.servers", "ip:9092"). > | option("subscribe", "topic1"). > | option("includeTimestamp", true). > | load(). > | selectExpr("CAST (value AS String)","CAST(timestamp AS > TIMESTAMP)").as[(String,Timestamp)]. > | select(col("value") as("data"),col("timestamp") > as("recordTime")). > | select("data","recordTime"). > | withWatermark("recordTime", "5 seconds ") > lines_stream1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = > [data: string, recordTime: timestamp] > scala> val lines_stream2 = spark.readStream. > | format("kafka"). > | option("kafka.bootstrap.servers", "ip:9092"). > | option("subscribe", "topic2"). > | option("includeTimestamp", value = true). > | load(). 
> | selectExpr("CAST (value AS String)","CAST(timestamp AS > TIMESTAMP)").as[(String,Timestamp)]. > | select(col("value") as("data1"),col("timestamp") > as("recordTime1")). > | select("data1","recordTime1"). > | withWatermark("recordTime1", "10 seconds ") > lines_stream2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = > [data1: string, recordTime1: timestamp] > scala> val query = lines_stream1.join(lines_stream2, expr ( > | """ > | | data == data1 and > | | recordTime1 >= recordTime and > | | recordTime1 <= recordTime + interval 5 seconds > | """.stripMargin),"left"). > | writeStream. > | option("truncate","false"). > | outputMode("append"). > | format("console").option("checkpointLocation", > "/tmp/leftouter/"). > | trigger(Trigger.ProcessingTime ("5 seconds")). > | start() > query: org.apache.spark.sql.streaming.StreamingQuery = > org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@1a48f55b > {code} > Step2 : Start producing data > kafka-console-producer.sh --broker-list ip:9092 --topic topic1 > >1 > >2 > >3 > >4 > >5 > >aa > >bb > >cc > kafka-console-producer.sh --broker-list ip:9092 --topic topic2 > >2 > >2 > >3 > >4 > >5 > >aa > >cc > >ee > >ee > > Output obtained: > {code:java} > Batch: 0 > --- > ++--+-+---+ > |data|recordTime|data1|recordTime1| > ++--+-+---+ > ++--+-+---+ > --- > Batch: 1 > --- > ++--+-+---+ > |data|recordTime|data1|recordTime1| > ++--+-+---+ > ++--+-+---+ > --- > Batch: 2 > --- > ++---+-+---+ > |data|recordTime |data1|recordTime1| > ++---+-+---+ > |3 |2018-11-22 20:09:35.053|3|2018-11-22 20:09:36.506| > |2 |2018-11-22 20:09:31.613|2|2018-11-22 20:09:33.116| > ++---+-+---+ > --- > Batch: 3 > --- > ++---+-+---+ > |data|recordTime |data1|r
[jira] [Assigned] (SPARK-26154) Stream-stream joins - left outer join gives inconsistent output
[ https://issues.apache.org/jira/browse/SPARK-26154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin reassigned SPARK-26154: -- Assignee: Jungtaek Lim > Stream-stream joins - left outer join gives inconsistent output > --- > > Key: SPARK-26154 > URL: https://issues.apache.org/jira/browse/SPARK-26154 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.2, 3.0.0 > Environment: Spark version - Spark 2.3.2 > OS- Suse 11 >Reporter: Haripriya >Assignee: Jungtaek Lim >Priority: Blocker > Labels: correctness > > Stream-stream joins using left outer join gives inconsistent output > The data processed once, is being processed again and gives null value. In > Batch 2, the input data "3" is processed. But again in batch 6, null value > is provided for same data > Steps > In spark-shell > {code:java} > scala> import org.apache.spark.sql.functions.{col, expr} > import org.apache.spark.sql.functions.{col, expr} > scala> import org.apache.spark.sql.streaming.Trigger > import org.apache.spark.sql.streaming.Trigger > scala> val lines_stream1 = spark.readStream. > | format("kafka"). > | option("kafka.bootstrap.servers", "ip:9092"). > | option("subscribe", "topic1"). > | option("includeTimestamp", true). > | load(). > | selectExpr("CAST (value AS String)","CAST(timestamp AS > TIMESTAMP)").as[(String,Timestamp)]. > | select(col("value") as("data"),col("timestamp") > as("recordTime")). > | select("data","recordTime"). > | withWatermark("recordTime", "5 seconds ") > lines_stream1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = > [data: string, recordTime: timestamp] > scala> val lines_stream2 = spark.readStream. > | format("kafka"). > | option("kafka.bootstrap.servers", "ip:9092"). > | option("subscribe", "topic2"). > | option("includeTimestamp", value = true). > | load(). > | selectExpr("CAST (value AS String)","CAST(timestamp AS > TIMESTAMP)").as[(String,Timestamp)]. 
> | select(col("value") as("data1"),col("timestamp") > as("recordTime1")). > | select("data1","recordTime1"). > | withWatermark("recordTime1", "10 seconds ") > lines_stream2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = > [data1: string, recordTime1: timestamp] > scala> val query = lines_stream1.join(lines_stream2, expr ( > | """ > | | data == data1 and > | | recordTime1 >= recordTime and > | | recordTime1 <= recordTime + interval 5 seconds > | """.stripMargin),"left"). > | writeStream. > | option("truncate","false"). > | outputMode("append"). > | format("console").option("checkpointLocation", > "/tmp/leftouter/"). > | trigger(Trigger.ProcessingTime ("5 seconds")). > | start() > query: org.apache.spark.sql.streaming.StreamingQuery = > org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@1a48f55b > {code} > Step2 : Start producing data > kafka-console-producer.sh --broker-list ip:9092 --topic topic1 > >1 > >2 > >3 > >4 > >5 > >aa > >bb > >cc > kafka-console-producer.sh --broker-list ip:9092 --topic topic2 > >2 > >2 > >3 > >4 > >5 > >aa > >cc > >ee > >ee > > Output obtained: > {code:java} > Batch: 0 > --- > ++--+-+---+ > |data|recordTime|data1|recordTime1| > ++--+-+---+ > ++--+-+---+ > --- > Batch: 1 > --- > ++--+-+---+ > |data|recordTime|data1|recordTime1| > ++--+-+---+ > ++--+-+---+ > --- > Batch: 2 > --- > ++---+-+---+ > |data|recordTime |data1|recordTime1| > ++---+-+---+ > |3 |2018-11-22 20:09:35.053|3|2018-11-22 20:09:36.506| > |2 |2018-11-22 20:09:31.613|2|2018-11-22 20:09:33.116| > ++---+-+---+ > --- > Batch: 3 > --- > ++---+-+---+ > |data|recordTime |data1|recordTime1| > ++---+-+---+ > |4 |2018-11-22 20:09:38.654|4|2018-11
[jira] [Assigned] (SPARK-29766) Aggregate metrics asynchronously in SQL listener
[ https://issues.apache.org/jira/browse/SPARK-29766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29766: - Assignee: Marcelo Masiero Vanzin > Aggregate metrics asynchronously in SQL listener > > > Key: SPARK-29766 > URL: https://issues.apache.org/jira/browse/SPARK-29766 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: Marcelo Masiero Vanzin >Assignee: Marcelo Masiero Vanzin >Priority: Major > > This is a follow up to SPARK-29562. > That change made metrics collection faster, and also sped up metrics > aggregation. But it is still too slow to execute in an event handler, so we > should do it asynchronously to minimize events being dropped by the listener > bus. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29766) Aggregate metrics asynchronously in SQL listener
[ https://issues.apache.org/jira/browse/SPARK-29766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29766. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26405 [https://github.com/apache/spark/pull/26405] > Aggregate metrics asynchronously in SQL listener > > > Key: SPARK-29766 > URL: https://issues.apache.org/jira/browse/SPARK-29766 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: Marcelo Masiero Vanzin >Assignee: Marcelo Masiero Vanzin >Priority: Major > Fix For: 3.0.0 > > > This is a follow up to SPARK-29562. > That change made metrics collection faster, and also sped up metrics > aggregation. But it is still too slow to execute in an event handler, so we > should do it asynchronously to minimize events being dropped by the listener > bus. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
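The pattern described in SPARK-29766, moving expensive metric aggregation off the event-handler thread so the listener bus does not drop events, can be sketched in miniature: the handler only enqueues updates, and a background worker performs the aggregation. This is a hedged plain-Python illustration (the class and method names are invented for the example), not Spark's actual SQL listener code:

```python
# Sketch: the event handler must return quickly, so it only enqueues;
# a single background worker does the slow aggregation off the bus thread.
import queue
import threading

class AsyncMetricsAggregator:
    def __init__(self):
        self._queue = queue.Queue()
        self.totals = {}  # metric name -> aggregated value
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def on_event(self, metric_updates):
        """Called from the listener thread: O(1), no aggregation here."""
        self._queue.put(metric_updates)

    def _drain(self):
        while True:
            updates = self._queue.get()
            if updates is None:  # shutdown sentinel
                break
            for name, value in updates.items():  # the slow part
                self.totals[name] = self.totals.get(name, 0) + value

    def stop(self):
        """Enqueue the sentinel and wait for all queued updates to be applied."""
        self._queue.put(None)
        self._worker.join()

agg = AsyncMetricsAggregator()
agg.on_event({"rows": 10, "bytes": 512})
agg.on_event({"rows": 5})
agg.stop()
print(agg.totals)  # {'rows': 15, 'bytes': 512}
```

Because the queue is FIFO and a single worker drains it, joining after the sentinel guarantees every update has been applied, which is the property the listener fix relies on.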
[jira] [Resolved] (SPARK-29770) Allow setting spark.app.id when spark-submit for Spark on Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-29770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-29770. Resolution: Won't Fix See comments in PR. > Allow setting spark.app.id when spark-submit for Spark on Kubernetes > > > Key: SPARK-29770 > URL: https://issues.apache.org/jira/browse/SPARK-29770 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Liu Runzhong >Priority: Minor > Labels: easyfix > > when the user provides `spark.app.id` by `spark-submit`, it's actually > doing nothing to change the `spark.app.id`, as `spark.app.id` can only be set > by `kubernetesAppId` every time, which makes the users feel confused. > [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L196] > Knowing that `spark.app.id` would be labeled to Driver/Executor pods and > other resources and the strict limitation of the label values, but I think it > would be more flexible to users to decide how to generate the `spark.app.id` > by themselves. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29672) remove python2 tests and test infra
[ https://issues.apache.org/jira/browse/SPARK-29672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971868#comment-16971868 ] Shane Knapp commented on SPARK-29672: - the PR is complete, all tests pass and i'm waiting for the word to merge it in to master! > remove python2 tests and test infra > --- > > Key: SPARK-29672 > URL: https://issues.apache.org/jira/browse/SPARK-29672 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Assignee: Shane Knapp >Priority: Major > > python 2.7 is EOL jan 1st 2020: [https://github.com/python/devguide/pull/344] > it's time, at least for 3.0+ to remove python 2.7 test support and migrate > the test execution framework to python 3.6. > this PR ([https://github.com/apache/spark/pull/26330]) does all of the above. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29803) remove all instances of 'from __future__ import print_function'
[ https://issues.apache.org/jira/browse/SPARK-29803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971866#comment-16971866 ] Shane Knapp commented on SPARK-29803: - so after reading the python2 EOL roadmap/announcement it looks like (at some point in early 2020) that we WILL be dropping python2 support in 3.0+: https://spark.apache.org/news/plan-for-dropping-python-2-support.html so, i believe that this ticket should be dealt with as part of the release of spark 3.x without python2 support (and spark 2.4 will not be touched). unless, of course, i'm misunderstanding something... which could be entirely true. :) > remove all instances of 'from __future__ import print_function' > > > Key: SPARK-29803 > URL: https://issues.apache.org/jira/browse/SPARK-29803 > Project: Spark > Issue Type: Sub-task > Components: Build, PySpark, Tests >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Priority: Major > Attachments: print_function_list.txt > > > there are 135 python files in the spark repo that need to have `from > __future__ import print_function` removed (see attached file > 'print_function_list.txt'). > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29850) sort-merge-join an empty table should not memory leak
[ https://issues.apache.org/jira/browse/SPARK-29850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29850: -- Affects Version/s: 2.3.4 > sort-merge-join an empty table should not memory leak > - > > Key: SPARK-29850 > URL: https://issues.apache.org/jira/browse/SPARK-29850 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.4, 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29850) sort-merge-join an empty table should not memory leak
[ https://issues.apache.org/jira/browse/SPARK-29850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29850: -- Affects Version/s: 2.4.4 > sort-merge-join an empty table should not memory leak > - > > Key: SPARK-29850 > URL: https://issues.apache.org/jira/browse/SPARK-29850 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27189) Add Executor metrics and memory usage instrumentation to the metrics system
[ https://issues.apache.org/jira/browse/SPARK-27189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luca Canali updated SPARK-27189: Summary: Add Executor metrics and memory usage instrumentation to the metrics system (was: Add Executor level memory usage metrics to the metrics system) > Add Executor metrics and memory usage instrumentation to the metrics system > --- > > Key: SPARK-27189 > URL: https://issues.apache.org/jira/browse/SPARK-27189 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Luca Canali >Priority: Minor > Attachments: Example_dashboard_Spark_Memory_Metrics.PNG > > > This proposes to add instrumentation of memory usage via the Spark > Dropwizard/Codahale metrics system. Memory usage metrics are available via > the Executor metrics, recently implemented as detailed in > https://issues.apache.org/jira/browse/SPARK-23206. > Making memory usage metrics available via the Spark Dropwizard metrics system > allows improving Spark performance dashboards and studying memory usage, as in > the attached example graph. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29805) Enable nested schema pruning and pruning on expressions by default
[ https://issues.apache.org/jira/browse/SPARK-29805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-29805. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26443 [https://github.com/apache/spark/pull/26443] > Enable nested schema pruning and pruning on expressions by default > -- > > Key: SPARK-29805 > URL: https://issues.apache.org/jira/browse/SPARK-29805 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.4 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29801) ML models unify toString method
[ https://issues.apache.org/jira/browse/SPARK-29801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29801. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26439 [https://github.com/apache/spark/pull/26439] > ML models unify toString method > --- > > Key: SPARK-29801 > URL: https://issues.apache.org/jira/browse/SPARK-29801 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > Fix For: 3.0.0 > > > ML models should extend the \{{toString}} method to expose basic information. > Currently some algorithms (GBT/RF/LoR) have done this, while others have not yet. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29801) ML models unify toString method
[ https://issues.apache.org/jira/browse/SPARK-29801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29801: - Assignee: zhengruifeng > ML models unify toString method > --- > > Key: SPARK-29801 > URL: https://issues.apache.org/jira/browse/SPARK-29801 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > > ML models should extend the \{{toString}} method to expose basic information. > Currently some algorithms (GBT/RF/LoR) have done this, while others have not yet. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29762) GPU Scheduling - default task resource amount to 1
[ https://issues.apache.org/jira/browse/SPARK-29762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971798#comment-16971798 ] Thomas Graves commented on SPARK-29762: --- this is actually more complex than you might think because the resource configs are just configs. So if you have spark.executor.resource.gpu.amount, for instance, the corresponding task config would be spark.task.resource.gpu.amount, where gpu could be any resource. The way the code is written now, it just grabs all the resources and iterates over them in various places, assuming you have specified a task requirement for each executor resource. If you remove that assumption you now have to be careful about what you are iterating over; really you have to use the resources from the executor configs, not the task configs. But you still have to read the task configs, and if a resource isn't there, default it to 1. > GPU Scheduling - default task resource amount to 1 > -- > > Key: SPARK-29762 > URL: https://issues.apache.org/jira/browse/SPARK-29762 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > Default the task-level resource configs (for gpu/fpga, etc.) to 1. So if the > user specifies the executor resource then, to make it more user friendly, let's > have the task resource config default to 1. This is OK right now since we > require resources to have an address. It also matches what we do for the > spark.task.cpus configs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
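The defaulting the comment above argues for, iterating over the executor resource configs and falling back to 1 when the matching task config is absent, can be sketched as follows. This is a hedged plain-Python illustration of the config lookup, with an invented function name, not Spark's actual internals:

```python
# Sketch: derive per-task resource amounts from a flat config map.
# Iterate over the *executor* resource configs (the source of truth for
# which resources exist), and read the matching task config with a
# default of 1, mirroring the spark.task.cpus semantics.
def task_resource_amounts(conf):
    EXEC_PREFIX = "spark.executor.resource."
    TASK_PREFIX = "spark.task.resource."
    SUFFIX = ".amount"
    amounts = {}
    for key in conf:
        if key.startswith(EXEC_PREFIX) and key.endswith(SUFFIX):
            resource = key[len(EXEC_PREFIX):-len(SUFFIX)]  # e.g. "gpu"
            task_key = f"{TASK_PREFIX}{resource}{SUFFIX}"
            amounts[resource] = int(conf.get(task_key, 1))  # default to 1
    return amounts

conf = {
    "spark.executor.resource.gpu.amount": "2",   # executor gpus set...
    "spark.executor.resource.fpga.amount": "1",
    "spark.task.resource.fpga.amount": "1",
    # ...but no spark.task.resource.gpu.amount: should default to 1
}
print(task_resource_amounts(conf))  # {'gpu': 1, 'fpga': 1}
```

Keying the iteration on executor configs (not task configs) is exactly the care the comment calls for: a task config with no matching executor config should not conjure up a resource.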
[jira] [Assigned] (SPARK-29825) Add join conditions in join-related tests of SQLQueryTestSuite
[ https://issues.apache.org/jira/browse/SPARK-29825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29825: - Assignee: Takeshi Yamamuro > Add join conditions in join-related tests of SQLQueryTestSuite > -- > > Key: SPARK-29825 > URL: https://issues.apache.org/jira/browse/SPARK-29825 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29825) Add join conditions in join-related tests of SQLQueryTestSuite
[ https://issues.apache.org/jira/browse/SPARK-29825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29825. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26459 [https://github.com/apache/spark/pull/26459] > Add join conditions in join-related tests of SQLQueryTestSuite > -- > > Key: SPARK-29825 > URL: https://issues.apache.org/jira/browse/SPARK-29825 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29808) StopWordsRemover should support multi-cols
[ https://issues.apache.org/jira/browse/SPARK-29808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971757#comment-16971757 ] Huaxin Gao commented on SPARK-29808: I will work on this. [~podongfeng] > StopWordsRemover should support multi-cols > -- > > Key: SPARK-29808 > URL: https://issues.apache.org/jira/browse/SPARK-29808 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Minor > > As a basic Transformer, StopWordsRemover should support multi-cols. > Param {color:#93a6f5}stopWords{color} can be applied across all columns. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
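The requested behavior, one shared stopWords list applied across several token columns, can be illustrated without Spark at all. This is a hedged plain-Python sketch (the function name and the `_filtered` output suffix are invented), not the ml.feature.StopWordsRemover API:

```python
# Sketch: apply one stop-word set to multiple token columns of a row,
# producing one filtered output column per input column.
STOP_WORDS = {"a", "the", "is"}

def remove_stop_words(row, input_cols, stop_words=STOP_WORDS):
    """Return a copy of the row with an '<col>_filtered' list per input column."""
    out = dict(row)
    for col in input_cols:
        out[col + "_filtered"] = [t for t in row[col] if t.lower() not in stop_words]
    return out

row = {"title": ["the", "Spark", "guide"], "body": ["spark", "is", "fast"]}
result = remove_stop_words(row, ["title", "body"])
print(result["title_filtered"], result["body_filtered"])
# ['Spark', 'guide'] ['spark', 'fast']
```

The real Spark change would presumably expose this through inputCols/outputCols params on the Transformer, matching other multi-column estimators.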
[jira] [Commented] (SPARK-29776) rpad returning invalid value when parameter is empty
[ https://issues.apache.org/jira/browse/SPARK-29776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971739#comment-16971739 ] Hyukjin Kwon commented on SPARK-29776: -- [~Ankitraj] have you made some progresses on this? > rpad returning invalid value when parameter is empty > > > Key: SPARK-29776 > URL: https://issues.apache.org/jira/browse/SPARK-29776 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > As per rpad definition > rpad > rpad(str, len, pad) - Returns str, right-padded with pad to a length of len > If str is longer than len, the return value is shortened to len characters. > *In case of empty pad string, the return value is null.* > Below is Example > In Spark: > {code} > 0: jdbc:hive2://10.18.19.208:23040/default> SELECT rpad('hi', 5, ''); > ++ > | rpad(hi, 5, ) | > ++ > | hi | > ++ > {code} > It should return NULL as per definition. > > Hive behavior is correct as per definition it returns NULL when pad is empty > String > INFO : Concurrency mode is disabled, not creating a lock manager > {code} > +---+ > | _c0 | > +---+ > | NULL | > +---+ > {code} > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
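The semantics the report argues for can be written as a small reference function. This is a hedged plain-Python sketch of the *desired* (Hive-matching) behavior, not Spark's implementation; the point of contention is the empty-pad branch:

```python
# Reference sketch of rpad(str, len, pad): right-pad to length n,
# truncate when the input is longer, and return None (SQL NULL)
# when the pad string is empty, matching the documented/Hive behavior.
def rpad(s, n, pad):
    if s is None or pad is None:
        return None
    if len(s) >= n:
        return s[:n]        # shorten to n characters
    if pad == "":
        return None         # the disputed case: empty pad -> NULL
    padded = s
    while len(padded) < n:
        padded += pad
    return padded[:n]

print(rpad("hi", 5, "ab"))     # 'hiaba'
print(rpad("hi", 5, ""))       # None (Spark currently returns 'hi' instead)
print(rpad("hello!", 3, "x"))  # 'hel'
```

Per the report, Spark's `SELECT rpad('hi', 5, '')` returns `hi` where this reference (and Hive) returns NULL.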
[jira] [Commented] (SPARK-29773) Unable to process empty ORC files in Hive Table using Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-29773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971740#comment-16971740 ] Hyukjin Kwon commented on SPARK-29773: -- ping [~aermakov], have you tried this out? > Unable to process empty ORC files in Hive Table using Spark SQL > --- > > Key: SPARK-29773 > URL: https://issues.apache.org/jira/browse/SPARK-29773 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 > Environment: Centos 7, Spark 2.3.1, Hive 2.3.0 >Reporter: Alexander Ermakov >Priority: Major > > Unable to process empty ORC files in Hive Table using Spark SQL. It seems > that there is a problem with the class > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits() > Stack trace: > {code:java} > 19/10/30 22:29:54 ERROR SparkSQLDriver: Failed in [select distinct > _tech_load_dt from dl_raw.tpaccsieee_ut_data_address] > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: > Exchange hashpartitioning(_tech_load_dt#1374, 200) > +- *(1) HashAggregate(keys=[_tech_load_dt#1374], functions=[], > output=[_tech_load_dt#1374]) >+- HiveTableScan [_tech_load_dt#1374], HiveTableRelation > `dl_raw`.`tpaccsieee_ut_data_address`, > org.apache.hadoop.hive.ql.io.orc.OrcSerde, [address#1307, address_9zp#1308, > address_adm#1309, address_md#1310, adress_doc#1311, building#1312, > change_date_addr_el#1313, change_date_okato#1314, change_date_окато#1315, > city#1316, city_id#1317, cnv_cont_id#1318, code_intercity#1319, > code_kladr#1320, code_plan1#1321, date_act#1322, date_change#1323, > date_prz_incorrect_code_kladr#1324, date_record#1325, district#1326, > district_id#1327, etaj#1328, e_plan#1329, fax#1330, ... 
44 more fields] > at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:119) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:371) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:150) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:324) > at > org.apache.spark.sql.execution.QueryExecution.hiveResultString(QueryExecution.scala:122) > at > 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver$$anonfun$run$1.apply(SparkSQLDriver.scala:64) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver$$anonfun$run$1.apply(SparkSQLDriver.scala:64) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:364) > at > org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:272) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeM
[jira] [Commented] (SPARK-29792) SQL metrics cannot be updated to subqueries in AQE
[ https://issues.apache.org/jira/browse/SPARK-29792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971738#comment-16971738 ] Hyukjin Kwon commented on SPARK-29792: -- What does this JIRA mean? Can you fill the description > SQL metrics cannot be updated to subqueries in AQE > -- > > Key: SPARK-29792 > URL: https://issues.apache.org/jira/browse/SPARK-29792 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wei Xue >Assignee: Ke Jia >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29797) Read key-value metadata in Parquet files written by Apache Arrow
[ https://issues.apache.org/jira/browse/SPARK-29797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971737#comment-16971737 ] Hyukjin Kwon commented on SPARK-29797: -- ping [~isaacm], do you have any suggestions about how we should store the metadata? Otherwise, let's leave this closed. > Read key-value metadata in Parquet files written by Apache Arrow > > > Key: SPARK-29797 > URL: https://issues.apache.org/jira/browse/SPARK-29797 > Project: Spark > Issue Type: New Feature > Components: Java API, PySpark >Affects Versions: 2.4.4 > Environment: Apache Arrow 0.14.1 built on Windows x86. > >Reporter: Isaac Myers >Priority: Major > Labels: features > Attachments: minimal_working_example.cpp > > > Key-value (user) metadata written to a Parquet file from Apache Arrow C++ is > not readable in Spark (PySpark or Java API). I can only find field-level > metadata dictionaries in the schema and no other functions in the API that > indicate the presence of file-level key-value metadata. The attached code > demonstrates creation and retrieval of file-level metadata using the Apache > Arrow API. 
> {code:java} > #include #include #include #include #include > #include #include > #include #include > //#include > int main(int argc, char* argv[]){ /* Create > Parquet File **/ arrow::Status st; > arrow::MemoryPool* pool = arrow::default_memory_pool(); > // Create Schema and fields with metadata > std::vector> fields; > std::unordered_map a_keyval; a_keyval["unit"] = > "sec"; a_keyval["note"] = "not the standard millisecond unit"; > arrow::KeyValueMetadata a_md(a_keyval); std::shared_ptr a_field > = arrow::field("a", arrow::int16(), false, a_md.Copy()); > fields.push_back(a_field); > std::unordered_map b_keyval; b_keyval["unit"] = > "ft"; arrow::KeyValueMetadata b_md(b_keyval); std::shared_ptr > b_field = arrow::field("b", arrow::int16(), false, b_md.Copy()); > fields.push_back(b_field); > std::shared_ptr schema = arrow::schema(fields); > // Add metadata to schema. std::unordered_map > schema_keyval; schema_keyval["classification"] = "Type 0"; > arrow::KeyValueMetadata schema_md(schema_keyval); schema = > schema->AddMetadata(schema_md.Copy()); > // Build arrays of data and add to Table. 
const int64_t rowgroup_size = 100; > std::vector a_data(rowgroup_size, 0); std::vector > b_data(rowgroup_size, 0); > for (int16_t i = 0; i < rowgroup_size; i++) { a_data[i] = i; b_data[i] = > rowgroup_size - i; } arrow::Int16Builder a_bldr(pool); arrow::Int16Builder > b_bldr(pool); st = a_bldr.Resize(rowgroup_size); if (!st.ok()) return 1; st = > b_bldr.Resize(rowgroup_size); if (!st.ok()) return 1; > st = a_bldr.AppendValues(a_data); if (!st.ok()) return 1; > st = b_bldr.AppendValues(b_data); if (!st.ok()) return 1; > std::shared_ptr a_arr_ptr; std::shared_ptr > b_arr_ptr; > arrow::ArrayVector arr_vec; st = a_bldr.Finish(&a_arr_ptr); if (!st.ok()) > return 1; arr_vec.push_back(a_arr_ptr); st = b_bldr.Finish(&b_arr_ptr); if > (!st.ok()) return 1; arr_vec.push_back(b_arr_ptr); > std::shared_ptr table = arrow::Table::Make(schema, arr_vec); > // Test metadata printf("\nMetadata from original schema:\n"); > printf("%s\n", schema->metadata()->ToString().c_str()); printf("%s\n", > schema->field(0)->metadata()->ToString().c_str()); printf("%s\n", > schema->field(1)->metadata()->ToString().c_str()); > std::shared_ptr table_schema = table->schema(); > printf("\nMetadata from schema retrieved from table (should be the > same):\n"); printf("%s\n", table_schema->metadata()->ToString().c_str()); > printf("%s\n", table_schema->field(0)->metadata()->ToString().c_str()); > printf("%s\n", table_schema->field(1)->metadata()->ToString().c_str()); > // Open file and write table. std::string file_name = "test.parquet"; > std::shared_ptr ostream; st = > arrow::io::FileOutputStream::Open(file_name, &ostream); if (!st.ok()) return > 1; > std::unique_ptr writer; > std::shared_ptr props = > parquet::default_writer_properties(); st = > parquet::arrow::FileWriter::Open(*schema, pool, ostream, props, &writer); if > (!st.ok()) return 1; st = writer->WriteTable(*table, rowgroup_size); if > (!st.ok()) return 1; > // Close file and stream. 
st = writer->Close(); if (!st.ok()) return 1; st = > ostream->Close(); if (!st.ok()) return 1; > /* Read Parquet File > **/ > // Create new memory pool. Not sure if this is necessary. > //arrow::MemoryPool* pool2 = arrow::default_memory_pool(); > // Open file reader. std::shared_ptr input_file; st > = arrow::io::ReadableFile::Open(file_name, pool, &input_file); if (!st.ok()) > re
[jira] [Commented] (SPARK-29673) upgrade jenkins pypy to PyPy3.6 v7.2.0
[ https://issues.apache.org/jira/browse/SPARK-29673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971736#comment-16971736 ] Hyukjin Kwon commented on SPARK-29673: -- Thanks [~shaneknapp]. I will make a try soon in probably a couple of weeks. > upgrade jenkins pypy to PyPy3.6 v7.2.0 > -- > > Key: SPARK-29673 > URL: https://issues.apache.org/jira/browse/SPARK-29673 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Assignee: Shane Knapp >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29803) remove all instances of 'from __future__ import print_function'
[ https://issues.apache.org/jira/browse/SPARK-29803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971735#comment-16971735 ] Hyukjin Kwon commented on SPARK-29803: -- {quote} as spark 3.0+ technically does NOT support python versions earlier than 3.5 {quote} Hey, [~shaneknapp], just to make sure we're on the same page, I think Spark 3.0 will still support Python 2.7, 3.4 and 3.5 although they are deprecated. > remove all instances of 'from __future__ import print_function' > > > Key: SPARK-29803 > URL: https://issues.apache.org/jira/browse/SPARK-29803 > Project: Spark > Issue Type: Sub-task > Components: Build, PySpark, Tests >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Priority: Major > Attachments: print_function_list.txt > > > there are 135 python files in the spark repo that need to have `from > __future__ import print_function` removed (see attached file > 'print_function_list.txt'). > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
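The per-file transformation this subtask describes, removing the now-redundant `from __future__ import print_function` line, is mechanical. A hedged sketch of one way to do it (the actual repo cleanup was presumably done differently, and this simple pattern would not handle combined imports like `from __future__ import print_function, division`):

```python
# Sketch: strip the bare `from __future__ import print_function` line
# (including its trailing newline) from Python source text.
import re

FUTURE_RE = re.compile(r"^from __future__ import print_function\n?", re.MULTILINE)

def strip_print_function_import(source: str) -> str:
    return FUTURE_RE.sub("", source)

src = "from __future__ import print_function\nprint('hello')\n"
result = strip_print_function_import(src)
print(result)  # prints: print('hello')
```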
[jira] [Commented] (SPARK-29804) Spark-shell is failing on YARN mode
[ https://issues.apache.org/jira/browse/SPARK-29804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971731#comment-16971731 ] Hyukjin Kwon commented on SPARK-29804: -- Please don't set target version and Critical+ which are usually reserved for committers. Also, please don't set fix version which is usually set when it's actually fixed. Lastly, please just don't copy and paste error messages. If this is an issue, please provide full details with a reproducer. If this is a question, it should go to mailing list - https://spark.apache.org/community.html > Spark-shell is failing on YARN mode > --- > > Key: SPARK-29804 > URL: https://issues.apache.org/jira/browse/SPARK-29804 > Project: Spark > Issue Type: Question > Components: YARN >Affects Versions: 2.4.4 > Environment: Spark2.4.4, Apache Hadoop 3.1.2 >Reporter: Srujan A >Priority: Major > > I am trying to run the spark-shell on YARN mode from containers and it's > failing on below reason. Please help me out. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29804) Spark-shell is failing on YARN mode
[ https://issues.apache.org/jira/browse/SPARK-29804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29804. -- Fix Version/s: (was: 2.4.4) Resolution: Invalid > Spark-shell is failing on YARN mode > --- > > Key: SPARK-29804 > URL: https://issues.apache.org/jira/browse/SPARK-29804 > Project: Spark > Issue Type: Question > Components: YARN >Affects Versions: 2.4.4 > Environment: Spark2.4.4, Apache Hadoop 3.1.2 >Reporter: Srujan A >Priority: Major > > I am trying to run the spark-shell on YARN mode from containers and it's > failing on below reason. Please help me out. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29804) Spark-shell is failing on YARN mode
[ https://issues.apache.org/jira/browse/SPARK-29804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29804: - Target Version/s: (was: 2.4.4) > Spark-shell is failing on YARN mode > --- > > Key: SPARK-29804 > URL: https://issues.apache.org/jira/browse/SPARK-29804 > Project: Spark > Issue Type: Question > Components: YARN >Affects Versions: 2.4.4 > Environment: Spark2.4.4, Apache Hadoop 3.1.2 >Reporter: Srujan A >Priority: Blocker > Fix For: 2.4.4 > > > I am trying to run the spark-shell on YARN mode from containers and it's > failing on below reason. Please help me out. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29804) Spark-shell is failing on YARN mode
[ https://issues.apache.org/jira/browse/SPARK-29804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29804: - Priority: Major (was: Blocker) > Spark-shell is failing on YARN mode > --- > > Key: SPARK-29804 > URL: https://issues.apache.org/jira/browse/SPARK-29804 > Project: Spark > Issue Type: Question > Components: YARN >Affects Versions: 2.4.4 > Environment: Spark2.4.4, Apache Hadoop 3.1.2 >Reporter: Srujan A >Priority: Major > Fix For: 2.4.4 > > > I am trying to run the spark-shell on YARN mode from containers and it's > failing on below reason. Please help me out. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29806) Using multiline option for a JSON file which is not multiline results in silent truncation of data.
[ https://issues.apache.org/jira/browse/SPARK-29806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971730#comment-16971730 ] Hyukjin Kwon commented on SPARK-29806: -- {{multiline}} in the JSON source currently only supports a single JSON object or a JSON array. > Using multiline option for a JSON file which is not multiline results in > silent truncation of data. > --- > > Key: SPARK-29806 > URL: https://issues.apache.org/jira/browse/SPARK-29806 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: Dilip Biswal >Priority: Major > > The content of the input JSON file: > {code:java} > {"name":"John", "id":"100"} > {"name":"Marry","id":"200"}{code} > The above is a valid JSON file, but every record is on a single line. Trying > to read this file > with the multiline option in FAILFAST mode results in data truncation > without any error. > {code:java} > scala> spark.read.option("multiLine", true).option("mode", > "FAILFAST").format("json").load("/tmp/json").show(false) > +---+----+ > |id |name| > +---+----+ > |100|John| > +---+----+ > scala> spark.read.option("mode", > "FAILFAST").format("json").load("/tmp/json").show(false) > +---+-----+ > |id |name | > +---+-----+ > |100|John | > |200|Marry| > +---+-----+{code} > I think Spark should return an error in this case, especially in FAILFAST > mode. This is a common user error, and we should not silently truncate data. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
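The truncation above can be reproduced outside of Spark: a single-document JSON parser (which is effectively what {{multiLine}} mode applies) consumes the first JSON value and silently ignores the rest of the input, while a line-delimited parse sees every record. A minimal sketch using only the Python standard library:

```python
import json

# Two newline-delimited JSON records, as in the report.
data = '{"name":"John", "id":"100"}\n{"name":"Marry","id":"200"}'

# A single-document parse (what multiLine mode effectively does) stops
# after the first JSON value; the remainder is silently left unread.
first, end = json.JSONDecoder().raw_decode(data)
leftover = data[end:].strip()

# A line-delimited parse (Spark's default JSON mode) sees both records.
records = [json.loads(line) for line in data.splitlines()]

print(first)          # only the first record
print(len(records))   # 2
```

This is why surfacing an error (at least under FAILFAST) is preferable to the current behavior: the parser knows there is unconsumed input left over.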
[jira] [Commented] (SPARK-29830) PySpark.context.Sparkcontext.binaryfiles improved memory with buffer
[ https://issues.apache.org/jira/browse/SPARK-29830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971727#comment-16971727 ] Hyukjin Kwon commented on SPARK-29830: -- (Please avoid setting the target version; that field is usually reserved for committers.) > PySpark.context.Sparkcontext.binaryfiles improved memory with buffer > > > Key: SPARK-29830 > URL: https://issues.apache.org/jira/browse/SPARK-29830 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.4 >Reporter: Jörn Franke >Priority: Major > > At the moment, PySpark reads binary files directly into a byte array. This > means it reads the full binary file into memory immediately, which 1) is > memory-inefficient and 2) differs from the Scala implementation (see the PySpark source at > https://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/context.html#SparkContext.binaryFiles). > In Scala, Spark returns a PortableDataStream, which means the application > does not need to read the full content of the stream into memory to work on it > (see > https://spark.apache.org/docs/2.4.0/api/scala/index.html#org.apache.spark.SparkContext). > Hence, it is proposed to adapt the PySpark implementation to return something > similar to a PortableDataStream in Scala (e.g. > [BytesIO|https://docs.python.org/3/library/io.html#io.BytesIO]). > Reading binary files in an efficient manner is crucial for many IoT > applications, but potentially also for other fields (e.g. disk image analysis in > forensics). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
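The shape of the proposal can be sketched in plain Python. The {{PortableStream}} class and its {{open()}} method below are hypothetical names, not part of any Spark API: the point is that the wrapper hands back a buffered stream on demand, so the caller processes fixed-size chunks instead of one large byte array.

```python
import os
import tempfile

# Hypothetical PortableDataStream-like wrapper: rather than materializing
# the whole file as bytes (the current binaryFiles behavior in PySpark),
# expose a factory that opens a buffered binary stream on demand.
class PortableStream:
    def __init__(self, path):
        self.path = path

    def open(self, buffer_size=64 * 1024):
        return open(self.path, "rb", buffering=buffer_size)

# Demo: write a 1000-byte file, then consume it in 256-byte chunks so that
# at most one chunk is resident in memory at a time.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 1000)
    path = f.name

stream = PortableStream(path)
total = 0
with stream.open() as fh:
    while True:
        chunk = fh.read(256)   # process fixed-size chunks, not the whole file
        if not chunk:
            break
        total += len(chunk)
os.remove(path)

print(total)  # 1000
```

Whether the real fix wraps py4j's view of {{PortableDataStream}} or an {{io.BytesIO}}-backed buffer is an open design question in the ticket; the sketch only illustrates the chunked-consumption contract being asked for.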
[jira] [Updated] (SPARK-29830) PySpark.context.Sparkcontext.binaryfiles improved memory with buffer
[ https://issues.apache.org/jira/browse/SPARK-29830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29830: - Target Version/s: (was: 3.0.0) > PySpark.context.Sparkcontext.binaryfiles improved memory with buffer > > > Key: SPARK-29830 > URL: https://issues.apache.org/jira/browse/SPARK-29830 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.4 >Reporter: Jörn Franke >Priority: Major > > At the moment, Pyspark reads binary files into a byte array directly. This > means it reads the full binary file immediately into memory, which is 1) > memory in-efficient 2) differs from the Scala implementation (see pyspark > here: > [https://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/context.html#SparkContext.binaryFiles). > > |https://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/context.html#SparkContext.binaryFiles] > In Scala, Spark returns a PortableDataStream, which means the application > does not need to read the full content of the stream in memory to work on it > (see > [https://spark.apache.org/docs/2.4.0/api/scala/index.html#org.apache.spark.SparkContext).] > > Hence, it is proposed to adapt the Pyspark implementation to return something > similar to a PortableDataStream in Scala (e.g. > [BytesIO|[https://docs.python.org/3/library/io.html#io.BytesIO].] > > Reading binary files in an efficient manner is crucial for many IoT > applications, but potentially also other fields (e.g. disk image analysis in > forensics). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22340) pyspark setJobGroup doesn't match java threads
[ https://issues.apache.org/jira/browse/SPARK-22340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971723#comment-16971723 ] Ruslan Dautkhanov commented on SPARK-22340: --- Glad to see this is solved. A nice side-effect should be somewhat better performance on some cases involving heavy python-java communication on multi-numa/ multi-socket configurations. With static threads, Linux kernel will actually have a chance to schedule threads on processors/cores that are more local to data's numa placement. > pyspark setJobGroup doesn't match java threads > -- > > Key: SPARK-22340 > URL: https://issues.apache.org/jira/browse/SPARK-22340 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.2 >Reporter: Leif Mortenson >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > > > With pyspark, {{sc.setJobGroup}}'s documentation says > {quote} > Assigns a group ID to all the jobs started by this thread until the group ID > is set to a different value or cleared. > {quote} > However, this doesn't appear to be associated with Python threads, only with > Java threads. As such, a Python thread which calls this and then submits > multiple jobs doesn't necessarily get its jobs associated with any particular > spark job group. For example: > {code} > def run_jobs(): > sc.setJobGroup('hello', 'hello jobs') > x = sc.range(100).sum() > y = sc.range(1000).sum() > return x, y > import concurrent.futures > with concurrent.futures.ThreadPoolExecutor() as executor: > future = executor.submit(run_jobs) > sc.cancelJobGroup('hello') > future.result() > {code} > In this example, depending how the action calls on the Python side are > allocated to Java threads, the jobs for {{x}} and {{y}} won't necessarily be > assigned the job group {{hello}}. > First, we should clarify the docs if this truly is the case. 
> Second, it would be really helpful if we could make the job group assignment > reliable for a Python thread, though I’m not sure the best way to do this. > As it stands, job groups are pretty useless from the pyspark side, if we > can't rely on this fact. > My only idea so far is to mimic the TLS behavior on the Python side and then > patch every point where job submission may take place to pass that in, but > this feels pretty brittle. In my experience with py4j, controlling threading > there is a challenge. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
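The TLS idea floated at the end of the report can be sketched without Spark at all. This is a hedged illustration, not PySpark code: {{set_job_group}} and {{current_job_group}} are hypothetical helpers showing how a job group kept in Python thread-local storage stays correctly scoped per Python thread, which is exactly the property the JVM-thread-based implementation fails to give.

```python
import threading

# Keep the job group in Python thread-local storage and read it at
# submission time, so each Python thread sees only its own group,
# regardless of which JVM thread would serve the call.
_local = threading.local()

def set_job_group(group_id):
    _local.group = group_id

def current_job_group():
    return getattr(_local, "group", None)

results = {}

def worker(name, group):
    set_job_group(group)
    # ...jobs submitted here would read current_job_group() and tag
    # themselves with this thread's group...
    results[name] = current_job_group()

threads = [
    threading.Thread(target=worker, args=(f"t{i}", f"group-{i}"))
    for i in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)  # each thread observed only its own group
```

The brittle part the reporter anticipates is the second half: every job-submission path would need to read this thread-local value and pass it through to the JVM, which is what the eventual fix had to wire up.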
[jira] [Commented] (SPARK-29838) PostgreSQL dialect: cast to timestamp
[ https://issues.apache.org/jira/browse/SPARK-29838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971680#comment-16971680 ] Aman Omer commented on SPARK-29838: --- Sure, I have started coding for this one and analyzed https://issues.apache.org/jira/browse/SPARK-29840 , https://issues.apache.org/jira/browse/SPARK-29842 . I will soon raise PR. [~Ngone51] you can plan your work accordingly and kindly inform (in comments) before starting to avoid duplicate efforts. > PostgreSQL dialect: cast to timestamp > - > > Key: SPARK-29838 > URL: https://issues.apache.org/jira/browse/SPARK-29838 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: jobit mathew >Priority: Major > > Make SparkSQL's cast to timestamp behavior be consistent with PostgreSQL when > spark.sql.dialect is configured as PostgreSQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29844) Improper unpersist strategy in ml.recommendation.ASL.train
[ https://issues.apache.org/jira/browse/SPARK-29844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971654#comment-16971654 ] Sean R. Owen commented on SPARK-29844: -- Nice, I wonder what else CacheCheck will turn up? I think #1 isn't a problem in master, at least judging by the PR. #2 seems valid. Yes, the caller has to deal with unpersisting these if desired. > Improper unpersist strategy in ml.recommendation.ASL.train > -- > > Key: SPARK-29844 > URL: https://issues.apache.org/jira/browse/SPARK-29844 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.3 >Reporter: Dong Wang >Priority: Minor > > In ml.recommendation.ALS.train(), there are many intermediate RDDs. At the > end of the method, these RDDs invoke unpersist(), but the timing of the > unpersist calls is not right, which causes recomputation and wasted memory. > {code:scala} > val userIdAndFactors = userInBlocks > .mapValues(_.srcIds) > .join(userFactors) > .mapPartitions({ items => > items.flatMap { case (_, (ids, factors)) => > ids.view.zip(factors) > } > // Preserve the partitioning because IDs are consistent with the > partitioners in userInBlocks > // and userFactors. > }, preservesPartitioning = true) > .setName("userFactors") > .persist(finalRDDStorageLevel) // Missing unpersist, but hard to fix > val itemIdAndFactors = itemInBlocks > .mapValues(_.srcIds) > .join(itemFactors) > .mapPartitions({ items => > items.flatMap { case (_, (ids, factors)) => > ids.view.zip(factors) > } > }, preservesPartitioning = true) > .setName("itemFactors") > .persist(finalRDDStorageLevel) // Missing unpersist, but hard to fix > if (finalRDDStorageLevel != StorageLevel.NONE) { > userIdAndFactors.count() > itemFactors.unpersist() // Premature unpersist > itemIdAndFactors.count() > userInBlocks.unpersist() // Lagging unpersist > userOutBlocks.unpersist() // Lagging unpersist > itemInBlocks.unpersist() > itemOutBlocks.unpersist() // Lagging unpersist > blockRatings.unpersist() // Lagging unpersist > } > (userIdAndFactors, itemIdAndFactors) > } > {code} > 1. itemFactors is unpersisted too early. itemIdAndFactors.count() will use > itemFactors, so itemFactors will be recomputed. > 2. userInBlocks, userOutBlocks, itemOutBlocks, and blockRatings are unpersisted too > late. The final action, itemIdAndFactors.count(), will not use these RDDs, so > they can be unpersisted before it to save memory. > By the way, itemIdAndFactors is persisted here but is never unpersisted > until the application ends. This may hurt performance, but I think it's > hard to fix. > This issue is reported by our tool CacheCheck, which is used to dynamically > detect persist()/unpersist() API misuses. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
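The premature-unpersist problem described above can be modeled in a few lines. This is a toy lineage model with illustrative names, not Spark's RDD machinery: an action on a dependent dataset whose parent has already been unpersisted forces the parent to be recomputed, while running the action first makes the later unpersist free.

```python
# Toy model of caching and lineage: a count() on a dataset recomputes any
# non-cached parent in its lineage, mirroring the recomputation cost the
# report attributes to unpersisting itemFactors too early.
class Dataset:
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent
        self.cached = False
        self.computations = 0   # how many times this dataset was (re)built

    def persist(self):
        self.cached = True
        return self

    def unpersist(self):
        self.cached = False

    def count(self):
        # an action: materializes this dataset, rebuilding a parent
        # whenever that parent is no longer cached
        if self.parent is not None and not self.parent.cached:
            self.parent.computations += 1
        self.computations += 1

# Premature unpersist, as in the report: the parent is dropped before the
# action on the dependent dataset that still reads it.
item_factors = Dataset("itemFactors").persist()
item_ids = Dataset("itemIdAndFactors", parent=item_factors).persist()
item_factors.unpersist()
item_ids.count()
recompute_early = item_factors.computations   # 1: lineage recomputed

# Corrected ordering: run the action first, then unpersist the parent.
item_factors2 = Dataset("itemFactors").persist()
item_ids2 = Dataset("itemIdAndFactors", parent=item_factors2).persist()
item_ids2.count()
item_factors2.unpersist()
recompute_late = item_factors2.computations   # 0: no recomputation

print(recompute_early, recompute_late)
```

The same reasoning, run in reverse, covers the "lagging unpersist" half of the report: datasets the final action no longer needs can be unpersisted before it, freeing memory without triggering any recomputation.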
[jira] [Created] (SPARK-29850) sort-merge-join an empty table should not memory leak
Wenchen Fan created SPARK-29850: --- Summary: sort-merge-join an empty table should not memory leak Key: SPARK-29850 URL: https://issues.apache.org/jira/browse/SPARK-29850 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29844) Improper unpersist strategy in ml.recommendation.ASL.train
[ https://issues.apache.org/jira/browse/SPARK-29844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-29844: - Priority: Minor (was: Major) > Improper unpersist strategy in ml.recommendation.ASL.train > -- > > Key: SPARK-29844 > URL: https://issues.apache.org/jira/browse/SPARK-29844 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.3 >Reporter: Dong Wang >Priority: Minor > > In ml.recommendation.ASL.train(), there are many intermediate RDDs. At the > end of the method, these RDDs invoke unpersist(), but the timings of > unpersist is not right, which will cause recomputation and memory waste. > {code:scala} > val userIdAndFactors = userInBlocks > .mapValues(_.srcIds) > .join(userFactors) > .mapPartitions({ items => > items.flatMap { case (_, (ids, factors)) => > ids.view.zip(factors) > } > // Preserve the partitioning because IDs are consistent with the > partitioners in userInBlocks > // and userFactors. > }, preservesPartitioning = true) > .setName("userFactors") > .persist(finalRDDStorageLevel) // Missing unpersist, but hard to fix > val itemIdAndFactors = itemInBlocks > .mapValues(_.srcIds) > .join(itemFactors) > .mapPartitions({ items => > items.flatMap { case (_, (ids, factors)) => > ids.view.zip(factors) > } > }, preservesPartitioning = true) > .setName("itemFactors") > .persist(finalRDDStorageLevel) // Missing unpersist, but hard to fix > if (finalRDDStorageLevel != StorageLevel.NONE) { > userIdAndFactors.count() > itemFactors.unpersist() // Premature unpersist > itemIdAndFactors.count() > userInBlocks.unpersist() // Lagging unpersist > userOutBlocks.unpersist() // Lagging unpersist > itemInBlocks.unpersist() > itemOutBlocks.unpersist() // Lagging unpersist > blockRatings.unpersist() // Lagging unpersist > } > (userIdAndFactors, itemIdAndFactors) > } > {code} > 1. Unpersist itemFactors too early. itemIdAndFactors.count() will use > itemFactors. So itemFactors will be recomputed. 
> 2. Unpersist userInBlocks, userOutBlocks, itemOutBlocks, and blockRatings too > late. The final action - itemIdAndFactors.count() will not use these RDDs, so > these RDDs can be unpersisted before it to save memory. > By the way, itemIdAndFactors is persisted here but will never be unpersisted > util the application ends. It may hurts the performance, but I think it's > hard to fix. > This issue is reported by our tool CacheCheck, which is used to dynamically > detecting persist()/unpersist() api misuses. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29848) PostgreSQL dialect: cast to bigint
[ https://issues.apache.org/jira/browse/SPARK-29848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971580#comment-16971580 ] Rakesh Raushan commented on SPARK-29848: I will work on this one > PostgreSQL dialect: cast to bigint > -- > > Key: SPARK-29848 > URL: https://issues.apache.org/jira/browse/SPARK-29848 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > Spark: > 0: jdbc:hive2://10.18.19.208:23040/default> select CAST('0xcc' AS > bigint); > +---+ > | CAST(0xcc AS BIGINT) | > +---+ > | NULL | > +---+ > Postgre SQL > 22P02: invalid input syntax for integer: "0xcc" -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
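The behavioral gap in the report is between null-on-failure and error-on-failure casts. A plain-Python contrast (these helpers are illustrative, not Spark or PostgreSQL internals; note that a base-10 integer parse rejects the hex-looking literal in both cases):

```python
# Spark's CAST('0xcc' AS bigint) yields NULL on unparseable input; the
# PostgreSQL dialect should instead raise, matching the 22P02 error above.
def cast_bigint_spark_style(s):
    try:
        return int(s.strip())        # base-10 parse; "0xcc" is not valid
    except ValueError:
        return None                  # NULL, as in the Spark output above

def cast_bigint_pg_style(s):
    try:
        return int(s.strip())
    except ValueError:
        raise ValueError('invalid input syntax for integer: "%s"' % s)

print(cast_bigint_spark_style("0xcc"))  # None
print(cast_bigint_spark_style("204"))   # 204
```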
[jira] [Commented] (SPARK-29849) Spark trunc() func does not support for number group as PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-29849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971579#comment-16971579 ] Rakesh Raushan commented on SPARK-29849: I will work on this > Spark trunc() func does not support for number group as PostgreSQL > -- > > Key: SPARK-29849 > URL: https://issues.apache.org/jira/browse/SPARK-29849 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > PostgreSQL trunc() function accepts number group as below > SELECT trunc(1234567891.1234567891,4); > output > |1|1234567891,1234| > Spark does not accept > jdbc:hive2://10.18.19.208:23040/default> SELECT > trunc(1234567891.1234567891D,4); > Error: org.apache.spark.sql.AnalysisException: cannot resolve > 'trunc(1.2345678911234567E9D, CAST(4 AS STRING))' due to data type mismatch: > argument 1 requires date type, however, '1.2345678911234567E9D' is of double > type.; line 1 pos 7; > 'Project [unresolvedalias(trunc(1.2345678911234567E9, cast(4 as string)), > None)] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
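The requested semantics are PostgreSQL's two-argument trunc(numeric, scale): truncate toward zero at the given number of decimal places. A sketch of that behavior with Python's decimal module ({{trunc_numeric}} is an assumed helper name, not Spark code):

```python
from decimal import ROUND_DOWN, Decimal

# PostgreSQL-style trunc(numeric, scale): truncate toward zero at `scale`
# decimal places. ROUND_DOWN is decimal's truncate-toward-zero mode.
def trunc_numeric(value, scale=0):
    quantum = Decimal(1).scaleb(-scale)   # e.g. scale=4 -> Decimal('0.0001')
    return Decimal(str(value)).quantize(quantum, rounding=ROUND_DOWN)

print(trunc_numeric("1234567891.1234567891", 4))  # 1234567891.1234
```

This contrasts with Spark's current single-argument trunc(), which is date-only, hence the "argument 1 requires date type" analysis error in the report.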
[jira] [Commented] (SPARK-29849) Spark trunc() func does not support for number group as PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-29849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971574#comment-16971574 ] Aman Omer commented on SPARK-29849: --- I am checking this issue. > Spark trunc() func does not support for number group as PostgreSQL > -- > > Key: SPARK-29849 > URL: https://issues.apache.org/jira/browse/SPARK-29849 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > PostgreSQL trunc() function accepts number group as below > SELECT trunc(1234567891.1234567891,4); > output > |1|1234567891,1234| > Spark does not accept > jdbc:hive2://10.18.19.208:23040/default> SELECT > trunc(1234567891.1234567891D,4); > Error: org.apache.spark.sql.AnalysisException: cannot resolve > 'trunc(1.2345678911234567E9D, CAST(4 AS STRING))' due to data type mismatch: > argument 1 requires date type, however, '1.2345678911234567E9D' is of double > type.; line 1 pos 7; > 'Project [unresolvedalias(trunc(1.2345678911234567E9, cast(4 as string)), > None)] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-29849) Spark trunc() func does not support for number group as PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-29849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Omer updated SPARK-29849: -- Comment: was deleted (was: I am checking this issue.) > Spark trunc() func does not support for number group as PostgreSQL > -- > > Key: SPARK-29849 > URL: https://issues.apache.org/jira/browse/SPARK-29849 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > PostgreSQL trunc() function accepts number group as below > SELECT trunc(1234567891.1234567891,4); > output > |1|1234567891,1234| > Spark does not accept > jdbc:hive2://10.18.19.208:23040/default> SELECT > trunc(1234567891.1234567891D,4); > Error: org.apache.spark.sql.AnalysisException: cannot resolve > 'trunc(1.2345678911234567E9D, CAST(4 AS STRING))' due to data type mismatch: > argument 1 requires date type, however, '1.2345678911234567E9D' is of double > type.; line 1 pos 7; > 'Project [unresolvedalias(trunc(1.2345678911234567E9, cast(4 as string)), > None)] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29838) PostgreSQL dialect: cast to timestamp
[ https://issues.apache.org/jira/browse/SPARK-29838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971570#comment-16971570 ] wuyi commented on SPARK-29838: -- Hey guys, what's going on here? I see [~aman_omer] has commented on several sub-tasks saying you're going to do them, but I'm also planning to do these tasks. Maybe we can coordinate with each other to avoid duplicate work? > PostgreSQL dialect: cast to timestamp > - > > Key: SPARK-29838 > URL: https://issues.apache.org/jira/browse/SPARK-29838 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: jobit mathew >Priority: Major > > Make SparkSQL's cast-to-timestamp behavior consistent with PostgreSQL when > spark.sql.dialect is configured as PostgreSQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29849) Spark trunc() func does not support for number group as PostgreSQL
ABHISHEK KUMAR GUPTA created SPARK-29849: Summary: Spark trunc() func does not support for number group as PostgreSQL Key: SPARK-29849 URL: https://issues.apache.org/jira/browse/SPARK-29849 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: ABHISHEK KUMAR GUPTA PostgreSQL trunc() function accepts number group as below SELECT trunc(1234567891.1234567891,4); output |1|1234567891,1234| Spark does not accept jdbc:hive2://10.18.19.208:23040/default> SELECT trunc(1234567891.1234567891D,4); Error: org.apache.spark.sql.AnalysisException: cannot resolve 'trunc(1.2345678911234567E9D, CAST(4 AS STRING))' due to data type mismatch: argument 1 requires date type, however, '1.2345678911234567E9D' is of double type.; line 1 pos 7; 'Project [unresolvedalias(trunc(1.2345678911234567E9, cast(4 as string)), None)] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29844) Improper unpersist strategy in ml.recommendation.ASL.train
[ https://issues.apache.org/jira/browse/SPARK-29844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dong Wang updated SPARK-29844: -- Affects Version/s: (was: 3.0.0) 2.4.3 > Improper unpersist strategy in ml.recommendation.ASL.train > -- > > Key: SPARK-29844 > URL: https://issues.apache.org/jira/browse/SPARK-29844 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.3 >Reporter: Dong Wang >Priority: Major > > In ml.recommendation.ASL.train(), there are many intermediate RDDs. At the > end of the method, these RDDs invoke unpersist(), but the timings of > unpersist is not right, which will cause recomputation and memory waste. > {code:scala} > val userIdAndFactors = userInBlocks > .mapValues(_.srcIds) > .join(userFactors) > .mapPartitions({ items => > items.flatMap { case (_, (ids, factors)) => > ids.view.zip(factors) > } > // Preserve the partitioning because IDs are consistent with the > partitioners in userInBlocks > // and userFactors. > }, preservesPartitioning = true) > .setName("userFactors") > .persist(finalRDDStorageLevel) // Missing unpersist, but hard to fix > val itemIdAndFactors = itemInBlocks > .mapValues(_.srcIds) > .join(itemFactors) > .mapPartitions({ items => > items.flatMap { case (_, (ids, factors)) => > ids.view.zip(factors) > } > }, preservesPartitioning = true) > .setName("itemFactors") > .persist(finalRDDStorageLevel) // Missing unpersist, but hard to fix > if (finalRDDStorageLevel != StorageLevel.NONE) { > userIdAndFactors.count() > itemFactors.unpersist() // Premature unpersist > itemIdAndFactors.count() > userInBlocks.unpersist() // Lagging unpersist > userOutBlocks.unpersist() // Lagging unpersist > itemInBlocks.unpersist() > itemOutBlocks.unpersist() // Lagging unpersist > blockRatings.unpersist() // Lagging unpersist > } > (userIdAndFactors, itemIdAndFactors) > } > {code} > 1. Unpersist itemFactors too early. itemIdAndFactors.count() will use > itemFactors. 
So itemFactors will be recomputed. > 2. Unpersist userInBlocks, userOutBlocks, itemOutBlocks, and blockRatings too > late. The final action - itemIdAndFactors.count() will not use these RDDs, so > these RDDs can be unpersisted before it to save memory. > By the way, itemIdAndFactors is persisted here but will never be unpersisted > util the application ends. It may hurts the performance, but I think it's > hard to fix. > This issue is reported by our tool CacheCheck, which is used to dynamically > detecting persist()/unpersist() api misuses. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29822) Cast error when there are spaces between signs and values
[ https://issues.apache.org/jira/browse/SPARK-29822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29822: --- Assignee: Kent Yao > Cast error when there are spaces between signs and values > - > > Key: SPARK-29822 > URL: https://issues.apache.org/jira/browse/SPARK-29822 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > > With the latest string to literal optimization, some interval strings can not > be cast when there are some spaces between signs and unit values. > How to reproduce, > {code:java} > select cast(v as interval) from values ('+ 1 second') t(v); > select cast(v as interval) from values ('- 1 second') t(v); > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29822) Cast error when there are spaces between signs and values
[ https://issues.apache.org/jira/browse/SPARK-29822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29822. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26449 [https://github.com/apache/spark/pull/26449] > Cast error when there are spaces between signs and values > - > > Key: SPARK-29822 > URL: https://issues.apache.org/jira/browse/SPARK-29822 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > With the latest string to literal optimization, some interval strings can not > be cast when there are some spaces between signs and unit values. > How to reproduce, > {code:java} > select cast(v as interval) from values ('+ 1 second') t(v); > select cast(v as interval) from values ('- 1 second') t(v); > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
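The failing inputs differ from the accepted ones only by whitespace between the sign and the unit value. One way to sketch the tolerant parsing (a hypothetical normalization pass, not the actual code from PR 26449) is to collapse that whitespace before handing the string to the unit parser:

```python
import re

# Tolerate whitespace between a leading sign and the rest of the interval
# string: '+ 1 second' -> '+1 second'. Strings without a detached sign
# pass through unchanged.
_SIGN_RE = re.compile(r"^\s*([+-])\s+")

def normalize_interval(s):
    return _SIGN_RE.sub(lambda m: m.group(1), s)

print(normalize_interval("+ 1 second"))  # '+1 second'
print(normalize_interval("- 1 second"))  # '-1 second'
```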
[jira] [Created] (SPARK-29848) PostgreSQL dialect: cast to bigint
ABHISHEK KUMAR GUPTA created SPARK-29848: Summary: PostgreSQL dialect: cast to bigint Key: SPARK-29848 URL: https://issues.apache.org/jira/browse/SPARK-29848 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: ABHISHEK KUMAR GUPTA Spark: 0: jdbc:hive2://10.18.19.208:23040/default> select CAST('0xcc' AS bigint); +---+ | CAST(0xcc AS BIGINT) | +---+ | NULL | +---+ Postgre SQL 22P02: invalid input syntax for integer: "0xcc" -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29845) PostgreSQL dialect: cast to decimal
[ https://issues.apache.org/jira/browse/SPARK-29845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971516#comment-16971516 ] Rakesh Raushan commented on SPARK-29845: I will work on this > PostgreSQL dialect: cast to decimal > --- > > Key: SPARK-29845 > URL: https://issues.apache.org/jira/browse/SPARK-29845 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: jobit mathew >Priority: Minor > > Make SparkSQL's cast to decimal behavior be consistent with PostgreSQL when > spark.sql.dialect is configured as PostgreSQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29847) PostgreSQL dialect: cast to varchar
[ https://issues.apache.org/jira/browse/SPARK-29847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971513#comment-16971513 ] Aman Omer commented on SPARK-29847: --- I am checking this one. > PostgreSQL dialect: cast to varchar > --- > > Key: SPARK-29847 > URL: https://issues.apache.org/jira/browse/SPARK-29847 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > In Spark > jdbc:hive2://10.18.19.208:23040/default> select cast('10.345bb' as > varchar(10)); > +---+ > | CAST(10.345bb AS STRING) | > +---+ > | 10.345*bb* | > +---+ > > In PostgreSQL > select cast('10.345bb' as varchar(10)); > varchar varchar1 *10.345* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29846) PostgreSQL dialect: cast to char
[ https://issues.apache.org/jira/browse/SPARK-29846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971505#comment-16971505 ] Ankit Raj Boudh commented on SPARK-29846: - I will raise PR for this. > PostgreSQL dialect: cast to char > > > Key: SPARK-29846 > URL: https://issues.apache.org/jira/browse/SPARK-29846 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: jobit mathew >Priority: Minor > > Make SparkSQL's cast to char behavior be consistent with PostgreSQL when > spark.sql.dialect is configured as PostgreSQL. > {code:java} > spark-sql> select cast ('10.22333' as > char(5)); > 10.22333 > Time taken: 0.062 seconds, Fetched 1 row(s) > spark-sql> > {code} > *postgresql* > select cast ('10.22333' as char(5)); > > || ||bpchar|| > |1|10.22| -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29847) PostgreSQL dialect: cast to varchar
ABHISHEK KUMAR GUPTA created SPARK-29847: Summary: PostgreSQL dialect: cast to varchar Key: SPARK-29847 URL: https://issues.apache.org/jira/browse/SPARK-29847 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: ABHISHEK KUMAR GUPTA In Spark jdbc:hive2://10.18.19.208:23040/default> select cast('10.345bb' as varchar(10)); +---+ | CAST(10.345bb AS STRING) | +---+ | 10.345*bb* | +---+ In PostgreSQL select cast('10.345bb' as varchar(10)); varchar varchar1 *10.345* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29846) PostgreSQL dialect: cast to char
[ https://issues.apache.org/jira/browse/SPARK-29846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jobit mathew updated SPARK-29846: - Description: Make SparkSQL's cast to char behavior be consistent with PostgreSQL when spark.sql.dialect is configured as PostgreSQL. {code:java} spark-sql> select cast ('10.22333' as char(5)); 10.22333 Time taken: 0.062 seconds, Fetched 1 row(s) spark-sql> {code} *postgresql* select cast ('10.22333' as char(5)); || ||bpchar|| |1|10.22| was: Make SparkSQL's cast to char behavior be consistent with PostgreSQL when spark.sql.dialect is configured as PostgreSQL. {code:java} spark-sql> select cast ('10.22333' as char(5)); 10.22333 Time taken: 0.062 seconds, Fetched 1 row(s) spark-sql> *postgresql* select cast ('10.22333' as char(5)); bpchar 1 10.22 {code} > PostgreSQL dialect: cast to char > > > Key: SPARK-29846 > URL: https://issues.apache.org/jira/browse/SPARK-29846 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: jobit mathew >Priority: Minor > > Make SparkSQL's cast to char behavior be consistent with PostgreSQL when > spark.sql.dialect is configured as PostgreSQL. > {code:java} > spark-sql> select cast ('10.22333' as > char(5)); > 10.22333 > Time taken: 0.062 seconds, Fetched 1 row(s) > spark-sql> > {code} > *postgresql* > select cast ('10.22333' as char(5)); > > || ||bpchar|| > |1|10.22| -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29846) PostgreSQL dialect: cast to char
[ https://issues.apache.org/jira/browse/SPARK-29846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jobit mathew updated SPARK-29846: - Description: Make SparkSQL's cast to char behavior be consistent with PostgreSQL when spark.sql.dialect is configured as PostgreSQL. {code:java} spark-sql> select cast ('10.22333' as char(5)); 10.22333 Time taken: 0.062 seconds, Fetched 1 row(s) spark-sql> *postgresql* select cast ('10.22333' as char(5)); bpchar 1 10.22 {code} was: Make SparkSQL's cast to char behavior be consistent with PostgreSQL when spark.sql.dialect is configured as PostgreSQL. > PostgreSQL dialect: cast to char > > > Key: SPARK-29846 > URL: https://issues.apache.org/jira/browse/SPARK-29846 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: jobit mathew >Priority: Minor > > Make SparkSQL's cast to char behavior be consistent with PostgreSQL when > spark.sql.dialect is configured as PostgreSQL. > {code:java} > spark-sql> select cast ('10.22333' as > char(5)); > 10.22333 > Time taken: 0.062 seconds, Fetched 1 row(s) > spark-sql> > *postgresql* > select cast ('10.22333' as char(5)); > bpchar > 1 10.22 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29846) PostgreSQL dialect: cast to char
jobit mathew created SPARK-29846: Summary: PostgreSQL dialect: cast to char Key: SPARK-29846 URL: https://issues.apache.org/jira/browse/SPARK-29846 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: jobit mathew Make SparkSQL's cast-to-char behavior consistent with PostgreSQL's when spark.sql.dialect is configured as PostgreSQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
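The char(5) example above shows the gap this sub-task targets: PostgreSQL's bpchar truncates '10.22333' to five characters ('10.22'), while Spark (pre-dialect support) keeps the full string. A minimal Python sketch of the two semantics; the function names are mine, not from either system, and PostgreSQL details such as trailing-blank comparison rules are omitted:

```python
def pg_char_cast(value: str, n: int) -> str:
    """Model of PostgreSQL's cast to char(n) (bpchar): truncate to n
    characters and blank-pad values shorter than n."""
    return value[:n].ljust(n)

def spark_char_cast(value: str, n: int) -> str:
    """Model of Spark's legacy behavior: char(n) is treated as STRING,
    so the value is kept in full and n is ignored."""
    return value

# Mirrors the Jira example: PG yields '10.22', Spark yields '10.22333'
assert pg_char_cast('10.22333', 5) == '10.22'
assert spark_char_cast('10.22333', 5) == '10.22333'
```

The padding in `pg_char_cast` is why PostgreSQL's char(n) is called "blank-padded character"; truncation on cast is the part that differs visibly in the example above.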
[jira] [Created] (SPARK-29845) PostgreSQL dialect: cast to decimal
jobit mathew created SPARK-29845: Summary: PostgreSQL dialect: cast to decimal Key: SPARK-29845 URL: https://issues.apache.org/jira/browse/SPARK-29845 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: jobit mathew Make SparkSQL's cast-to-decimal behavior consistent with PostgreSQL's when spark.sql.dialect is configured as PostgreSQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29840) PostgreSQL dialect: cast to integer
[ https://issues.apache.org/jira/browse/SPARK-29840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971476#comment-16971476 ] Aman Omer commented on SPARK-29840: --- Working on this. > PostgreSQL dialect: cast to integer > --- > > Key: SPARK-29840 > URL: https://issues.apache.org/jira/browse/SPARK-29840 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: jobit mathew >Priority: Major > > Make SparkSQL's cast to integer behavior be consistent with PostgreSQL when > spark.sql.dialect is configured as PostgreSQL. > Example:*currently spark sql* > {code:java} > spark-sql> select CAST ('10C' AS INTEGER); > NULL > Time taken: 0.051 seconds, Fetched 1 row(s) > spark-sql> > {code} > *postgresql* > {code:java} > postgresql > select CAST ('10C' AS INTEGER); > Error(s), warning(s): > 22P02: invalid input syntax for integer: "10C" > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29842) PostgreSQL dialect: cast to double
[ https://issues.apache.org/jira/browse/SPARK-29842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971477#comment-16971477 ] Aman Omer commented on SPARK-29842: --- I will work on this > PostgreSQL dialect: cast to double > -- > > Key: SPARK-29842 > URL: https://issues.apache.org/jira/browse/SPARK-29842 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: jobit mathew >Priority: Major > > Make SparkSQL's cast to double behavior be consistent with PostgreSQL when > spark.sql.dialect is configured as PostgreSQL. > some examples > {code:java} > spark-sql> select CAST ('10.2' AS DOUBLE PRECISION); > Error in query: > extraneous input 'PRECISION' expecting ')'(line 1, pos 30) > == SQL == > select CAST ('10.2' AS DOUBLE PRECISION) > --^^^ > spark-sql> select CAST ('10.2' AS DOUBLE PRECISION); > Error in query: > extraneous input 'PRECISION' expecting ')'(line 1, pos 30) > == SQL == > select CAST ('10.2' AS DOUBLE PRECISION) > --^^^ > spark-sql> select CAST ('10.2' AS DOUBLE); > 10.2 > Time taken: 0.08 seconds, Fetched 1 row(s) > spark-sql> select CAST ('10.' AS DOUBLE); > 10. > Time taken: 0.08 seconds, Fetched 1 row(s) > spark-sql> select CAST ('ff' AS DOUBLE); > NULL > Time taken: 0.08 seconds, Fetched 1 row(s) > spark-sql> select CAST ('1' AS DOUBLE); > 1.1112E16 > Time taken: 0.067 seconds, Fetched 1 row(s) > spark-sql> > {code} > Postgresql > select CAST ('10.222' AS DOUBLE PRECISION); > select CAST ('1' AS DOUBLE PRECISION); > select CAST ('ff' AS DOUBLE PRECISION); > > > || ||float8|| > |1|10,222| > > || ||float8|| > |1|1,11E+16| > Error(s), warning(s): > 22P02: invalid input syntax for type double precision: "ff" > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29843) PostgreSQL dialect: cast to float
[ https://issues.apache.org/jira/browse/SPARK-29843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971478#comment-16971478 ] Aman Omer commented on SPARK-29843: --- I will work on this > PostgreSQL dialect: cast to float > - > > Key: SPARK-29843 > URL: https://issues.apache.org/jira/browse/SPARK-29843 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: jobit mathew >Priority: Major > > Make SparkSQL's cast to float behavior be consistent with PostgreSQL when > spark.sql.dialect is configured as PostgreSQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29844) Improper unpersist strategy in ml.recommendation.ALS.train
[ https://issues.apache.org/jira/browse/SPARK-29844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dong Wang updated SPARK-29844: -- Summary: Improper unpersist strategy in ml.recommendation.ALS.train (was: Wrong unpersist strategy in ml.recommendation.ALS.train)
> Improper unpersist strategy in ml.recommendation.ALS.train
> --
>
> Key: SPARK-29844
> URL: https://issues.apache.org/jira/browse/SPARK-29844
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 3.0.0
> Reporter: Dong Wang
> Priority: Major
>
> In ml.recommendation.ALS.train(), there are many intermediate RDDs. At the end of the method, these RDDs invoke unpersist(), but the timing of these unpersist() calls is not right, which causes recomputation and wastes memory.
> {code:scala}
> val userIdAndFactors = userInBlocks
>   .mapValues(_.srcIds)
>   .join(userFactors)
>   .mapPartitions({ items =>
>     items.flatMap { case (_, (ids, factors)) =>
>       ids.view.zip(factors)
>     }
>   // Preserve the partitioning because IDs are consistent with the partitioners in userInBlocks
>   // and userFactors.
>   }, preservesPartitioning = true)
>   .setName("userFactors")
>   .persist(finalRDDStorageLevel) // Missing unpersist, but hard to fix
> val itemIdAndFactors = itemInBlocks
>   .mapValues(_.srcIds)
>   .join(itemFactors)
>   .mapPartitions({ items =>
>     items.flatMap { case (_, (ids, factors)) =>
>       ids.view.zip(factors)
>     }
>   }, preservesPartitioning = true)
>   .setName("itemFactors")
>   .persist(finalRDDStorageLevel) // Missing unpersist, but hard to fix
> if (finalRDDStorageLevel != StorageLevel.NONE) {
>   userIdAndFactors.count()
>   itemFactors.unpersist() // Premature unpersist
>   itemIdAndFactors.count()
>   userInBlocks.unpersist() // Lagging unpersist
>   userOutBlocks.unpersist() // Lagging unpersist
>   itemInBlocks.unpersist()
>   itemOutBlocks.unpersist() // Lagging unpersist
>   blockRatings.unpersist() // Lagging unpersist
> }
> (userIdAndFactors, itemIdAndFactors)
> }
> {code}
> 1. itemFactors is unpersisted too early: itemIdAndFactors.count() still uses itemFactors, so itemFactors will be recomputed.
> 2. userInBlocks, userOutBlocks, itemOutBlocks, and blockRatings are unpersisted too late: the final action, itemIdAndFactors.count(), does not use these RDDs, so they could be unpersisted before it to save memory.
> By the way, itemIdAndFactors is persisted here but will never be unpersisted until the application ends. It may hurt performance, but I think it's hard to fix.
> This issue is reported by our tool CacheCheck, which is used to dynamically detect persist()/unpersist() API misuses.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29844) Wrong unpersist strategy in ml.recommendation.ALS.train
Dong Wang created SPARK-29844: - Summary: Wrong unpersist strategy in ml.recommendation.ALS.train Key: SPARK-29844 URL: https://issues.apache.org/jira/browse/SPARK-29844 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.0.0 Reporter: Dong Wang In ml.recommendation.ALS.train(), there are many intermediate RDDs. At the end of the method, these RDDs invoke unpersist(), but the timing of these unpersist() calls is not right, which causes recomputation and wastes memory.
{code:scala}
val userIdAndFactors = userInBlocks
  .mapValues(_.srcIds)
  .join(userFactors)
  .mapPartitions({ items =>
    items.flatMap { case (_, (ids, factors)) =>
      ids.view.zip(factors)
    }
  // Preserve the partitioning because IDs are consistent with the partitioners in userInBlocks
  // and userFactors.
  }, preservesPartitioning = true)
  .setName("userFactors")
  .persist(finalRDDStorageLevel) // Missing unpersist, but hard to fix
val itemIdAndFactors = itemInBlocks
  .mapValues(_.srcIds)
  .join(itemFactors)
  .mapPartitions({ items =>
    items.flatMap { case (_, (ids, factors)) =>
      ids.view.zip(factors)
    }
  }, preservesPartitioning = true)
  .setName("itemFactors")
  .persist(finalRDDStorageLevel) // Missing unpersist, but hard to fix
if (finalRDDStorageLevel != StorageLevel.NONE) {
  userIdAndFactors.count()
  itemFactors.unpersist() // Premature unpersist
  itemIdAndFactors.count()
  userInBlocks.unpersist() // Lagging unpersist
  userOutBlocks.unpersist() // Lagging unpersist
  itemInBlocks.unpersist()
  itemOutBlocks.unpersist() // Lagging unpersist
  blockRatings.unpersist() // Lagging unpersist
}
(userIdAndFactors, itemIdAndFactors)
}
{code}
1. itemFactors is unpersisted too early: itemIdAndFactors.count() still uses itemFactors, so itemFactors will be recomputed.
2. userInBlocks, userOutBlocks, itemOutBlocks, and blockRatings are unpersisted too late: the final action, itemIdAndFactors.count(), does not use these RDDs, so they could be unpersisted before it to save memory.
By the way, itemIdAndFactors is persisted here but will never be unpersisted until the application ends. It may hurt performance, but I think it's hard to fix. This issue is reported by our tool CacheCheck, which is used to dynamically detect persist()/unpersist() API misuses. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29843) PostgreSQL dialect: cast to float
jobit mathew created SPARK-29843: Summary: PostgreSQL dialect: cast to float Key: SPARK-29843 URL: https://issues.apache.org/jira/browse/SPARK-29843 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: jobit mathew Make SparkSQL's cast-to-float behavior consistent with PostgreSQL's when spark.sql.dialect is configured as PostgreSQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29842) PostgreSQL dialect: cast to double
jobit mathew created SPARK-29842: Summary: PostgreSQL dialect: cast to double Key: SPARK-29842 URL: https://issues.apache.org/jira/browse/SPARK-29842 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: jobit mathew Make SparkSQL's cast-to-double behavior consistent with PostgreSQL's when spark.sql.dialect is configured as PostgreSQL. Some examples:
{code:java}
spark-sql> select CAST ('10.2' AS DOUBLE PRECISION);
Error in query:
extraneous input 'PRECISION' expecting ')'(line 1, pos 30)
== SQL ==
select CAST ('10.2' AS DOUBLE PRECISION)
--^^^
spark-sql> select CAST ('10.2' AS DOUBLE);
10.2
Time taken: 0.08 seconds, Fetched 1 row(s)
spark-sql> select CAST ('10.' AS DOUBLE);
10.
Time taken: 0.08 seconds, Fetched 1 row(s)
spark-sql> select CAST ('ff' AS DOUBLE);
NULL
Time taken: 0.08 seconds, Fetched 1 row(s)
spark-sql> select CAST ('1' AS DOUBLE);
1.1112E16
Time taken: 0.067 seconds, Fetched 1 row(s)
spark-sql>
{code}
Postgresql
select CAST ('10.222' AS DOUBLE PRECISION);
select CAST ('1' AS DOUBLE PRECISION);
select CAST ('ff' AS DOUBLE PRECISION);
|| ||float8||
|1|10,222|
|| ||float8||
|1|1,11E+16|
Error(s), warning(s):
22P02: invalid input syntax for type double precision: "ff"
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29841) PostgreSQL dialect: cast to date
[ https://issues.apache.org/jira/browse/SPARK-29841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971470#comment-16971470 ] pavithra ramachandran commented on SPARK-29841: --- I will check.
> PostgreSQL dialect: cast to date
>
>
> Key: SPARK-29841
> URL: https://issues.apache.org/jira/browse/SPARK-29841
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: jobit mathew
> Priority: Minor
>
> Make SparkSQL's cast-to-date behavior consistent with PostgreSQL's when spark.sql.dialect is configured as PostgreSQL.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29841) PostgreSQL dialect: cast to date
jobit mathew created SPARK-29841: Summary: PostgreSQL dialect: cast to date Key: SPARK-29841 URL: https://issues.apache.org/jira/browse/SPARK-29841 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: jobit mathew Make SparkSQL's cast-to-date behavior consistent with PostgreSQL's when spark.sql.dialect is configured as PostgreSQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29840) PostgreSQL dialect: cast to integer
jobit mathew created SPARK-29840: Summary: PostgreSQL dialect: cast to integer Key: SPARK-29840 URL: https://issues.apache.org/jira/browse/SPARK-29840 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: jobit mathew Make SparkSQL's cast-to-integer behavior consistent with PostgreSQL's when spark.sql.dialect is configured as PostgreSQL. Example: *current Spark SQL*
{code:java}
spark-sql> select CAST ('10C' AS INTEGER);
NULL
Time taken: 0.051 seconds, Fetched 1 row(s)
spark-sql>
{code}
*postgresql*
{code:java}
postgresql
select CAST ('10C' AS INTEGER);
Error(s), warning(s):
22P02: invalid input syntax for integer: "10C"
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
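The core difference in the numeric-cast sub-tasks (integer above, and likewise double/float) is error handling: PostgreSQL rejects malformed input with SQLSTATE 22P02, while Spark's default cast silently yields NULL. A rough Python sketch with hypothetical helper names; PostgreSQL's real parser handles more forms (signs, whitespace variants) than `int()` does, so this is only a model of the strict-vs-lenient contrast:

```python
def pg_int_cast(value: str) -> int:
    # PostgreSQL-style strict cast: malformed input raises an error
    # (modeled on SQLSTATE 22P02, "invalid input syntax for integer")
    try:
        return int(value.strip())
    except ValueError:
        raise ValueError(f'invalid input syntax for integer: "{value}"')

def spark_int_cast(value: str):
    # Spark-style lenient cast: malformed input becomes NULL (None)
    try:
        return int(value.strip())
    except ValueError:
        return None

assert spark_int_cast('10C') is None   # Spark: NULL, as in the Jira example
assert pg_int_cast('42') == 42          # well-formed input casts normally
```

Under a PostgreSQL dialect flag, Spark would route through the strict variant instead of the lenient one; valid inputs behave identically either way.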
[jira] [Updated] (SPARK-29839) Supporting STORED AS in CREATE TABLE LIKE
[ https://issues.apache.org/jira/browse/SPARK-29839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-29839: --- Summary: Supporting STORED AS in CREATE TABLE LIKE (was: Support STORED AS in CREATE TABLE LIKE)
> Supporting STORED AS in CREATE TABLE LIKE
> -
>
> Key: SPARK-29839
> URL: https://issues.apache.org/jira/browse/SPARK-29839
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Lantao Jin
> Priority: Major
>
> In SPARK-29421, we can specify a different table provider for {{CREATE TABLE LIKE}} via {{USING provider}}.
> Hive supports specifying a new file format via the STORED AS syntax:
> {code}
> CREATE TABLE tbl(a int) STORED AS TEXTFILE;
> CREATE TABLE tbl2 LIKE tbl STORED AS PARQUET;
> {code}
> For Hive compatibility, we should also support {{STORED AS}} in {{CREATE TABLE LIKE}}.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29839) Support STORED AS in CREATE TABLE LIKE
Lantao Jin created SPARK-29839: -- Summary: Support STORED AS in CREATE TABLE LIKE Key: SPARK-29839 URL: https://issues.apache.org/jira/browse/SPARK-29839 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Lantao Jin In SPARK-29421, we can specify a different table provider for {{CREATE TABLE LIKE}} via {{USING provider}}. Hive supports specifying a new file format via the STORED AS syntax:
{code}
CREATE TABLE tbl(a int) STORED AS TEXTFILE;
CREATE TABLE tbl2 LIKE tbl STORED AS PARQUET;
{code}
For Hive compatibility, we should also support {{STORED AS}} in {{CREATE TABLE LIKE}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29775) Support truncate multiple tables
[ https://issues.apache.org/jira/browse/SPARK-29775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971448#comment-16971448 ] Rakesh Raushan commented on SPARK-29775: I will work on this.
> Support truncate multiple tables
>
>
> Key: SPARK-29775
> URL: https://issues.apache.org/jira/browse/SPARK-29775
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.4.4
> Reporter: jobit mathew
> Priority: Minor
>
> Spark SQL supports truncating a single table, e.g. TRUNCATE TABLE t1;
> PostgreSQL, however, supports truncating multiple tables in one statement, e.g. TRUNCATE bigtable, fattable;
> so Spark could also support truncating multiple tables.
> [https://www.postgresql.org/docs/12/sql-truncate.html]
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29838) PostgreSQL dialect: cast to timestamp
[ https://issues.apache.org/jira/browse/SPARK-29838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971433#comment-16971433 ] Aman Omer commented on SPARK-29838: --- Already working on this. Thanks [~jobitmathew] > PostgreSQL dialect: cast to timestamp > - > > Key: SPARK-29838 > URL: https://issues.apache.org/jira/browse/SPARK-29838 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: jobit mathew >Priority: Major > > Make SparkSQL's cast to timestamp behavior be consistent with PostgreSQL when > spark.sql.dialect is configured as PostgreSQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29838) PostgreSQL dialect: cast to timestamp
jobit mathew created SPARK-29838: Summary: PostgreSQL dialect: cast to timestamp Key: SPARK-29838 URL: https://issues.apache.org/jira/browse/SPARK-29838 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: jobit mathew Make SparkSQL's cast-to-timestamp behavior consistent with PostgreSQL's when spark.sql.dialect is configured as PostgreSQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29837) PostgreSQL dialect: cast to boolean
[ https://issues.apache.org/jira/browse/SPARK-29837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-29837: - Parent: SPARK-29836 Issue Type: Sub-task (was: Task) > PostgreSQL dialect: cast to boolean > --- > > Key: SPARK-29837 > URL: https://issues.apache.org/jira/browse/SPARK-29837 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Priority: Major > > Make SparkSQL's *cast to boolean* behavior be consistent with PostgreSQL when > spark.sql.dialect is configured as PostgreSQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29837) PostgreSQL dialect: cast to boolean
wuyi created SPARK-29837: Summary: PostgreSQL dialect: cast to boolean Key: SPARK-29837 URL: https://issues.apache.org/jira/browse/SPARK-29837 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.0.0 Reporter: wuyi Make SparkSQL's *cast to boolean* behavior consistent with PostgreSQL's when spark.sql.dialect is configured as PostgreSQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
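For the boolean sub-task, the gap is again strict versus lenient parsing: PostgreSQL accepts a fixed set of literals ('t'/'true'/'yes'/'on'/'1' and their false counterparts, case-insensitively; it also accepts unambiguous prefixes, which this sketch omits) and errors on anything else, while Spark's default cast maps unrecognized strings to NULL. A hypothetical Python model, not code from either project:

```python
PG_TRUE = {'t', 'true', 'y', 'yes', 'on', '1'}
PG_FALSE = {'f', 'false', 'n', 'no', 'off', '0'}

def pg_boolean_cast(value: str) -> bool:
    # PostgreSQL-style strict cast: unrecognized input raises an error
    s = value.strip().lower()
    if s in PG_TRUE:
        return True
    if s in PG_FALSE:
        return False
    raise ValueError(f'invalid input syntax for type boolean: "{value}"')

def spark_boolean_cast(value: str):
    # Spark-style lenient cast: unrecognized input becomes NULL (None)
    s = value.strip().lower()
    if s in ('t', 'true', 'y', 'yes', '1'):
        return True
    if s in ('f', 'false', 'n', 'no', '0'):
        return False
    return None

assert pg_boolean_cast('Yes') is True
assert spark_boolean_cast('maybe') is None  # Spark: NULL; PostgreSQL: error
```

Note that 'on'/'off' are accepted by PostgreSQL but not by Spark's cast, so the dialect work involves both the error behavior and the accepted literal set.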
[jira] [Created] (SPARK-29836) PostgreSQL dialect: cast
wuyi created SPARK-29836: Summary: PostgreSQL dialect: cast Key: SPARK-29836 URL: https://issues.apache.org/jira/browse/SPARK-29836 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: wuyi SparkSQL and PostgreSQL differ in a lot of their default cast behavior between types. We should make SparkSQL's cast behavior consistent with PostgreSQL's when spark.sql.dialect is configured as PostgreSQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org