[jira] [Updated] (SPARK-29815) Missing persist in ml.tuning.CrossValidator.fit()
[ https://issues.apache.org/jira/browse/SPARK-29815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dong Wang updated SPARK-29815: -- Description: dataset.toDF.rdd in ml.tuning.CrossValidator.fit(dataset: Dataset[_]) generates two RDDs: training and validation. Several actions operate on these two RDDs, but dataset.toDF.rdd is not persisted, which causes recomputation.
{code:scala}
// Compute metrics for each model over each split
val splits = MLUtils.kFold(dataset.toDF.rdd, $(numFolds), $(seed)) // dataset.toDF.rdd should be persisted
val metrics = splits.zipWithIndex.map { case ((training, validation), splitIndex) =>
  val trainingDataset = sparkSession.createDataFrame(training, schema).cache()
  val validationDataset = sparkSession.createDataFrame(validation, schema).cache()
{code}
This issue is reported by our tool CacheCheck, which dynamically detects persist()/unpersist() API misuses.
> Missing persist in ml.tuning.CrossValidator.fit()
> -
>
> Key: SPARK-29815
> URL: https://issues.apache.org/jira/browse/SPARK-29815
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Affects Versions: 2.4.3
> Reporter: Dong Wang
> Priority: Major
>
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
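A minimal sketch of the fix the report suggests (hypothetical, not the actual Spark patch; it assumes the surrounding CrossValidator.fit() context and reuses the names from the snippet above): persist the source RDD before kFold splits it, and release it once all folds are done.

```scala
// Hypothetical sketch, not the actual Spark patch: cache the source RDD once,
// since every fold's training/validation pair re-reads its lineage.
import org.apache.spark.storage.StorageLevel

val inputRDD = dataset.toDF.rdd.persist(StorageLevel.MEMORY_AND_DISK)
val splits = MLUtils.kFold(inputRDD, $(numFolds), $(seed))
val metrics = splits.zipWithIndex.map { case ((training, validation), splitIndex) =>
  val trainingDataset = sparkSession.createDataFrame(training, schema).cache()
  val validationDataset = sparkSession.createDataFrame(validation, schema).cache()
  // ... train and evaluate models for this split, as in the original code ...
}
inputRDD.unpersist() // release after the last action over the folds
```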
[jira] [Updated] (SPARK-29816) Missing persist in mllib.evaluation.BinaryClassificationMetrics.recallByThreshold()
[ https://issues.apache.org/jira/browse/SPARK-29816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dong Wang updated SPARK-29816: -- Description: The RDD scoreAndLabels.combineByKey is used by two actions (sortByKey and count()), so it needs to be persisted.
{code:scala}
val counts = scoreAndLabels.combineByKey(
  createCombiner = (label: Double) => new BinaryLabelCounter(0L, 0L) += label,
  mergeValue = (c: BinaryLabelCounter, label: Double) => c += label,
  mergeCombiners = (c1: BinaryLabelCounter, c2: BinaryLabelCounter) => c1 += c2
).sortByKey(ascending = false) // first use
val binnedCounts =
  // Only down-sample if bins is > 0
  if (numBins == 0) {
    // Use original directly
    counts
  } else {
    val countsSize = counts.count() // second use
{code}
This issue is reported by our tool CacheCheck, which dynamically detects persist()/unpersist() API misuses.
> Missing persist in mllib.evaluation.BinaryClassificationMetrics.recallByThreshold()
> ---
>
> Key: SPARK-29816
> URL: https://issues.apache.org/jira/browse/SPARK-29816
> Project: Spark
> Issue Type: Sub-task
> Components: MLlib
> Affects Versions: 2.4.3
> Reporter: Dong Wang
> Priority: Minor
>
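A minimal sketch of the suggested fix (hypothetical, not the actual Spark patch; it reuses the names from the snippet above): persist `counts` right after sortByKey, so the later count() does not recompute the combineByKey/sortByKey lineage.

```scala
// Hypothetical sketch: persist the sorted counts RDD, which is consumed by
// at least two jobs (the down-sampling count() and the later metric computation).
import org.apache.spark.storage.StorageLevel

val counts = scoreAndLabels.combineByKey(
    createCombiner = (label: Double) => new BinaryLabelCounter(0L, 0L) += label,
    mergeValue = (c: BinaryLabelCounter, label: Double) => c += label,
    mergeCombiners = (c1: BinaryLabelCounter, c2: BinaryLabelCounter) => c1 += c2
  ).sortByKey(ascending = false)
  .persist(StorageLevel.MEMORY_AND_DISK)
// ... counts is now computed once and reused by every action, then:
// counts.unpersist()
```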
[jira] [Created] (SPARK-29856) Conditional unnecessary persist on RDDs in ML algorithms
Dong Wang created SPARK-29856: - Summary: Conditional unnecessary persist on RDDs in ML algorithms Key: SPARK-29856 URL: https://issues.apache.org/jira/browse/SPARK-29856 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 3.0.0 Reporter: Dong Wang
When I run example.ml.GradientBoostedTreeRegressorExample, I find that the RDD _{color:#DE350B}baggedInput{color}_ in _ml.tree.impl.RandomForest.run()_ is persisted, but it is only used once, so the persist operation is unnecessary.
{code:scala}
val baggedInput = BaggedPoint
  .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees,
    withReplacement, (tp: TreePoint) => tp.weight, seed = seed)
  .persist(StorageLevel.MEMORY_AND_DISK)
...
while (nodeStack.nonEmpty) {
  ...
  timer.start("findBestSplits")
  RandomForest.findBestSplits(baggedInput, metadata, topNodesForGroup,
    nodesForGroup, treeToNodeToIndexInfo, splits, nodeStack, timer, nodeIdCache)
  timer.stop("findBestSplits")
}
baggedInput.unpersist()
{code}
However, the action on {color:#DE350B}_baggedInput_{color} is inside a while loop. In GradientBoostedTreeRegressorExample this loop executes only once, so only one action uses {color:#DE350B}_baggedInput_{color}. In most ML applications the loop executes many times, which means {color:#DE350B}_baggedInput_{color} is used by many actions, and the persist is then necessary. That is why the persist operation is "conditionally" unnecessary. The same situation exists in many other ML algorithms, e.g., the RDD {color:#DE350B}_instances_{color} in ml.clustering.KMeans.fit() and the RDD {color:#DE350B}_indices_{color} in mllib.clustering.BisectingKMeans.run(). This issue is reported by our tool CacheCheck, which dynamically detects persist()/unpersist() API misuses.
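One way the "conditional" persist could be expressed (a hypothetical sketch only; the expectedPasses value is an assumption, not an existing Spark field): pay the caching cost only when more than one action over the RDD is expected.

```scala
// Hypothetical sketch: persist baggedInput only if more than one pass
// (i.e., more than one action in the training loop) over it is expected.
import org.apache.spark.storage.StorageLevel

val expectedPasses = numIterations // assumed to be known before the loop starts
val baggedInput = BaggedPoint
  .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees,
    withReplacement, (tp: TreePoint) => tp.weight, seed = seed)
if (expectedPasses > 1) baggedInput.persist(StorageLevel.MEMORY_AND_DISK)
// ... training loop running one action per iteration ...
if (expectedPasses > 1) baggedInput.unpersist()
```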
[jira] [Resolved] (SPARK-29001) Print better log when process of events becomes slow
[ https://issues.apache.org/jira/browse/SPARK-29001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29001. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25702 [https://github.com/apache/spark/pull/25702]
> Print better log when process of events becomes slow
>
> Key: SPARK-29001
> URL: https://issues.apache.org/jira/browse/SPARK-29001
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Xingbo Jiang
> Assignee: Xingbo Jiang
> Priority: Minor
> Fix For: 3.0.0
>
> We shall print a better log when processing of events becomes slow, to help find out which type of event is slow.
[jira] [Assigned] (SPARK-29519) SHOW TBLPROPERTIES should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29519: --- Assignee: Pablo Langa Blanco
> SHOW TBLPROPERTIES should look up catalog/table like v2 commands
>
> Key: SPARK-29519
> URL: https://issues.apache.org/jira/browse/SPARK-29519
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Pablo Langa Blanco
> Assignee: Pablo Langa Blanco
> Priority: Major
>
> SHOW TBLPROPERTIES should look up catalog/table like v2 commands
[jira] [Resolved] (SPARK-29519) SHOW TBLPROPERTIES should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29519. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26176 [https://github.com/apache/spark/pull/26176]
> SHOW TBLPROPERTIES should look up catalog/table like v2 commands
>
> Key: SPARK-29519
> URL: https://issues.apache.org/jira/browse/SPARK-29519
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Pablo Langa Blanco
> Assignee: Pablo Langa Blanco
> Priority: Major
> Fix For: 3.0.0
>
> SHOW TBLPROPERTIES should look up catalog/table like v2 commands
[jira] [Created] (SPARK-29855) typed literals with negative sign with proper result or exception
Kent Yao created SPARK-29855: Summary: typed literals with negative sign with proper result or exception Key: SPARK-29855 URL: https://issues.apache.org/jira/browse/SPARK-29855 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Kent Yao
{code:java}
-- !query 83
select -integer '7'
-- !query 83 schema
struct<7:int>
-- !query 83 output
7

-- !query 86
select -date '1999-01-01'
-- !query 86 schema
struct
-- !query 86 output
1999-01-01

-- !query 87
select -timestamp '1999-01-01'
-- !query 87 schema
struct
-- !query 87 output
1999-01-01 00:00:00
{code}
The integer result should be -7, and the date and timestamp results are confusing; those cases should throw an exception instead.
[jira] [Commented] (SPARK-29853) lpad returning empty instead of NULL for empty pad value
[ https://issues.apache.org/jira/browse/SPARK-29853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972079#comment-16972079 ] Ankit Raj Boudh commented on SPARK-29853: - [~hyukjin.kwon], please review the PR.
> lpad returning empty instead of NULL for empty pad value
>
> Key: SPARK-29853
> URL: https://issues.apache.org/jira/browse/SPARK-29853
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: ABHISHEK KUMAR GUPTA
> Priority: Minor
>
> Spark:
> 0: jdbc:hive2://10.18.18.214:23040/default> SELECT lpad('hi', 5, '');
> ++--+
> | lpad(hi, 5, ) |
> ++--+
> | hi |
> ++--+
> 1 row selected (0.186 seconds)
> Hive:
> INFO : Concurrency mode is disabled, not creating a lock manager
> +---+
> | _c0 |
> +---+
> | NULL |
> +---+
[jira] [Updated] (SPARK-29854) lpad and rpad built in function not throw Exception for invalid len value
[ https://issues.apache.org/jira/browse/SPARK-29854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ABHISHEK KUMAR GUPTA updated SPARK-29854: - Description:
Spark (Returns Empty String):
0: jdbc:hive2://10.18.19.208:23040/default> SELECT lpad('hihhh', 5000, '');
++
|lpad(hihhh, CAST(5000 AS INT), )|
++
++
Hive:
SELECT lpad('hihhh', 5000, '');
Error: Error while compiling statement: FAILED: SemanticException [Error 10016]: Line 1:67 Argument type mismatch '''': lpad only takes INT/SHORT/BYTE types as 2-ths argument, got DECIMAL (state=42000,code=10016)
PostgreSQL: function lpad(unknown, numeric, unknown) does not exist
Expected output: Spark should also throw an exception, like Hive.
> lpad and rpad built in function not throw Exception for invalid len value
> -
>
> Key: SPARK-29854
> URL: https://issues.apache.org/jira/browse/SPARK-29854
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: ABHISHEK KUMAR GUPTA
> Priority: Minor
>
[jira] [Updated] (SPARK-29854) lpad and rpad built in function not throw Exception for invalid len value
[ https://issues.apache.org/jira/browse/SPARK-29854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ABHISHEK KUMAR GUPTA updated SPARK-29854: - Description:
Spark (Returns Empty String):
0: jdbc:hive2://10.18.19.208:23040/default> SELECT lpad('hihhh', 5000, '');
++
|lpad(hihhh, CAST(5000 AS INT), )|
++
| |
++
Hive:
SELECT lpad('hihhh', 5000, '');
Error: Error while compiling statement: FAILED: SemanticException [Error 10016]: Line 1:67 Argument type mismatch '''': lpad only takes INT/SHORT/BYTE types as 2-ths argument, got DECIMAL (state=42000,code=10016)
PostgreSQL: function lpad(unknown, numeric, unknown) does not exist
Expected output: Spark should also throw an exception, like Hive.
> lpad and rpad built in function not throw Exception for invalid len value
> -
>
> Key: SPARK-29854
> URL: https://issues.apache.org/jira/browse/SPARK-29854
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: ABHISHEK KUMAR GUPTA
> Priority: Minor
>
[jira] [Commented] (SPARK-29854) lpad and rpad built in function not throw Exception for invalid len value
[ https://issues.apache.org/jira/browse/SPARK-29854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972076#comment-16972076 ] Ankit Raj Boudh commented on SPARK-29854: - I will raise a PR for this.
> lpad and rpad built in function not throw Exception for invalid len value
> -
>
> Key: SPARK-29854
> URL: https://issues.apache.org/jira/browse/SPARK-29854
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: ABHISHEK KUMAR GUPTA
> Priority: Minor
>
> Spark (Returns Empty String):
> 0: jdbc:hive2://10.18.19.208:23040/default> SELECT lpad('hihhh', 5000, '');
> ++
> | lpad(hihhh, CAST(5000 AS INT), ) |
> ++
> | |
> ++
> Hive:
> SELECT lpad('hihhh', 5000, '');
> Error: Error while compiling statement: FAILED: SemanticException [Error 10016]: Line 1:67 Argument type mismatch '''': lpad only takes INT/SHORT/BYTE types as 2-ths argument, got DECIMAL (state=42000,code=10016)
> PostgreSQL: function lpad(unknown, numeric, unknown) does not exist
>
> Expected output: Spark should also throw an exception, like Hive.
>
[jira] [Created] (SPARK-29854) lpad and rpad built in function not throw Exception for invalid len value
ABHISHEK KUMAR GUPTA created SPARK-29854: Summary: lpad and rpad built in function not throw Exception for invalid len value Key: SPARK-29854 URL: https://issues.apache.org/jira/browse/SPARK-29854 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: ABHISHEK KUMAR GUPTA
Spark (Returns Empty String):
0: jdbc:hive2://10.18.19.208:23040/default> SELECT lpad('hihhh', 5000, '');
++
| lpad(hihhh, CAST(5000 AS INT), ) |
++
| |
++
Hive:
SELECT lpad('hihhh', 5000, '');
Error: Error while compiling statement: FAILED: SemanticException [Error 10016]: Line 1:67 Argument type mismatch '''': lpad only takes INT/SHORT/BYTE types as 2-ths argument, got DECIMAL (state=42000,code=10016)
PostgreSQL: function lpad(unknown, numeric, unknown) does not exist
Expected output: Spark should also throw an exception, like Hive.
[jira] [Commented] (SPARK-29853) lpad returning empty instead of NULL for empty pad value
[ https://issues.apache.org/jira/browse/SPARK-29853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972073#comment-16972073 ] Ankit Raj Boudh commented on SPARK-29853: - I will raise a PR for this.
> lpad returning empty instead of NULL for empty pad value
>
> Key: SPARK-29853
> URL: https://issues.apache.org/jira/browse/SPARK-29853
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: ABHISHEK KUMAR GUPTA
> Priority: Minor
>
> Spark:
> 0: jdbc:hive2://10.18.18.214:23040/default> SELECT lpad('hi', 5, '');
> ++--+
> | lpad(hi, 5, ) |
> ++--+
> | hi |
> ++--+
> 1 row selected (0.186 seconds)
> Hive:
> INFO : Concurrency mode is disabled, not creating a lock manager
> +---+
> | _c0 |
> +---+
> | NULL |
> +---+
[jira] [Created] (SPARK-29853) lpad returning empty instead of NULL for empty pad value
ABHISHEK KUMAR GUPTA created SPARK-29853: Summary: lpad returning empty instead of NULL for empty pad value Key: SPARK-29853 URL: https://issues.apache.org/jira/browse/SPARK-29853 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: ABHISHEK KUMAR GUPTA
Spark:
0: jdbc:hive2://10.18.18.214:23040/default> SELECT lpad('hi', 5, '');
++--+
| lpad(hi, 5, ) |
++--+
| hi |
++--+
1 row selected (0.186 seconds)
Hive:
INFO : Concurrency mode is disabled, not creating a lock manager
+---+
| _c0 |
+---+
| NULL |
+---+
[jira] [Comment Edited] (SPARK-29776) rpad returning invalid value when parameter is empty
[ https://issues.apache.org/jira/browse/SPARK-29776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972035#comment-16972035 ] Ankit Raj Boudh edited comment on SPARK-29776 at 11/12/19 5:32 AM: --- [https://github.com/apache/spark/pull/26477], [~hyukjin.kwon] please review this PR. was (Author: ankitraj): yes, today i will submit PR for this.
> rpad returning invalid value when parameter is empty
>
> Key: SPARK-29776
> URL: https://issues.apache.org/jira/browse/SPARK-29776
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: ABHISHEK KUMAR GUPTA
> Priority: Major
>
> As per the rpad definition:
> rpad(str, len, pad) - Returns str, right-padded with pad to a length of len. If str is longer than len, the return value is shortened to len characters. *In case of an empty pad string, the return value is null.*
> Example in Spark:
> {code}
> 0: jdbc:hive2://10.18.19.208:23040/default> SELECT rpad('hi', 5, '');
> ++
> | rpad(hi, 5, ) |
> ++
> | hi |
> ++
> {code}
> It should return NULL as per the definition.
> Hive behavior is correct: as per the definition it returns NULL when pad is an empty string.
> INFO : Concurrency mode is disabled, not creating a lock manager
> {code}
> +---+
> | _c0 |
> +---+
> | NULL |
> +---+
> {code}
[jira] [Commented] (SPARK-29778) saveAsTable append mode is not passing writer options
[ https://issues.apache.org/jira/browse/SPARK-29778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972048#comment-16972048 ] Wesley Hoffman commented on SPARK-29778: - Looks like my PR linked right up. How handy! The test works by creating a custom query listener so that I can gather the {{LogicalPlan}} and assert the proper {{writeOptions}}.
> saveAsTable append mode is not passing writer options
> -
>
> Key: SPARK-29778
> URL: https://issues.apache.org/jira/browse/SPARK-29778
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Burak Yavuz
> Priority: Critical
>
> There was an oversight where AppendData is not getting the WriterOptions in saveAsTable.
> [https://github.com/apache/spark/blob/782992c7ed652400e33bc4b1da04c8155b7b3866/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L530]
[jira] [Updated] (SPARK-29792) SQL metrics cannot be updated to subqueries in AQE
[ https://issues.apache.org/jira/browse/SPARK-29792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ke Jia updated SPARK-29792: --- Description: After [SPARK-28583|https://issues.apache.org/jira/browse/SPARK-28583] was merged, the subquery metrics cannot be updated in AQE. This Jira will fix it.
> SQL metrics cannot be updated to subqueries in AQE
> --
>
> Key: SPARK-29792
> URL: https://issues.apache.org/jira/browse/SPARK-29792
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Wei Xue
> Assignee: Ke Jia
> Priority: Major
>
[jira] [Commented] (SPARK-29776) rpad returning invalid value when parameter is empty
[ https://issues.apache.org/jira/browse/SPARK-29776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972035#comment-16972035 ] Ankit Raj Boudh commented on SPARK-29776: - Yes, I will submit a PR for this today.
> rpad returning invalid value when parameter is empty
>
> Key: SPARK-29776
> URL: https://issues.apache.org/jira/browse/SPARK-29776
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: ABHISHEK KUMAR GUPTA
> Priority: Major
>
> As per the rpad definition:
> rpad(str, len, pad) - Returns str, right-padded with pad to a length of len. If str is longer than len, the return value is shortened to len characters. *In case of an empty pad string, the return value is null.*
> Example in Spark:
> {code}
> 0: jdbc:hive2://10.18.19.208:23040/default> SELECT rpad('hi', 5, '');
> ++
> | rpad(hi, 5, ) |
> ++
> | hi |
> ++
> {code}
> It should return NULL as per the definition.
> Hive behavior is correct: as per the definition it returns NULL when pad is an empty string.
> INFO : Concurrency mode is disabled, not creating a lock manager
> {code}
> +---+
> | _c0 |
> +---+
> | NULL |
> +---+
> {code}
[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972019#comment-16972019 ] zhao bo commented on SPARK-29106: - The ARM worker is back. Sorry for the delay.
> Add jenkins arm test for spark
> --
>
> Key: SPARK-29106
> URL: https://issues.apache.org/jira/browse/SPARK-29106
> Project: Spark
> Issue Type: Test
> Components: Tests
> Affects Versions: 3.0.0
> Reporter: huangtianhua
> Priority: Minor
> Attachments: R-ansible.yml, R-libs.txt, arm-python36.txt
>
> Add ARM test jobs to AMPLab Jenkins for Spark.
> So far we have set up two periodic ARM test jobs for Spark in OpenLab: one is based on master with Hadoop 2.7 (similar to the QA test on AMPLab Jenkins), and the other is based on a new branch we created on 09-09; see [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64] and [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64]. We only have to care about the first one when integrating the ARM test with AMPLab Jenkins.
> About the k8s test on ARM: we have tested it, see [https://github.com/theopenlab/spark/pull/17]; maybe we can integrate it later.
> We also plan to test other stable branches, and we can integrate them into AMPLab when they are ready.
> We have offered an ARM instance and sent the info to Shane Knapp; thanks Shane for adding the first ARM job to AMPLab Jenkins :)
> The other important thing is about leveldbjni [https://github.com/fusesource/leveldbjni]. Spark depends on leveldbjni-all-1.8 [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8], which has no arm64 support. So we built an arm64-supporting release of leveldbjni, see [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8]. However, we can't modify the Spark pom.xml directly with something like 'property'/'profile' to choose the correct jar on ARM or x86, because Spark depends on some Hadoop packages like hadoop-hdfs which also depend on leveldbjni-all-1.8, unless Hadoop releases a new ARM-supporting leveldbjni jar. For now we download the leveldbjni-all-1.8 from openlabtesting and 'mvn install' it when testing Spark on ARM.
> PS: Issues found and fixed:
> SPARK-28770 [https://github.com/apache/spark/pull/25673]
> SPARK-28519 [https://github.com/apache/spark/pull/25279]
> SPARK-28433 [https://github.com/apache/spark/pull/25186]
> SPARK-28467 [https://github.com/apache/spark/pull/25864]
> SPARK-29286 [https://github.com/apache/spark/pull/26021]
[jira] [Assigned] (SPARK-29808) StopWordsRemover should support multi-cols
[ https://issues.apache.org/jira/browse/SPARK-29808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reassigned SPARK-29808: Assignee: Huaxin Gao
> StopWordsRemover should support multi-cols
> --
>
> Key: SPARK-29808
> URL: https://issues.apache.org/jira/browse/SPARK-29808
> Project: Spark
> Issue Type: Improvement
> Components: ML, PySpark
> Affects Versions: 3.0.0
> Reporter: zhengruifeng
> Assignee: Huaxin Gao
> Priority: Minor
>
> As a basic Transformer, StopWordsRemover should support multi-cols.
> Param {color:#93a6f5}stopWords{color} can be applied across all columns.
[jira] [Updated] (SPARK-29851) V2 Catalog: Default behavior of dropping namespace is cascading
[ https://issues.apache.org/jira/browse/SPARK-29851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Terry Kim updated SPARK-29851: -- Summary: V2 Catalog: Default behavior of dropping namespace is cascading (was: DataSourceV2: Default behavior of dropping namespace is cascading)
> V2 Catalog: Default behavior of dropping namespace is cascading
> ---
>
> Key: SPARK-29851
> URL: https://issues.apache.org/jira/browse/SPARK-29851
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Terry Kim
> Priority: Major
>
> Instead of introducing an additional 'cascade' option to dropNamespace(), the default behavior of dropping a namespace will be cascading. Now, to implement the cascade option, the Spark side needs to ensure a namespace is empty before calling dropNamespace().
[jira] [Updated] (SPARK-29852) Implement parallel preemptive RDD.toLocalIterator and Dataset.toLocalIterator
[ https://issues.apache.org/jira/browse/SPARK-29852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Cheng updated SPARK-29852: --- Issue Type: Improvement (was: New Feature) > Implement parallel preemptive RDD.toLocalIterator and Dataset.toLocalIterator > - > > Key: SPARK-29852 > URL: https://issues.apache.org/jira/browse/SPARK-29852 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: Peng Cheng >Priority: Major > Original Estimate: 0h > Remaining Estimate: 0h > > Both RDD and Dataset APIs have 2 methods of collecting data from executors to > the driver: > > # .collect() sets up multiple threads in a job and dumps all data from the executors > into the driver's memory. This is great if data on the driver needs to be accessible > ASAP, but not as efficient if access to partitions can only happen > sequentially, and outright risky if the driver doesn't have enough memory to hold > all the data. > - the solution for issue SPARK-25224 partially alleviates this by delaying > deserialisation of data in InternalRow format, such that only the much > smaller serialised data needs to be held entirely in driver memory. This > solution does not achieve O(1) memory consumption, and thus does not scale to > arbitrarily large datasets. > # .toLocalIterator() fetches one partition per job at a time, and fetching of > the next partition does not start until sequential access to the previous > partition has concluded. This achieves O(1) memory consumption and is > great if access to the data is sequential and significantly slower than the speed > at which partitions can be shipped from a single executor with one thread. It > becomes inefficient when the sequential access to the data has to wait > relatively long for the shipping of the next partition. > The proposed solution is a crossover between the two existing implementations: a > concurrent subroutine that is both CPU- and memory-bounded. 
The solution > allocates a fixed-size resource pool (by default = the number of available CPU > cores) that serves the shipping of partitions concurrently, and blocks > sequential access to a partition's data until its shipping is finished (which > usually happens without blocking for partitionID >= 2, due to the fact that > shipping starts much earlier and preemptively). Tenants of the resource pool > can be GC'ed and evicted once sequential access to their data has finished, > which allows more partitions to be fetched much earlier than they are > accessed. The maximum memory consumption is O(m * n), where m is the > predefined concurrency and n is the size of the largest partition. > The following Scala code snippet demonstrates a simple implementation: > > (requires Scala 2.11+ and ScalaTest) > > {code:java} > package org.apache.spark.spike > import java.util.concurrent.ArrayBlockingQueue > import org.apache.spark.rdd.RDD > import org.apache.spark.sql.SparkSession > import org.apache.spark.{FutureAction, SparkContext} > import org.scalatest.FunSpec > import scala.concurrent.Future > import scala.language.implicitConversions > import scala.reflect.ClassTag > import scala.util.{Failure, Success, Try} > class ToLocalIteratorPreemptivelySpike extends FunSpec { > import ToLocalIteratorPreemptivelySpike._ > lazy val sc: SparkContext = > SparkSession.builder().master("local[*]").getOrCreate().sparkContext > it("can be much faster than toLocalIterator") { > val max = 80 > val delay = 100 > val slowRDD = sc.parallelize(1 to max, 8).map { v => > Thread.sleep(delay) > v > } > val (r1, t1) = timed { > slowRDD.toLocalIterator.toList > } > val capacity = 4 > val (r2, t2) = timed { > slowRDD.toLocalIteratorPreemptively(capacity).toList > } > assert(r1 == r2) > println(s"linear: $t1, preemptive: $t2") > assert(t1 > t2 * 2) > assert(t2 > max * delay / capacity) > } > } > object ToLocalIteratorPreemptivelySpike { > case class PartitionExecution[T: ClassTag]( > @transient self: RDD[T], > id: 
Int > ) { > def eager: this.type = { > AsArray.future > this > } > case object AsArray { > @transient lazy val future: FutureAction[Array[T]] = { > var result: Array[T] = null > val future = self.context.submitJob[T, Array[T], Array[T]]( > self, > _.toArray, > Seq(id), { (_, data) => > result = data > }, > result > ) > future > } > @transient lazy val now: Array[T] = future.get() > } > } > implicit class RDDFunctions[T: ClassTag](self: RDD[T]) { > import scala.concurrent.ExecutionContext.Implicits.global > def _toLocalIteratorPreemptively(cap
[jira] [Created] (SPARK-29852) Implement parallel preemptive RDD.toLocalIterator and Dataset.toLocalIterator
Peng Cheng created SPARK-29852: -- Summary: Implement parallel preemptive RDD.toLocalIterator and Dataset.toLocalIterator Key: SPARK-29852 URL: https://issues.apache.org/jira/browse/SPARK-29852 Project: Spark Issue Type: New Feature Components: Spark Core, SQL Affects Versions: 2.4.4, 3.0.0 Reporter: Peng Cheng Both RDD and Dataset APIs have 2 methods of collecting data from executors to the driver: # .collect() sets up multiple threads in a job and dumps all data from the executors into the driver's memory. This is great if data on the driver needs to be accessible ASAP, but not as efficient if access to partitions can only happen sequentially, and outright risky if the driver doesn't have enough memory to hold all the data. - the solution for issue SPARK-25224 partially alleviates this by delaying deserialisation of data in InternalRow format, such that only the much smaller serialised data needs to be held entirely in driver memory. This solution does not achieve O(1) memory consumption, and thus does not scale to arbitrarily large datasets. # .toLocalIterator() fetches one partition per job at a time, and fetching of the next partition does not start until sequential access to the previous partition has concluded. This achieves O(1) memory consumption and is great if access to the data is sequential and significantly slower than the speed at which partitions can be shipped from a single executor with one thread. It becomes inefficient when the sequential access to the data has to wait relatively long for the shipping of the next partition. The proposed solution is a crossover between the two existing implementations: a concurrent subroutine that is both CPU- and memory-bounded. 
The solution allocates a fixed-size resource pool (by default = the number of available CPU cores) that serves the shipping of partitions concurrently, and blocks sequential access to a partition's data until its shipping is finished (which usually happens without blocking for partitionID >= 2, due to the fact that shipping starts much earlier and preemptively). Tenants of the resource pool can be GC'ed and evicted once sequential access to their data has finished, which allows more partitions to be fetched much earlier than they are accessed. The maximum memory consumption is O(m * n), where m is the predefined concurrency and n is the size of the largest partition. The following Scala code snippet demonstrates a simple implementation: (requires Scala 2.11+ and ScalaTest) {code:java} package org.apache.spark.spike import java.util.concurrent.ArrayBlockingQueue import org.apache.spark.rdd.RDD import org.apache.spark.sql.SparkSession import org.apache.spark.{FutureAction, SparkContext} import org.scalatest.FunSpec import scala.concurrent.Future import scala.language.implicitConversions import scala.reflect.ClassTag import scala.util.{Failure, Success, Try} class ToLocalIteratorPreemptivelySpike extends FunSpec { import ToLocalIteratorPreemptivelySpike._ lazy val sc: SparkContext = SparkSession.builder().master("local[*]").getOrCreate().sparkContext it("can be much faster than toLocalIterator") { val max = 80 val delay = 100 val slowRDD = sc.parallelize(1 to max, 8).map { v => Thread.sleep(delay) v } val (r1, t1) = timed { slowRDD.toLocalIterator.toList } val capacity = 4 val (r2, t2) = timed { slowRDD.toLocalIteratorPreemptively(capacity).toList } assert(r1 == r2) println(s"linear: $t1, preemptive: $t2") assert(t1 > t2 * 2) assert(t2 > max * delay / capacity) } } object ToLocalIteratorPreemptivelySpike { case class PartitionExecution[T: ClassTag]( @transient self: RDD[T], id: Int ) { def eager: this.type = { AsArray.future this } case object AsArray { @transient lazy val future: 
FutureAction[Array[T]] = { var result: Array[T] = null val future = self.context.submitJob[T, Array[T], Array[T]]( self, _.toArray, Seq(id), { (_, data) => result = data }, result ) future } @transient lazy val now: Array[T] = future.get() } } implicit class RDDFunctions[T: ClassTag](self: RDD[T]) { import scala.concurrent.ExecutionContext.Implicits.global def _toLocalIteratorPreemptively(capacity: Int): Iterator[Array[T]] = { val executions = self.partitions.indices.map { ii => PartitionExecution(self, ii) } val buffer = new ArrayBlockingQueue[Try[PartitionExecution[T]]](capacity) Future { executions.foreach { exe => buffer.put(Success(exe)) // may be blocking due to capacity exe.eager // non-blocking } }.onFailure { case e: Throwable => buffer.put(Failure(e)) } self
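The bounded-prefetch pattern proposed in SPARK-29852 can be sketched in plain Python, without Spark: a background thread submits partition fetches ahead of the consumer, a bounded queue caps the number of in-flight fetches at `capacity` (bounding memory to O(capacity * partition size)), and the consumer blocks only until its next partition is ready. The function and the `fetch_partition` callback are illustrative names, not Spark APIs.

```python
# Sketch of preemptive, capacity-bounded partition prefetching (assumed
# names; fetch_partition is any callable that ships one partition's data).
import threading
import queue
from concurrent.futures import ThreadPoolExecutor

def to_local_iterator_preemptively(partition_ids, fetch_partition, capacity=4):
    """Yield partitions in order while prefetching up to `capacity` ahead."""
    buffer = queue.Queue(maxsize=capacity)  # bounds outstanding fetches
    pool = ThreadPoolExecutor(max_workers=capacity)

    def producer():
        for pid in partition_ids:
            # put() blocks once `capacity` fetches are in flight;
            # this back-pressure is what keeps memory bounded.
            buffer.put(pool.submit(fetch_partition, pid))
        buffer.put(None)  # sentinel: no more partitions

    threading.Thread(target=producer, daemon=True).start()
    while (future := buffer.get()) is not None:
        yield future.result()  # blocks only until this partition is shipped
    pool.shutdown()
```

As in the Scala spike, fetching of partition k+1 through k+capacity proceeds while the caller is still iterating over partition k, so sequential access rarely waits after the first partition.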
[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971971#comment-16971971 ] zhao bo commented on SPARK-29106: - Er, Linaro seems to be in trouble. The VM cannot come online now. We will try to contact the maintainer ASAP. Sorry for that. > Add jenkins arm test for spark > -- > > Key: SPARK-29106 > URL: https://issues.apache.org/jira/browse/SPARK-29106 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: huangtianhua >Priority: Minor > Attachments: R-ansible.yml, R-libs.txt, arm-python36.txt > > > Add ARM test jobs to amplab jenkins for spark. > Till now we have made two periodic ARM test jobs for spark in OpenLab: one is > based on master with hadoop 2.7 (similar to the QA test of amplab jenkins), the > other is based on a new branch which we made on 09-09, see > [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64] > and > [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64.|http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64] > We only have to care about the first one when integrating the ARM test with amplab > jenkins. > About the k8s test on ARM, we have tested it, see > [https://github.com/theopenlab/spark/pull/17], maybe we can integrate it > later. > And we plan to test other stable branches too, and we can integrate them into > amplab when they are ready. > We have offered an ARM instance and sent the info to shane knapp; thanks > shane for adding the first ARM job to amplab jenkins :) > The other important thing is about leveldbjni > [https://github.com/fusesource/leveldbjni,|https://github.com/fusesource/leveldbjni/issues/80] > spark depends on leveldbjni-all-1.8 > [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8], > and we can see there is no arm64 support. 
So we built an arm64-supporting > release of leveldbjni, see > [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8], > but we can't modify the spark pom.xml directly with something like a > 'property'/'profile' to choose the correct jar package on the ARM or x86 platform, > because spark depends on some hadoop packages like hadoop-hdfs, and those packages > depend on leveldbjni-all-1.8 too, unless hadoop releases a new ARM-supporting > leveldbjni jar. For now we download the leveldbjni-all-1.8 of > openlabtesting and 'mvn install' it when ARM testing spark. > PS: The issues found and fixed: > SPARK-28770 > [https://github.com/apache/spark/pull/25673] > > SPARK-28519 > [https://github.com/apache/spark/pull/25279] > > SPARK-28433 > [https://github.com/apache/spark/pull/25186] > > SPARK-28467 > [https://github.com/apache/spark/pull/25864] > > SPARK-29286 > [https://github.com/apache/spark/pull/26021] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971968#comment-16971968 ] zhao bo commented on SPARK-29106: - Thanks very much, [~shaneknapp]. Apologies that the VM was down last night, but it's back now. Yeah, I also posted an issue [1] to get the status of ARM support in the apache/arrow community. There are very few resources to support it, even from the community and from us. So I think pyarrow support for ARM is difficult to finish for now. Also note: we got some powerful ARM resources which could replace the current test VM, and we have tested them and they are good to go. What do you think? ;) [1] https://issues.apache.org/jira/browse/ARROW-7042 > Add jenkins arm test for spark > -- > > Key: SPARK-29106 > URL: https://issues.apache.org/jira/browse/SPARK-29106 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: huangtianhua >Priority: Minor > Attachments: R-ansible.yml, R-libs.txt, arm-python36.txt > > > Add ARM test jobs to amplab jenkins for spark. > Till now we have made two periodic ARM test jobs for spark in OpenLab: one is > based on master with hadoop 2.7 (similar to the QA test of amplab jenkins), the > other is based on a new branch which we made on 09-09, see > [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64] > and > [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64.|http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64] > We only have to care about the first one when integrating the ARM test with amplab > jenkins. > About the k8s test on ARM, we have tested it, see > [https://github.com/theopenlab/spark/pull/17], maybe we can integrate it > later. > And we plan to test other stable branches too, and we can integrate them into > amplab when they are ready. 
> We have offered an ARM instance and sent the info to shane knapp; thanks > shane for adding the first ARM job to amplab jenkins :) > The other important thing is about leveldbjni > [https://github.com/fusesource/leveldbjni,|https://github.com/fusesource/leveldbjni/issues/80] > spark depends on leveldbjni-all-1.8 > [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8], > and we can see there is no arm64 support. So we built an arm64-supporting > release of leveldbjni, see > [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8], > but we can't modify the spark pom.xml directly with something like a > 'property'/'profile' to choose the correct jar package on the ARM or x86 platform, > because spark depends on some hadoop packages like hadoop-hdfs, and those packages > depend on leveldbjni-all-1.8 too, unless hadoop releases a new ARM-supporting > leveldbjni jar. For now we download the leveldbjni-all-1.8 of > openlabtesting and 'mvn install' it when ARM testing spark. > PS: The issues found and fixed: > SPARK-28770 > [https://github.com/apache/spark/pull/25673] > > SPARK-28519 > [https://github.com/apache/spark/pull/25279] > > SPARK-28433 > [https://github.com/apache/spark/pull/25186] > > SPARK-28467 > [https://github.com/apache/spark/pull/25864] > > SPARK-29286 > [https://github.com/apache/spark/pull/26021] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29851) DataSourceV2: Default behavior of dropping namespace is cascading
Terry Kim created SPARK-29851: - Summary: DataSourceV2: Default behavior of dropping namespace is cascading Key: SPARK-29851 URL: https://issues.apache.org/jira/browse/SPARK-29851 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Terry Kim Instead of introducing an additional 'cascade' option to dropNamespace(), the default behavior of dropping a namespace will be cascading. To implement non-cascading semantics, the Spark side then needs to ensure a namespace is empty before calling dropNamespace(). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29755) ClassCastException occurs when reading events from SHS
[ https://issues.apache.org/jira/browse/SPARK-29755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin reassigned SPARK-29755: -- Assignee: Jungtaek Lim > ClassCastException occurs when reading events from SHS > -- > > Key: SPARK-29755 > URL: https://issues.apache.org/jira/browse/SPARK-29755 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > > Looks like SPARK-28869 triggered a technical issue on jackson-scala: > https://github.com/FasterXML/jackson-module-scala/wiki/FAQ#deserializing-optionint-and-other-primitive-challenges > {noformat} > 19/11/05 17:59:23 INFO FsHistoryProvider: Leasing disk manager space for app > app-20191105152223- / None... > 19/11/05 17:59:23 INFO FsHistoryProvider: Parsing > /apps/spark/eventlogs/app-20191105152223- to re-build UI... > 19/11/05 17:59:24 ERROR FsHistoryProvider: Exception in checking for event > log updates > java.lang.ClassCastException: java.lang.Integer cannot be cast to > java.lang.Long > at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:107) > at > org.apache.spark.deploy.history.FsHistoryProvider.shouldReloadLog(FsHistoryProvider.scala:585) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$6(FsHistoryProvider.scala:458) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$6$adapted(FsHistoryProvider.scala:444) > at > scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:256) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at > scala.collection.TraversableLike.filterImpl(TraversableLike.scala:255) > at > scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:249) > at > 
scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108) > at scala.collection.TraversableLike.filter(TraversableLike.scala:347) > at scala.collection.TraversableLike.filter$(TraversableLike.scala:347) > at scala.collection.AbstractTraversable.filter(Traversable.scala:108) > at > org.apache.spark.deploy.history.FsHistoryProvider.checkForLogs(FsHistoryProvider.scala:444) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$startPolling$3(FsHistoryProvider.scala:267) > at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1302) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$getRunner$1(FsHistoryProvider.scala:190) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29755) ClassCastException occurs when reading events from SHS
[ https://issues.apache.org/jira/browse/SPARK-29755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-29755. Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26397 [https://github.com/apache/spark/pull/26397] > ClassCastException occurs when reading events from SHS > -- > > Key: SPARK-29755 > URL: https://issues.apache.org/jira/browse/SPARK-29755 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.0.0 > > > Looks like SPARK-28869 triggered a technical issue on jackson-scala: > https://github.com/FasterXML/jackson-module-scala/wiki/FAQ#deserializing-optionint-and-other-primitive-challenges > {noformat} > 19/11/05 17:59:23 INFO FsHistoryProvider: Leasing disk manager space for app > app-20191105152223- / None... > 19/11/05 17:59:23 INFO FsHistoryProvider: Parsing > /apps/spark/eventlogs/app-20191105152223- to re-build UI... 
> 19/11/05 17:59:24 ERROR FsHistoryProvider: Exception in checking for event > log updates > java.lang.ClassCastException: java.lang.Integer cannot be cast to > java.lang.Long > at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:107) > at > org.apache.spark.deploy.history.FsHistoryProvider.shouldReloadLog(FsHistoryProvider.scala:585) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$6(FsHistoryProvider.scala:458) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$6$adapted(FsHistoryProvider.scala:444) > at > scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:256) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at > scala.collection.TraversableLike.filterImpl(TraversableLike.scala:255) > at > scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:249) > at > scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108) > at scala.collection.TraversableLike.filter(TraversableLike.scala:347) > at scala.collection.TraversableLike.filter$(TraversableLike.scala:347) > at scala.collection.AbstractTraversable.filter(Traversable.scala:108) > at > org.apache.spark.deploy.history.FsHistoryProvider.checkForLogs(FsHistoryProvider.scala:444) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$startPolling$3(FsHistoryProvider.scala:267) > at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1302) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$getRunner$1(FsHistoryProvider.scala:190) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) 
> at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26154) Stream-stream joins - left outer join gives inconsistent output
[ https://issues.apache.org/jira/browse/SPARK-26154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-26154. Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26108 [https://github.com/apache/spark/pull/26108] > Stream-stream joins - left outer join gives inconsistent output > --- > > Key: SPARK-26154 > URL: https://issues.apache.org/jira/browse/SPARK-26154 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.2, 3.0.0 > Environment: Spark version - Spark 2.3.2 > OS- Suse 11 >Reporter: Haripriya >Assignee: Jungtaek Lim >Priority: Blocker > Labels: correctness > Fix For: 3.0.0 > > > Stream-stream joins using left outer join gives inconsistent output > The data processed once, is being processed again and gives null value. In > Batch 2, the input data "3" is processed. But again in batch 6, null value > is provided for same data > Steps > In spark-shell > {code:java} > scala> import org.apache.spark.sql.functions.{col, expr} > import org.apache.spark.sql.functions.{col, expr} > scala> import org.apache.spark.sql.streaming.Trigger > import org.apache.spark.sql.streaming.Trigger > scala> val lines_stream1 = spark.readStream. > | format("kafka"). > | option("kafka.bootstrap.servers", "ip:9092"). > | option("subscribe", "topic1"). > | option("includeTimestamp", true). > | load(). > | selectExpr("CAST (value AS String)","CAST(timestamp AS > TIMESTAMP)").as[(String,Timestamp)]. > | select(col("value") as("data"),col("timestamp") > as("recordTime")). > | select("data","recordTime"). > | withWatermark("recordTime", "5 seconds ") > lines_stream1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = > [data: string, recordTime: timestamp] > scala> val lines_stream2 = spark.readStream. > | format("kafka"). > | option("kafka.bootstrap.servers", "ip:9092"). > | option("subscribe", "topic2"). > | option("includeTimestamp", value = true). > | load(). 
> | selectExpr("CAST (value AS String)","CAST(timestamp AS > TIMESTAMP)").as[(String,Timestamp)]. > | select(col("value") as("data1"),col("timestamp") > as("recordTime1")). > | select("data1","recordTime1"). > | withWatermark("recordTime1", "10 seconds ") > lines_stream2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = > [data1: string, recordTime1: timestamp] > scala> val query = lines_stream1.join(lines_stream2, expr ( > | """ > | | data == data1 and > | | recordTime1 >= recordTime and > | | recordTime1 <= recordTime + interval 5 seconds > | """.stripMargin),"left"). > | writeStream. > | option("truncate","false"). > | outputMode("append"). > | format("console").option("checkpointLocation", > "/tmp/leftouter/"). > | trigger(Trigger.ProcessingTime ("5 seconds")). > | start() > query: org.apache.spark.sql.streaming.StreamingQuery = > org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@1a48f55b > {code} > Step2 : Start producing data > kafka-console-producer.sh --broker-list ip:9092 --topic topic1 > >1 > >2 > >3 > >4 > >5 > >aa > >bb > >cc > kafka-console-producer.sh --broker-list ip:9092 --topic topic2 > >2 > >2 > >3 > >4 > >5 > >aa > >cc > >ee > >ee > > Output obtained: > {code:java} > Batch: 0 > --- > ++--+-+---+ > |data|recordTime|data1|recordTime1| > ++--+-+---+ > ++--+-+---+ > --- > Batch: 1 > --- > ++--+-+---+ > |data|recordTime|data1|recordTime1| > ++--+-+---+ > ++--+-+---+ > --- > Batch: 2 > --- > ++---+-+---+ > |data|recordTime |data1|recordTime1| > ++---+-+---+ > |3 |2018-11-22 20:09:35.053|3|2018-11-22 20:09:36.506| > |2 |2018-11-22 20:09:31.613|2|2018-11-22 20:09:33.116| > ++---+-+---+ > --- > Batch: 3 > --- > ++---+-+---+ > |data|recordTime |data1|r
[jira] [Assigned] (SPARK-26154) Stream-stream joins - left outer join gives inconsistent output
[ https://issues.apache.org/jira/browse/SPARK-26154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin reassigned SPARK-26154: -- Assignee: Jungtaek Lim > Stream-stream joins - left outer join gives inconsistent output > --- > > Key: SPARK-26154 > URL: https://issues.apache.org/jira/browse/SPARK-26154 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.2, 3.0.0 > Environment: Spark version - Spark 2.3.2 > OS- Suse 11 >Reporter: Haripriya >Assignee: Jungtaek Lim >Priority: Blocker > Labels: correctness > > Stream-stream joins using left outer join gives inconsistent output > The data processed once, is being processed again and gives null value. In > Batch 2, the input data "3" is processed. But again in batch 6, null value > is provided for same data > Steps > In spark-shell > {code:java} > scala> import org.apache.spark.sql.functions.{col, expr} > import org.apache.spark.sql.functions.{col, expr} > scala> import org.apache.spark.sql.streaming.Trigger > import org.apache.spark.sql.streaming.Trigger > scala> val lines_stream1 = spark.readStream. > | format("kafka"). > | option("kafka.bootstrap.servers", "ip:9092"). > | option("subscribe", "topic1"). > | option("includeTimestamp", true). > | load(). > | selectExpr("CAST (value AS String)","CAST(timestamp AS > TIMESTAMP)").as[(String,Timestamp)]. > | select(col("value") as("data"),col("timestamp") > as("recordTime")). > | select("data","recordTime"). > | withWatermark("recordTime", "5 seconds ") > lines_stream1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = > [data: string, recordTime: timestamp] > scala> val lines_stream2 = spark.readStream. > | format("kafka"). > | option("kafka.bootstrap.servers", "ip:9092"). > | option("subscribe", "topic2"). > | option("includeTimestamp", value = true). > | load(). > | selectExpr("CAST (value AS String)","CAST(timestamp AS > TIMESTAMP)").as[(String,Timestamp)]. 
> | select(col("value") as("data1"),col("timestamp") > as("recordTime1")). > | select("data1","recordTime1"). > | withWatermark("recordTime1", "10 seconds ") > lines_stream2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = > [data1: string, recordTime1: timestamp] > scala> val query = lines_stream1.join(lines_stream2, expr ( > | """ > | | data == data1 and > | | recordTime1 >= recordTime and > | | recordTime1 <= recordTime + interval 5 seconds > | """.stripMargin),"left"). > | writeStream. > | option("truncate","false"). > | outputMode("append"). > | format("console").option("checkpointLocation", > "/tmp/leftouter/"). > | trigger(Trigger.ProcessingTime ("5 seconds")). > | start() > query: org.apache.spark.sql.streaming.StreamingQuery = > org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@1a48f55b > {code} > Step2 : Start producing data > kafka-console-producer.sh --broker-list ip:9092 --topic topic1 > >1 > >2 > >3 > >4 > >5 > >aa > >bb > >cc > kafka-console-producer.sh --broker-list ip:9092 --topic topic2 > >2 > >2 > >3 > >4 > >5 > >aa > >cc > >ee > >ee > > Output obtained: > {code:java} > Batch: 0 > --- > ++--+-+---+ > |data|recordTime|data1|recordTime1| > ++--+-+---+ > ++--+-+---+ > --- > Batch: 1 > --- > ++--+-+---+ > |data|recordTime|data1|recordTime1| > ++--+-+---+ > ++--+-+---+ > --- > Batch: 2 > --- > ++---+-+---+ > |data|recordTime |data1|recordTime1| > ++---+-+---+ > |3 |2018-11-22 20:09:35.053|3|2018-11-22 20:09:36.506| > |2 |2018-11-22 20:09:31.613|2|2018-11-22 20:09:33.116| > ++---+-+---+ > --- > Batch: 3 > --- > ++---+-+---+ > |data|recordTime |data1|recordTime1| > ++---+-+---+ > |4 |2018-11-22 20:09:38.654|4|2018-11
[jira] [Assigned] (SPARK-29766) Aggregate metrics asynchronously in SQL listener
[ https://issues.apache.org/jira/browse/SPARK-29766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29766: - Assignee: Marcelo Masiero Vanzin > Aggregate metrics asynchronously in SQL listener > > > Key: SPARK-29766 > URL: https://issues.apache.org/jira/browse/SPARK-29766 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: Marcelo Masiero Vanzin >Assignee: Marcelo Masiero Vanzin >Priority: Major > > This is a follow up to SPARK-29562. > That change made metrics collection faster, and also sped up metrics > aggregation. But it is still too slow to execute in an event handler, so we > should do it asynchronously to minimize events being dropped by the listener > bus. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29766) Aggregate metrics asynchronously in SQL listener
[ https://issues.apache.org/jira/browse/SPARK-29766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29766. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26405 [https://github.com/apache/spark/pull/26405] > Aggregate metrics asynchronously in SQL listener > > > Key: SPARK-29766 > URL: https://issues.apache.org/jira/browse/SPARK-29766 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: Marcelo Masiero Vanzin >Assignee: Marcelo Masiero Vanzin >Priority: Major > Fix For: 3.0.0 > > > This is a follow up to SPARK-29562. > That change made metrics collection faster, and also sped up metrics > aggregation. But it is still too slow to execute in an event handler, so we > should do it asynchronously to minimize events being dropped by the listener > bus. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
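The pattern described in SPARK-29766, moving expensive metric aggregation off the event-handler thread so the listener bus does not drop events, can be sketched in miniature: the handler only enqueues updates, and a background worker performs the aggregation. This is a hedged plain-Python illustration (the class and method names are invented for the example), not Spark's actual SQL listener code:

```python
# Sketch: the event handler must return quickly, so it only enqueues;
# a single background worker does the slow aggregation off the bus thread.
import queue
import threading

class AsyncMetricsAggregator:
    def __init__(self):
        self._queue = queue.Queue()
        self.totals = {}  # metric name -> aggregated value
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def on_event(self, metric_updates):
        """Called from the listener thread: O(1), no aggregation here."""
        self._queue.put(metric_updates)

    def _drain(self):
        while True:
            updates = self._queue.get()
            if updates is None:  # shutdown sentinel
                break
            for name, value in updates.items():  # the slow part
                self.totals[name] = self.totals.get(name, 0) + value

    def stop(self):
        """Enqueue the sentinel and wait for all queued updates to be applied."""
        self._queue.put(None)
        self._worker.join()

agg = AsyncMetricsAggregator()
agg.on_event({"rows": 10, "bytes": 512})
agg.on_event({"rows": 5})
agg.stop()
print(agg.totals)  # {'rows': 15, 'bytes': 512}
```

Because the queue is FIFO and a single worker drains it, joining after the sentinel guarantees every update has been applied, which is the property the listener fix relies on.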
[jira] [Resolved] (SPARK-29770) Allow setting spark.app.id when spark-submit for Spark on Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-29770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-29770. Resolution: Won't Fix See comments in PR. > Allow setting spark.app.id when spark-submit for Spark on Kubernetes > > > Key: SPARK-29770 > URL: https://issues.apache.org/jira/browse/SPARK-29770 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Liu Runzhong >Priority: Minor > Labels: easyfix > > when the user provides `spark.app.id` by `spark-submit`, it's actually > doing nothing to change the `spark.app.id`, as `spark.app.id` can only be set > by `kubernetesAppId` every time, which makes the users feel confused. > [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L196] > Knowing that `spark.app.id` would be labeled to Driver/Executor pods and > other resources and the strict limitation of the label values, but I think it > would be more flexible to users to decide how to generate the `spark.app.id` > by themselves. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29672) remove python2 tests and test infra
[ https://issues.apache.org/jira/browse/SPARK-29672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971868#comment-16971868 ] Shane Knapp commented on SPARK-29672: - the PR is complete, all tests pass and i'm waiting for the word to merge it in to master! > remove python2 tests and test infra > --- > > Key: SPARK-29672 > URL: https://issues.apache.org/jira/browse/SPARK-29672 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Assignee: Shane Knapp >Priority: Major > > python 2.7 is EOL jan 1st 2020: [https://github.com/python/devguide/pull/344] > it's time, at least for 3.0+ to remove python 2.7 test support and migrate > the test execution framework to python 3.6. > this PR ([https://github.com/apache/spark/pull/26330]) does all of the above. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29803) remove all instances of 'from __future__ import print_function'
[ https://issues.apache.org/jira/browse/SPARK-29803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971866#comment-16971866 ] Shane Knapp commented on SPARK-29803: - so after reading the python2 EOL roadmap/announcement it looks like (at some point in early 2020) that we WILL be dropping python2 support in 3.0+: https://spark.apache.org/news/plan-for-dropping-python-2-support.html so, i believe that this ticket should be dealt with as part of the release of spark 3.x without python2 support (and spark 2.4 will not be touched). unless, of course, i'm misunderstanding something... which could be entirely true. :) > remove all instances of 'from __future__ import print_function' > > > Key: SPARK-29803 > URL: https://issues.apache.org/jira/browse/SPARK-29803 > Project: Spark > Issue Type: Sub-task > Components: Build, PySpark, Tests >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Priority: Major > Attachments: print_function_list.txt > > > there are 135 python files in the spark repo that need to have `from > __future__ import print_function` removed (see attached file > 'print_function_list.txt'). > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29850) sort-merge-join an empty table should not memory leak
[ https://issues.apache.org/jira/browse/SPARK-29850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29850: -- Affects Version/s: 2.3.4 > sort-merge-join an empty table should not memory leak > - > > Key: SPARK-29850 > URL: https://issues.apache.org/jira/browse/SPARK-29850 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.4, 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29850) sort-merge-join an empty table should not memory leak
[ https://issues.apache.org/jira/browse/SPARK-29850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29850: -- Affects Version/s: 2.4.4 > sort-merge-join an empty table should not memory leak > - > > Key: SPARK-29850 > URL: https://issues.apache.org/jira/browse/SPARK-29850 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27189) Add Executor metrics and memory usage instrumentation to the metrics system
[ https://issues.apache.org/jira/browse/SPARK-27189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luca Canali updated SPARK-27189: Summary: Add Executor metrics and memory usage instrumentation to the metrics system (was: Add Executor level memory usage metrics to the metrics system) > Add Executor metrics and memory usage instrumentation to the metrics system > --- > > Key: SPARK-27189 > URL: https://issues.apache.org/jira/browse/SPARK-27189 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Luca Canali >Priority: Minor > Attachments: Example_dashboard_Spark_Memory_Metrics.PNG > > > This proposes to add instrumentation of memory usage via the Spark > Dropwizard/Codahale metrics system. Memory usage metrics are available via > the Executor metrics, recently implemented as detailed in > https://issues.apache.org/jira/browse/SPARK-23206. > Making memory usage metrics available via the Spark Dropwizard metrics system > allows improving Spark performance dashboards and studying memory usage, as in > the attached example graph. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29805) Enable nested schema pruning and pruning on expressions by default
[ https://issues.apache.org/jira/browse/SPARK-29805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-29805. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26443 [https://github.com/apache/spark/pull/26443] > Enable nested schema pruning and pruning on expressions by default > -- > > Key: SPARK-29805 > URL: https://issues.apache.org/jira/browse/SPARK-29805 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.4 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29801) ML models unify toString method
[ https://issues.apache.org/jira/browse/SPARK-29801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29801. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26439 [https://github.com/apache/spark/pull/26439] > ML models unify toString method > --- > > Key: SPARK-29801 > URL: https://issues.apache.org/jira/browse/SPARK-29801 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > Fix For: 3.0.0 > > > ML models should extend the \{{toString}} method to expose basic information. > Currently some algorithms (GBT/RF/LoR) have done this, while others have not yet. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29801) ML models unify toString method
[ https://issues.apache.org/jira/browse/SPARK-29801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29801: - Assignee: zhengruifeng > ML models unify toString method > --- > > Key: SPARK-29801 > URL: https://issues.apache.org/jira/browse/SPARK-29801 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > > ML models should extend the \{{toString}} method to expose basic information. > Currently some algorithms (GBT/RF/LoR) have done this, while others have not yet. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29762) GPU Scheduling - default task resource amount to 1
[ https://issues.apache.org/jira/browse/SPARK-29762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971798#comment-16971798 ] Thomas Graves commented on SPARK-29762: --- this is actually more complex than you might think because the resource configs are just configs. So if you have spark.executor.resource.gpu.amount, for instance, the corresponding task config would be spark.task.resource.gpu.amount, where gpu could be any resource. The way the code is written now, it just grabs all the resources and iterates over them in various places, assuming you have specified a task requirement for each executor resource. If you remove that assumption you now have to be careful about what you are iterating over; really you have to use the resources from the executor configs, not the task configs. But you still have to read the task configs, and if a resource isn't there, default it to 1. > GPU Scheduling - default task resource amount to 1 > -- > > Key: SPARK-29762 > URL: https://issues.apache.org/jira/browse/SPARK-29762 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > Default the task-level resource configs (for gpu/fpga, etc.) to 1. So if the > user specifies the executor resource then, to make it more user friendly, let's > have the task resource config default to 1. This is OK right now since we > require resources to have an address. It also matches what we do for the > spark.task.cpus configs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
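The defaulting the comment above argues for, iterating over the executor resource configs and falling back to 1 when the matching task config is absent, can be sketched as follows. This is a hedged plain-Python illustration of the config lookup, with an invented function name, not Spark's actual internals:

```python
# Sketch: derive per-task resource amounts from a flat config map.
# Iterate over the *executor* resource configs (the source of truth for
# which resources exist), and read the matching task config with a
# default of 1, mirroring the spark.task.cpus semantics.
def task_resource_amounts(conf):
    EXEC_PREFIX = "spark.executor.resource."
    TASK_PREFIX = "spark.task.resource."
    SUFFIX = ".amount"
    amounts = {}
    for key in conf:
        if key.startswith(EXEC_PREFIX) and key.endswith(SUFFIX):
            resource = key[len(EXEC_PREFIX):-len(SUFFIX)]  # e.g. "gpu"
            task_key = f"{TASK_PREFIX}{resource}{SUFFIX}"
            amounts[resource] = int(conf.get(task_key, 1))  # default to 1
    return amounts

conf = {
    "spark.executor.resource.gpu.amount": "2",   # executor gpus set...
    "spark.executor.resource.fpga.amount": "1",
    "spark.task.resource.fpga.amount": "1",
    # ...but no spark.task.resource.gpu.amount: should default to 1
}
print(task_resource_amounts(conf))  # {'gpu': 1, 'fpga': 1}
```

Keying the iteration on executor configs (not task configs) is exactly the care the comment calls for: a task config with no matching executor config should not conjure up a resource.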
[jira] [Assigned] (SPARK-29825) Add join conditions in join-related tests of SQLQueryTestSuite
[ https://issues.apache.org/jira/browse/SPARK-29825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29825: - Assignee: Takeshi Yamamuro > Add join conditions in join-related tests of SQLQueryTestSuite > -- > > Key: SPARK-29825 > URL: https://issues.apache.org/jira/browse/SPARK-29825 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29825) Add join conditions in join-related tests of SQLQueryTestSuite
[ https://issues.apache.org/jira/browse/SPARK-29825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29825. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26459 [https://github.com/apache/spark/pull/26459] > Add join conditions in join-related tests of SQLQueryTestSuite > -- > > Key: SPARK-29825 > URL: https://issues.apache.org/jira/browse/SPARK-29825 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29808) StopWordsRemover should support multi-cols
[ https://issues.apache.org/jira/browse/SPARK-29808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971757#comment-16971757 ] Huaxin Gao commented on SPARK-29808: I will work on this. [~podongfeng] > StopWordsRemover should support multi-cols > -- > > Key: SPARK-29808 > URL: https://issues.apache.org/jira/browse/SPARK-29808 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Minor > > As a basic Transformer, StopWordsRemover should support multi-cols. > Param {color:#93a6f5}stopWords{color} can be applied across all columns. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
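The requested behavior, one shared stopWords list applied across several token columns, can be illustrated without Spark at all. This is a hedged plain-Python sketch (the function name and the `_filtered` output suffix are invented), not the ml.feature.StopWordsRemover API:

```python
# Sketch: apply one stop-word set to multiple token columns of a row,
# producing one filtered output column per input column.
STOP_WORDS = {"a", "the", "is"}

def remove_stop_words(row, input_cols, stop_words=STOP_WORDS):
    """Return a copy of the row with an '<col>_filtered' list per input column."""
    out = dict(row)
    for col in input_cols:
        out[col + "_filtered"] = [t for t in row[col] if t.lower() not in stop_words]
    return out

row = {"title": ["the", "Spark", "guide"], "body": ["spark", "is", "fast"]}
result = remove_stop_words(row, ["title", "body"])
print(result["title_filtered"], result["body_filtered"])
# ['Spark', 'guide'] ['spark', 'fast']
```

The real Spark change would presumably expose this through inputCols/outputCols params on the Transformer, matching other multi-column estimators.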
[jira] [Commented] (SPARK-29776) rpad returning invalid value when parameter is empty
[ https://issues.apache.org/jira/browse/SPARK-29776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971739#comment-16971739 ] Hyukjin Kwon commented on SPARK-29776: -- [~Ankitraj] have you made some progresses on this? > rpad returning invalid value when parameter is empty > > > Key: SPARK-29776 > URL: https://issues.apache.org/jira/browse/SPARK-29776 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > As per rpad definition > rpad > rpad(str, len, pad) - Returns str, right-padded with pad to a length of len > If str is longer than len, the return value is shortened to len characters. > *In case of empty pad string, the return value is null.* > Below is Example > In Spark: > {code} > 0: jdbc:hive2://10.18.19.208:23040/default> SELECT rpad('hi', 5, ''); > ++ > | rpad(hi, 5, ) | > ++ > | hi | > ++ > {code} > It should return NULL as per definition. > > Hive behavior is correct as per definition it returns NULL when pad is empty > String > INFO : Concurrency mode is disabled, not creating a lock manager > {code} > +---+ > | _c0 | > +---+ > | NULL | > +---+ > {code} > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
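The semantics the report argues for can be written as a small reference function. This is a hedged plain-Python sketch of the *desired* (Hive-matching) behavior, not Spark's implementation; the point of contention is the empty-pad branch:

```python
# Reference sketch of rpad(str, len, pad): right-pad to length n,
# truncate when the input is longer, and return None (SQL NULL)
# when the pad string is empty, matching the documented/Hive behavior.
def rpad(s, n, pad):
    if s is None or pad is None:
        return None
    if len(s) >= n:
        return s[:n]        # shorten to n characters
    if pad == "":
        return None         # the disputed case: empty pad -> NULL
    padded = s
    while len(padded) < n:
        padded += pad
    return padded[:n]

print(rpad("hi", 5, "ab"))     # 'hiaba'
print(rpad("hi", 5, ""))       # None (Spark currently returns 'hi' instead)
print(rpad("hello!", 3, "x"))  # 'hel'
```

Per the report, Spark's `SELECT rpad('hi', 5, '')` returns `hi` where this reference (and Hive) returns NULL.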
[jira] [Commented] (SPARK-29773) Unable to process empty ORC files in Hive Table using Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-29773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971740#comment-16971740 ] Hyukjin Kwon commented on SPARK-29773: -- ping [~aermakov], have you tried this out? > Unable to process empty ORC files in Hive Table using Spark SQL > --- > > Key: SPARK-29773 > URL: https://issues.apache.org/jira/browse/SPARK-29773 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 > Environment: Centos 7, Spark 2.3.1, Hive 2.3.0 >Reporter: Alexander Ermakov >Priority: Major > > Unable to process empty ORC files in Hive Table using Spark SQL. It seems > that there is a problem with the class > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits() > Stack trace: > {code:java} > 19/10/30 22:29:54 ERROR SparkSQLDriver: Failed in [select distinct > _tech_load_dt from dl_raw.tpaccsieee_ut_data_address] > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: > Exchange hashpartitioning(_tech_load_dt#1374, 200) > +- *(1) HashAggregate(keys=[_tech_load_dt#1374], functions=[], > output=[_tech_load_dt#1374]) >+- HiveTableScan [_tech_load_dt#1374], HiveTableRelation > `dl_raw`.`tpaccsieee_ut_data_address`, > org.apache.hadoop.hive.ql.io.orc.OrcSerde, [address#1307, address_9zp#1308, > address_adm#1309, address_md#1310, adress_doc#1311, building#1312, > change_date_addr_el#1313, change_date_okato#1314, change_date_окато#1315, > city#1316, city_id#1317, cnv_cont_id#1318, code_intercity#1319, > code_kladr#1320, code_plan1#1321, date_act#1322, date_change#1323, > date_prz_incorrect_code_kladr#1324, date_record#1325, district#1326, > district_id#1327, etaj#1328, e_plan#1329, fax#1330, ... 
44 more fields] > at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:119) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:371) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:150) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:324) > at > org.apache.spark.sql.execution.QueryExecution.hiveResultString(QueryExecution.scala:122) > at > 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver$$anonfun$run$1.apply(SparkSQLDriver.scala:64) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver$$anonfun$run$1.apply(SparkSQLDriver.scala:64) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:364) > at > org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:272) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeM
[jira] [Commented] (SPARK-29792) SQL metrics cannot be updated to subqueries in AQE
[ https://issues.apache.org/jira/browse/SPARK-29792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971738#comment-16971738 ] Hyukjin Kwon commented on SPARK-29792: -- What does this JIRA mean? Can you fill the description > SQL metrics cannot be updated to subqueries in AQE > -- > > Key: SPARK-29792 > URL: https://issues.apache.org/jira/browse/SPARK-29792 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wei Xue >Assignee: Ke Jia >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29797) Read key-value metadata in Parquet files written by Apache Arrow
[ https://issues.apache.org/jira/browse/SPARK-29797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971737#comment-16971737 ] Hyukjin Kwon commented on SPARK-29797: -- ping [~isaacm], do you have any suggestions about how we should store the metadata? Otherwise, let's leave this closed. > Read key-value metadata in Parquet files written by Apache Arrow > > > Key: SPARK-29797 > URL: https://issues.apache.org/jira/browse/SPARK-29797 > Project: Spark > Issue Type: New Feature > Components: Java API, PySpark >Affects Versions: 2.4.4 > Environment: Apache Arrow 0.14.1 built on Windows x86. > >Reporter: Isaac Myers >Priority: Major > Labels: features > Attachments: minimal_working_example.cpp > > > Key-value (user) metadata written to a Parquet file from Apache Arrow C++ is > not readable in Spark (PySpark or Java API). I can only find field-level > metadata dictionaries in the schema and no other functions in the API that > indicate the presence of file-level key-value metadata. The attached code > demonstrates creation and retrieval of file-level metadata using the Apache > Arrow API. 
> {code:java} > #include #include #include #include #include > #include #include > #include #include > //#include > int main(int argc, char* argv[]){ /* Create > Parquet File **/ arrow::Status st; > arrow::MemoryPool* pool = arrow::default_memory_pool(); > // Create Schema and fields with metadata > std::vector> fields; > std::unordered_map a_keyval; a_keyval["unit"] = > "sec"; a_keyval["note"] = "not the standard millisecond unit"; > arrow::KeyValueMetadata a_md(a_keyval); std::shared_ptr a_field > = arrow::field("a", arrow::int16(), false, a_md.Copy()); > fields.push_back(a_field); > std::unordered_map b_keyval; b_keyval["unit"] = > "ft"; arrow::KeyValueMetadata b_md(b_keyval); std::shared_ptr > b_field = arrow::field("b", arrow::int16(), false, b_md.Copy()); > fields.push_back(b_field); > std::shared_ptr schema = arrow::schema(fields); > // Add metadata to schema. std::unordered_map > schema_keyval; schema_keyval["classification"] = "Type 0"; > arrow::KeyValueMetadata schema_md(schema_keyval); schema = > schema->AddMetadata(schema_md.Copy()); > // Build arrays of data and add to Table. 
const int64_t rowgroup_size = 100; > std::vector a_data(rowgroup_size, 0); std::vector > b_data(rowgroup_size, 0); > for (int16_t i = 0; i < rowgroup_size; i++) { a_data[i] = i; b_data[i] = > rowgroup_size - i; } arrow::Int16Builder a_bldr(pool); arrow::Int16Builder > b_bldr(pool); st = a_bldr.Resize(rowgroup_size); if (!st.ok()) return 1; st = > b_bldr.Resize(rowgroup_size); if (!st.ok()) return 1; > st = a_bldr.AppendValues(a_data); if (!st.ok()) return 1; > st = b_bldr.AppendValues(b_data); if (!st.ok()) return 1; > std::shared_ptr a_arr_ptr; std::shared_ptr > b_arr_ptr; > arrow::ArrayVector arr_vec; st = a_bldr.Finish(&a_arr_ptr); if (!st.ok()) > return 1; arr_vec.push_back(a_arr_ptr); st = b_bldr.Finish(&b_arr_ptr); if > (!st.ok()) return 1; arr_vec.push_back(b_arr_ptr); > std::shared_ptr table = arrow::Table::Make(schema, arr_vec); > // Test metadata printf("\nMetadata from original schema:\n"); > printf("%s\n", schema->metadata()->ToString().c_str()); printf("%s\n", > schema->field(0)->metadata()->ToString().c_str()); printf("%s\n", > schema->field(1)->metadata()->ToString().c_str()); > std::shared_ptr table_schema = table->schema(); > printf("\nMetadata from schema retrieved from table (should be the > same):\n"); printf("%s\n", table_schema->metadata()->ToString().c_str()); > printf("%s\n", table_schema->field(0)->metadata()->ToString().c_str()); > printf("%s\n", table_schema->field(1)->metadata()->ToString().c_str()); > // Open file and write table. std::string file_name = "test.parquet"; > std::shared_ptr ostream; st = > arrow::io::FileOutputStream::Open(file_name, &ostream); if (!st.ok()) return > 1; > std::unique_ptr writer; > std::shared_ptr props = > parquet::default_writer_properties(); st = > parquet::arrow::FileWriter::Open(*schema, pool, ostream, props, &writer); if > (!st.ok()) return 1; st = writer->WriteTable(*table, rowgroup_size); if > (!st.ok()) return 1; > // Close file and stream. 
st = writer->Close(); if (!st.ok()) return 1; st = > ostream->Close(); if (!st.ok()) return 1; > /* Read Parquet File > **/ > // Create new memory pool. Not sure if this is necessary. > //arrow::MemoryPool* pool2 = arrow::default_memory_pool(); > // Open file reader. std::shared_ptr input_file; st > = arrow::io::ReadableFile::Open(file_name, pool, &input_file); if (!st.ok()) > re
[jira] [Commented] (SPARK-29673) upgrade jenkins pypy to PyPy3.6 v7.2.0
[ https://issues.apache.org/jira/browse/SPARK-29673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971736#comment-16971736 ] Hyukjin Kwon commented on SPARK-29673: -- Thanks [~shaneknapp]. I will make a try soon in probably a couple of weeks. > upgrade jenkins pypy to PyPy3.6 v7.2.0 > -- > > Key: SPARK-29673 > URL: https://issues.apache.org/jira/browse/SPARK-29673 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Assignee: Shane Knapp >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29803) remove all instances of 'from __future__ import print_function'
[ https://issues.apache.org/jira/browse/SPARK-29803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971735#comment-16971735 ] Hyukjin Kwon commented on SPARK-29803: -- {quote} as spark 3.0+ technically does NOT support python versions earlier than 3.5 {quote} Hey, [~shaneknapp], just to make sure we're on the same page, I think Spark 3.0 will still support Python 2.7, 3.4 and 3.5 although they are deprecated. > remove all instances of 'from __future__ import print_function' > > > Key: SPARK-29803 > URL: https://issues.apache.org/jira/browse/SPARK-29803 > Project: Spark > Issue Type: Sub-task > Components: Build, PySpark, Tests >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Priority: Major > Attachments: print_function_list.txt > > > there are 135 python files in the spark repo that need to have `from > __future__ import print_function` removed (see attached file > 'print_function_list.txt'). > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
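The per-file transformation this subtask describes, removing the now-redundant `from __future__ import print_function` line, is mechanical. A hedged sketch of one way to do it (the actual repo cleanup was presumably done differently, and this simple pattern would not handle combined imports like `from __future__ import print_function, division`):

```python
# Sketch: strip the bare `from __future__ import print_function` line
# (including its trailing newline) from Python source text.
import re

FUTURE_RE = re.compile(r"^from __future__ import print_function\n?", re.MULTILINE)

def strip_print_function_import(source: str) -> str:
    return FUTURE_RE.sub("", source)

src = "from __future__ import print_function\nprint('hello')\n"
result = strip_print_function_import(src)
print(result)  # prints: print('hello')
```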
[jira] [Commented] (SPARK-29804) Spark-shell is failing on YARN mode
[ https://issues.apache.org/jira/browse/SPARK-29804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971731#comment-16971731 ] Hyukjin Kwon commented on SPARK-29804: -- Please don't set target version and Critical+ which are usually reserved for committers. Also, please don't set fix version which is usually set when it's actually fixed. Lastly, please just don't copy and paste error messages. If this is an issue, please provide full details with a reproducer. If this is a question, it should go to mailing list - https://spark.apache.org/community.html > Spark-shell is failing on YARN mode > --- > > Key: SPARK-29804 > URL: https://issues.apache.org/jira/browse/SPARK-29804 > Project: Spark > Issue Type: Question > Components: YARN >Affects Versions: 2.4.4 > Environment: Spark2.4.4, Apache Hadoop 3.1.2 >Reporter: Srujan A >Priority: Major > > I am trying to run the spark-shell on YARN mode from containers and it's > failing on below reason. Please help me out. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29804) Spark-shell is failing on YARN mode
[ https://issues.apache.org/jira/browse/SPARK-29804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29804. -- Fix Version/s: (was: 2.4.4) Resolution: Invalid > Spark-shell is failing on YARN mode > --- > > Key: SPARK-29804 > URL: https://issues.apache.org/jira/browse/SPARK-29804 > Project: Spark > Issue Type: Question > Components: YARN >Affects Versions: 2.4.4 > Environment: Spark2.4.4, Apache Hadoop 3.1.2 >Reporter: Srujan A >Priority: Major > > I am trying to run the spark-shell on YARN mode from containers and it's > failing on below reason. Please help me out. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29804) Spark-shell is failing on YARN mode
[ https://issues.apache.org/jira/browse/SPARK-29804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29804: - Target Version/s: (was: 2.4.4) > Spark-shell is failing on YARN mode > --- > > Key: SPARK-29804 > URL: https://issues.apache.org/jira/browse/SPARK-29804 > Project: Spark > Issue Type: Question > Components: YARN >Affects Versions: 2.4.4 > Environment: Spark2.4.4, Apache Hadoop 3.1.2 >Reporter: Srujan A >Priority: Blocker > Fix For: 2.4.4 > > > I am trying to run the spark-shell on YARN mode from containers and it's > failing on below reason. Please help me out. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29804) Spark-shell is failing on YARN mode
[ https://issues.apache.org/jira/browse/SPARK-29804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29804: - Priority: Major (was: Blocker) > Spark-shell is failing on YARN mode > --- > > Key: SPARK-29804 > URL: https://issues.apache.org/jira/browse/SPARK-29804 > Project: Spark > Issue Type: Question > Components: YARN >Affects Versions: 2.4.4 > Environment: Spark2.4.4, Apache Hadoop 3.1.2 >Reporter: Srujan A >Priority: Major > Fix For: 2.4.4 > > > I am trying to run the spark-shell on YARN mode from containers and it's > failing on below reason. Please help me out. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29806) Using multiline option for a JSON file which is not multiline results in silent truncation of data.
[ https://issues.apache.org/jira/browse/SPARK-29806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971730#comment-16971730 ] Hyukjin Kwon commented on SPARK-29806: -- {{multiline}} in the JSON source currently only supports a single JSON object or a JSON array. > Using multiline option for a JSON file which is not multiline results in > silent truncation of data. > --- > > Key: SPARK-29806 > URL: https://issues.apache.org/jira/browse/SPARK-29806 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: Dilip Biswal >Priority: Major > > The content of the input JSON file: > {code:java} > {"name":"John", "id":"100"} > {"name":"Marry","id":"200"}{code} > The above is a valid JSON file, but every record is on a single line. Trying > to read this file > with the multiline option in FAILFAST mode results in data truncation > without any error. > {code:java} > scala> spark.read.option("multiLine", true).option("mode", > "FAILFAST").format("json").load("/tmp/json").show(false) > +---+----+ > |id |name| > +---+----+ > |100|John| > +---+----+ > scala> spark.read.option("mode", > "FAILFAST").format("json").load("/tmp/json").show(false) > +---+-----+ > |id |name | > +---+-----+ > |100|John | > |200|Marry| > +---+-----+{code} > I think Spark should return an error in this case, especially in FAILFAST > mode. This is a common user error, and we should not silently truncate data. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
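The truncation above can be reproduced outside of Spark: a single-document JSON parser (which is effectively what {{multiLine}} mode applies) consumes the first JSON value and silently ignores the rest of the input, while a line-delimited parse sees every record. A minimal sketch using only the Python standard library:

```python
import json

# Two newline-delimited JSON records, as in the report.
data = '{"name":"John", "id":"100"}\n{"name":"Marry","id":"200"}'

# A single-document parse (what multiLine mode effectively does) stops
# after the first JSON value; the remainder is silently left unread.
first, end = json.JSONDecoder().raw_decode(data)
leftover = data[end:].strip()

# A line-delimited parse (Spark's default JSON mode) sees both records.
records = [json.loads(line) for line in data.splitlines()]

print(first)          # only the first record
print(len(records))   # 2
```

This is why surfacing an error (at least under FAILFAST) is preferable to the current behavior: the parser knows there is unconsumed input left over.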
[jira] [Commented] (SPARK-29830) PySpark.context.Sparkcontext.binaryfiles improved memory with buffer
[ https://issues.apache.org/jira/browse/SPARK-29830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971727#comment-16971727 ] Hyukjin Kwon commented on SPARK-29830: -- (Please avoid setting the target version; that field is usually reserved for committers.) > PySpark.context.Sparkcontext.binaryfiles improved memory with buffer > > > Key: SPARK-29830 > URL: https://issues.apache.org/jira/browse/SPARK-29830 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.4 >Reporter: Jörn Franke >Priority: Major > > At the moment, PySpark reads binary files directly into a byte array. This > means it reads the full binary file into memory immediately, which 1) is > memory-inefficient and 2) differs from the Scala implementation (see the PySpark source at > https://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/context.html#SparkContext.binaryFiles). > In Scala, Spark returns a PortableDataStream, which means the application > does not need to read the full content of the stream into memory to work on it > (see > https://spark.apache.org/docs/2.4.0/api/scala/index.html#org.apache.spark.SparkContext). > Hence, it is proposed to adapt the PySpark implementation to return something > similar to a PortableDataStream in Scala (e.g. > [BytesIO|https://docs.python.org/3/library/io.html#io.BytesIO]). > Reading binary files in an efficient manner is crucial for many IoT > applications, but potentially also for other fields (e.g. disk image analysis in > forensics). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
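The shape of the proposal can be sketched in plain Python. The {{PortableStream}} class and its {{open()}} method below are hypothetical names, not part of any Spark API: the point is that the wrapper hands back a buffered stream on demand, so the caller processes fixed-size chunks instead of one large byte array.

```python
import os
import tempfile

# Hypothetical PortableDataStream-like wrapper: rather than materializing
# the whole file as bytes (the current binaryFiles behavior in PySpark),
# expose a factory that opens a buffered binary stream on demand.
class PortableStream:
    def __init__(self, path):
        self.path = path

    def open(self, buffer_size=64 * 1024):
        return open(self.path, "rb", buffering=buffer_size)

# Demo: write a 1000-byte file, then consume it in 256-byte chunks so that
# at most one chunk is resident in memory at a time.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 1000)
    path = f.name

stream = PortableStream(path)
total = 0
with stream.open() as fh:
    while True:
        chunk = fh.read(256)   # process fixed-size chunks, not the whole file
        if not chunk:
            break
        total += len(chunk)
os.remove(path)

print(total)  # 1000
```

Whether the real fix wraps py4j's view of {{PortableDataStream}} or an {{io.BytesIO}}-backed buffer is an open design question in the ticket; the sketch only illustrates the chunked-consumption contract being asked for.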
[jira] [Updated] (SPARK-29830) PySpark.context.Sparkcontext.binaryfiles improved memory with buffer
[ https://issues.apache.org/jira/browse/SPARK-29830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29830: - Target Version/s: (was: 3.0.0) > PySpark.context.Sparkcontext.binaryfiles improved memory with buffer > > > Key: SPARK-29830 > URL: https://issues.apache.org/jira/browse/SPARK-29830 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.4 >Reporter: Jörn Franke >Priority: Major > > At the moment, Pyspark reads binary files into a byte array directly. This > means it reads the full binary file immediately into memory, which is 1) > memory in-efficient 2) differs from the Scala implementation (see pyspark > here: > [https://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/context.html#SparkContext.binaryFiles). > > |https://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/context.html#SparkContext.binaryFiles] > In Scala, Spark returns a PortableDataStream, which means the application > does not need to read the full content of the stream in memory to work on it > (see > [https://spark.apache.org/docs/2.4.0/api/scala/index.html#org.apache.spark.SparkContext).] > > Hence, it is proposed to adapt the Pyspark implementation to return something > similar to a PortableDataStream in Scala (e.g. > [BytesIO|[https://docs.python.org/3/library/io.html#io.BytesIO].] > > Reading binary files in an efficient manner is crucial for many IoT > applications, but potentially also other fields (e.g. disk image analysis in > forensics). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22340) pyspark setJobGroup doesn't match java threads
[ https://issues.apache.org/jira/browse/SPARK-22340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971723#comment-16971723 ] Ruslan Dautkhanov commented on SPARK-22340: --- Glad to see this is solved. A nice side-effect should be somewhat better performance on some cases involving heavy python-java communication on multi-numa/ multi-socket configurations. With static threads, Linux kernel will actually have a chance to schedule threads on processors/cores that are more local to data's numa placement. > pyspark setJobGroup doesn't match java threads > -- > > Key: SPARK-22340 > URL: https://issues.apache.org/jira/browse/SPARK-22340 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.2 >Reporter: Leif Mortenson >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > > > With pyspark, {{sc.setJobGroup}}'s documentation says > {quote} > Assigns a group ID to all the jobs started by this thread until the group ID > is set to a different value or cleared. > {quote} > However, this doesn't appear to be associated with Python threads, only with > Java threads. As such, a Python thread which calls this and then submits > multiple jobs doesn't necessarily get its jobs associated with any particular > spark job group. For example: > {code} > def run_jobs(): > sc.setJobGroup('hello', 'hello jobs') > x = sc.range(100).sum() > y = sc.range(1000).sum() > return x, y > import concurrent.futures > with concurrent.futures.ThreadPoolExecutor() as executor: > future = executor.submit(run_jobs) > sc.cancelJobGroup('hello') > future.result() > {code} > In this example, depending how the action calls on the Python side are > allocated to Java threads, the jobs for {{x}} and {{y}} won't necessarily be > assigned the job group {{hello}}. > First, we should clarify the docs if this truly is the case. 
> Second, it would be really helpful if we could make the job group assignment > reliable for a Python thread, though I’m not sure the best way to do this. > As it stands, job groups are pretty useless from the pyspark side, if we > can't rely on this fact. > My only idea so far is to mimic the TLS behavior on the Python side and then > patch every point where job submission may take place to pass that in, but > this feels pretty brittle. In my experience with py4j, controlling threading > there is a challenge. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
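The TLS idea floated at the end of the report can be sketched without Spark at all. This is a hedged illustration, not PySpark code: {{set_job_group}} and {{current_job_group}} are hypothetical helpers showing how a job group kept in Python thread-local storage stays correctly scoped per Python thread, which is exactly the property the JVM-thread-based implementation fails to give.

```python
import threading

# Keep the job group in Python thread-local storage and read it at
# submission time, so each Python thread sees only its own group,
# regardless of which JVM thread would serve the call.
_local = threading.local()

def set_job_group(group_id):
    _local.group = group_id

def current_job_group():
    return getattr(_local, "group", None)

results = {}

def worker(name, group):
    set_job_group(group)
    # ...jobs submitted here would read current_job_group() and tag
    # themselves with this thread's group...
    results[name] = current_job_group()

threads = [
    threading.Thread(target=worker, args=(f"t{i}", f"group-{i}"))
    for i in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)  # each thread observed only its own group
```

The brittle part the reporter anticipates is the second half: every job-submission path would need to read this thread-local value and pass it through to the JVM, which is what the eventual fix had to wire up.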
[jira] [Commented] (SPARK-29838) PostgreSQL dialect: cast to timestamp
[ https://issues.apache.org/jira/browse/SPARK-29838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971680#comment-16971680 ] Aman Omer commented on SPARK-29838: --- Sure, I have started coding for this one and analyzed https://issues.apache.org/jira/browse/SPARK-29840 , https://issues.apache.org/jira/browse/SPARK-29842 . I will soon raise PR. [~Ngone51] you can plan your work accordingly and kindly inform (in comments) before starting to avoid duplicate efforts. > PostgreSQL dialect: cast to timestamp > - > > Key: SPARK-29838 > URL: https://issues.apache.org/jira/browse/SPARK-29838 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: jobit mathew >Priority: Major > > Make SparkSQL's cast to timestamp behavior be consistent with PostgreSQL when > spark.sql.dialect is configured as PostgreSQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29844) Improper unpersist strategy in ml.recommendation.ASL.train
[ https://issues.apache.org/jira/browse/SPARK-29844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971654#comment-16971654 ] Sean R. Owen commented on SPARK-29844: -- Nice, I wonder what else CacheCheck will turn up? I think #1 isn't a problem in master, at least judging by the PR. #2 seems valid. Yes, the caller has to deal with unpersisting these if desired. > Improper unpersist strategy in ml.recommendation.ASL.train > -- > > Key: SPARK-29844 > URL: https://issues.apache.org/jira/browse/SPARK-29844 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.3 >Reporter: Dong Wang >Priority: Minor > > In ml.recommendation.ALS.train(), there are many intermediate RDDs. At the > end of the method, these RDDs invoke unpersist(), but the timing of the > unpersist calls is not right, which causes recomputation and wasted memory. > {code:scala} > val userIdAndFactors = userInBlocks > .mapValues(_.srcIds) > .join(userFactors) > .mapPartitions({ items => > items.flatMap { case (_, (ids, factors)) => > ids.view.zip(factors) > } > // Preserve the partitioning because IDs are consistent with the > partitioners in userInBlocks > // and userFactors. > }, preservesPartitioning = true) > .setName("userFactors") > .persist(finalRDDStorageLevel) // Missing unpersist, but hard to fix > val itemIdAndFactors = itemInBlocks > .mapValues(_.srcIds) > .join(itemFactors) > .mapPartitions({ items => > items.flatMap { case (_, (ids, factors)) => > ids.view.zip(factors) > } > }, preservesPartitioning = true) > .setName("itemFactors") > .persist(finalRDDStorageLevel) // Missing unpersist, but hard to fix > if (finalRDDStorageLevel != StorageLevel.NONE) { > userIdAndFactors.count() > itemFactors.unpersist() // Premature unpersist > itemIdAndFactors.count() > userInBlocks.unpersist() // Lagging unpersist > userOutBlocks.unpersist() // Lagging unpersist > itemInBlocks.unpersist() > itemOutBlocks.unpersist() // Lagging unpersist > blockRatings.unpersist() // Lagging unpersist > } > (userIdAndFactors, itemIdAndFactors) > } > {code} > 1. itemFactors is unpersisted too early. itemIdAndFactors.count() will use > itemFactors, so itemFactors will be recomputed. > 2. userInBlocks, userOutBlocks, itemOutBlocks, and blockRatings are unpersisted too > late. The final action, itemIdAndFactors.count(), will not use these RDDs, so > they can be unpersisted before it to save memory. > By the way, itemIdAndFactors is persisted here but is never unpersisted > until the application ends. This may hurt performance, but I think it's > hard to fix. > This issue is reported by our tool CacheCheck, which is used to dynamically > detect persist()/unpersist() API misuses. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
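The premature-unpersist problem described above can be modeled in a few lines. This is a toy lineage model with illustrative names, not Spark's RDD machinery: an action on a dependent dataset whose parent has already been unpersisted forces the parent to be recomputed, while running the action first makes the later unpersist free.

```python
# Toy model of caching and lineage: a count() on a dataset recomputes any
# non-cached parent in its lineage, mirroring the recomputation cost the
# report attributes to unpersisting itemFactors too early.
class Dataset:
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent
        self.cached = False
        self.computations = 0   # how many times this dataset was (re)built

    def persist(self):
        self.cached = True
        return self

    def unpersist(self):
        self.cached = False

    def count(self):
        # an action: materializes this dataset, rebuilding a parent
        # whenever that parent is no longer cached
        if self.parent is not None and not self.parent.cached:
            self.parent.computations += 1
        self.computations += 1

# Premature unpersist, as in the report: the parent is dropped before the
# action on the dependent dataset that still reads it.
item_factors = Dataset("itemFactors").persist()
item_ids = Dataset("itemIdAndFactors", parent=item_factors).persist()
item_factors.unpersist()
item_ids.count()
recompute_early = item_factors.computations   # 1: lineage recomputed

# Corrected ordering: run the action first, then unpersist the parent.
item_factors2 = Dataset("itemFactors").persist()
item_ids2 = Dataset("itemIdAndFactors", parent=item_factors2).persist()
item_ids2.count()
item_factors2.unpersist()
recompute_late = item_factors2.computations   # 0: no recomputation

print(recompute_early, recompute_late)
```

The same reasoning, run in reverse, covers the "lagging unpersist" half of the report: datasets the final action no longer needs can be unpersisted before it, freeing memory without triggering any recomputation.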
[jira] [Created] (SPARK-29850) sort-merge-join an empty table should not memory leak
Wenchen Fan created SPARK-29850: --- Summary: sort-merge-join an empty table should not memory leak Key: SPARK-29850 URL: https://issues.apache.org/jira/browse/SPARK-29850 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29844) Improper unpersist strategy in ml.recommendation.ASL.train
[ https://issues.apache.org/jira/browse/SPARK-29844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-29844: - Priority: Minor (was: Major) > Improper unpersist strategy in ml.recommendation.ASL.train > -- > > Key: SPARK-29844 > URL: https://issues.apache.org/jira/browse/SPARK-29844 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.3 >Reporter: Dong Wang >Priority: Minor > > In ml.recommendation.ASL.train(), there are many intermediate RDDs. At the > end of the method, these RDDs invoke unpersist(), but the timings of > unpersist is not right, which will cause recomputation and memory waste. > {code:scala} > val userIdAndFactors = userInBlocks > .mapValues(_.srcIds) > .join(userFactors) > .mapPartitions({ items => > items.flatMap { case (_, (ids, factors)) => > ids.view.zip(factors) > } > // Preserve the partitioning because IDs are consistent with the > partitioners in userInBlocks > // and userFactors. > }, preservesPartitioning = true) > .setName("userFactors") > .persist(finalRDDStorageLevel) // Missing unpersist, but hard to fix > val itemIdAndFactors = itemInBlocks > .mapValues(_.srcIds) > .join(itemFactors) > .mapPartitions({ items => > items.flatMap { case (_, (ids, factors)) => > ids.view.zip(factors) > } > }, preservesPartitioning = true) > .setName("itemFactors") > .persist(finalRDDStorageLevel) // Missing unpersist, but hard to fix > if (finalRDDStorageLevel != StorageLevel.NONE) { > userIdAndFactors.count() > itemFactors.unpersist() // Premature unpersist > itemIdAndFactors.count() > userInBlocks.unpersist() // Lagging unpersist > userOutBlocks.unpersist() // Lagging unpersist > itemInBlocks.unpersist() > itemOutBlocks.unpersist() // Lagging unpersist > blockRatings.unpersist() // Lagging unpersist > } > (userIdAndFactors, itemIdAndFactors) > } > {code} > 1. Unpersist itemFactors too early. itemIdAndFactors.count() will use > itemFactors. So itemFactors will be recomputed. 
> 2. Unpersist userInBlocks, userOutBlocks, itemOutBlocks, and blockRatings too > late. The final action - itemIdAndFactors.count() will not use these RDDs, so > these RDDs can be unpersisted before it to save memory. > By the way, itemIdAndFactors is persisted here but will never be unpersisted > util the application ends. It may hurts the performance, but I think it's > hard to fix. > This issue is reported by our tool CacheCheck, which is used to dynamically > detecting persist()/unpersist() api misuses. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29848) PostgreSQL dialect: cast to bigint
[ https://issues.apache.org/jira/browse/SPARK-29848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971580#comment-16971580 ] Rakesh Raushan commented on SPARK-29848: I will work on this one > PostgreSQL dialect: cast to bigint > -- > > Key: SPARK-29848 > URL: https://issues.apache.org/jira/browse/SPARK-29848 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > Spark: > 0: jdbc:hive2://10.18.19.208:23040/default> select CAST('0xcc' AS > bigint); > +---+ > | CAST(0xcc AS BIGINT) | > +---+ > | NULL | > +---+ > Postgre SQL > 22P02: invalid input syntax for integer: "0xcc" -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
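The behavioral gap in the report is between null-on-failure and error-on-failure casts. A plain-Python contrast (these helpers are illustrative, not Spark or PostgreSQL internals; note that a base-10 integer parse rejects the hex-looking literal in both cases):

```python
# Spark's CAST('0xcc' AS bigint) yields NULL on unparseable input; the
# PostgreSQL dialect should instead raise, matching the 22P02 error above.
def cast_bigint_spark_style(s):
    try:
        return int(s.strip())        # base-10 parse; "0xcc" is not valid
    except ValueError:
        return None                  # NULL, as in the Spark output above

def cast_bigint_pg_style(s):
    try:
        return int(s.strip())
    except ValueError:
        raise ValueError('invalid input syntax for integer: "%s"' % s)

print(cast_bigint_spark_style("0xcc"))  # None
print(cast_bigint_spark_style("204"))   # 204
```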
[jira] [Commented] (SPARK-29849) Spark trunc() func does not support for number group as PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-29849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971579#comment-16971579 ] Rakesh Raushan commented on SPARK-29849: I will work on this > Spark trunc() func does not support for number group as PostgreSQL > -- > > Key: SPARK-29849 > URL: https://issues.apache.org/jira/browse/SPARK-29849 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > PostgreSQL trunc() function accepts number group as below > SELECT trunc(1234567891.1234567891,4); > output > |1|1234567891,1234| > Spark does not accept > jdbc:hive2://10.18.19.208:23040/default> SELECT > trunc(1234567891.1234567891D,4); > Error: org.apache.spark.sql.AnalysisException: cannot resolve > 'trunc(1.2345678911234567E9D, CAST(4 AS STRING))' due to data type mismatch: > argument 1 requires date type, however, '1.2345678911234567E9D' is of double > type.; line 1 pos 7; > 'Project [unresolvedalias(trunc(1.2345678911234567E9, cast(4 as string)), > None)] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
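The requested semantics are PostgreSQL's two-argument trunc(numeric, scale): truncate toward zero at the given number of decimal places. A sketch of that behavior with Python's decimal module ({{trunc_numeric}} is an assumed helper name, not Spark code):

```python
from decimal import ROUND_DOWN, Decimal

# PostgreSQL-style trunc(numeric, scale): truncate toward zero at `scale`
# decimal places. ROUND_DOWN is decimal's truncate-toward-zero mode.
def trunc_numeric(value, scale=0):
    quantum = Decimal(1).scaleb(-scale)   # e.g. scale=4 -> Decimal('0.0001')
    return Decimal(str(value)).quantize(quantum, rounding=ROUND_DOWN)

print(trunc_numeric("1234567891.1234567891", 4))  # 1234567891.1234
```

This contrasts with Spark's current single-argument trunc(), which is date-only, hence the "argument 1 requires date type" analysis error in the report.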
[jira] [Commented] (SPARK-29849) Spark trunc() func does not support for number group as PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-29849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971574#comment-16971574 ] Aman Omer commented on SPARK-29849: --- I am checking this issue. > Spark trunc() func does not support for number group as PostgreSQL > -- > > Key: SPARK-29849 > URL: https://issues.apache.org/jira/browse/SPARK-29849 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > PostgreSQL trunc() function accepts number group as below > SELECT trunc(1234567891.1234567891,4); > output > |1|1234567891,1234| > Spark does not accept > jdbc:hive2://10.18.19.208:23040/default> SELECT > trunc(1234567891.1234567891D,4); > Error: org.apache.spark.sql.AnalysisException: cannot resolve > 'trunc(1.2345678911234567E9D, CAST(4 AS STRING))' due to data type mismatch: > argument 1 requires date type, however, '1.2345678911234567E9D' is of double > type.; line 1 pos 7; > 'Project [unresolvedalias(trunc(1.2345678911234567E9, cast(4 as string)), > None)] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-29849) Spark trunc() func does not support for number group as PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-29849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Omer updated SPARK-29849: -- Comment: was deleted (was: I am checking this issue.) > Spark trunc() func does not support for number group as PostgreSQL > -- > > Key: SPARK-29849 > URL: https://issues.apache.org/jira/browse/SPARK-29849 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > PostgreSQL trunc() function accepts number group as below > SELECT trunc(1234567891.1234567891,4); > output > |1|1234567891,1234| > Spark does not accept > jdbc:hive2://10.18.19.208:23040/default> SELECT > trunc(1234567891.1234567891D,4); > Error: org.apache.spark.sql.AnalysisException: cannot resolve > 'trunc(1.2345678911234567E9D, CAST(4 AS STRING))' due to data type mismatch: > argument 1 requires date type, however, '1.2345678911234567E9D' is of double > type.; line 1 pos 7; > 'Project [unresolvedalias(trunc(1.2345678911234567E9, cast(4 as string)), > None)] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29838) PostgreSQL dialect: cast to timestamp
[ https://issues.apache.org/jira/browse/SPARK-29838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971570#comment-16971570 ] wuyi commented on SPARK-29838: -- Hey guys, what's going on here? I see [~aman_omer] has commented on several sub-tasks saying you're going to do them, but I'm also planning to do these tasks. Maybe we can coordinate with each other to avoid duplicate work? > PostgreSQL dialect: cast to timestamp > - > > Key: SPARK-29838 > URL: https://issues.apache.org/jira/browse/SPARK-29838 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: jobit mathew >Priority: Major > > Make SparkSQL's cast-to-timestamp behavior consistent with PostgreSQL when > spark.sql.dialect is configured as PostgreSQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29849) Spark trunc() func does not support for number group as PostgreSQL
ABHISHEK KUMAR GUPTA created SPARK-29849: Summary: Spark trunc() func does not support for number group as PostgreSQL Key: SPARK-29849 URL: https://issues.apache.org/jira/browse/SPARK-29849 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: ABHISHEK KUMAR GUPTA PostgreSQL trunc() function accepts number group as below SELECT trunc(1234567891.1234567891,4); output |1|1234567891,1234| Spark does not accept jdbc:hive2://10.18.19.208:23040/default> SELECT trunc(1234567891.1234567891D,4); Error: org.apache.spark.sql.AnalysisException: cannot resolve 'trunc(1.2345678911234567E9D, CAST(4 AS STRING))' due to data type mismatch: argument 1 requires date type, however, '1.2345678911234567E9D' is of double type.; line 1 pos 7; 'Project [unresolvedalias(trunc(1.2345678911234567E9, cast(4 as string)), None)] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29844) Improper unpersist strategy in ml.recommendation.ASL.train
[ https://issues.apache.org/jira/browse/SPARK-29844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dong Wang updated SPARK-29844: -- Affects Version/s: (was: 3.0.0) 2.4.3 > Improper unpersist strategy in ml.recommendation.ASL.train > -- > > Key: SPARK-29844 > URL: https://issues.apache.org/jira/browse/SPARK-29844 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.3 >Reporter: Dong Wang >Priority: Major > > In ml.recommendation.ASL.train(), there are many intermediate RDDs. At the > end of the method, these RDDs invoke unpersist(), but the timings of > unpersist is not right, which will cause recomputation and memory waste. > {code:scala} > val userIdAndFactors = userInBlocks > .mapValues(_.srcIds) > .join(userFactors) > .mapPartitions({ items => > items.flatMap { case (_, (ids, factors)) => > ids.view.zip(factors) > } > // Preserve the partitioning because IDs are consistent with the > partitioners in userInBlocks > // and userFactors. > }, preservesPartitioning = true) > .setName("userFactors") > .persist(finalRDDStorageLevel) // Missing unpersist, but hard to fix > val itemIdAndFactors = itemInBlocks > .mapValues(_.srcIds) > .join(itemFactors) > .mapPartitions({ items => > items.flatMap { case (_, (ids, factors)) => > ids.view.zip(factors) > } > }, preservesPartitioning = true) > .setName("itemFactors") > .persist(finalRDDStorageLevel) // Missing unpersist, but hard to fix > if (finalRDDStorageLevel != StorageLevel.NONE) { > userIdAndFactors.count() > itemFactors.unpersist() // Premature unpersist > itemIdAndFactors.count() > userInBlocks.unpersist() // Lagging unpersist > userOutBlocks.unpersist() // Lagging unpersist > itemInBlocks.unpersist() > itemOutBlocks.unpersist() // Lagging unpersist > blockRatings.unpersist() // Lagging unpersist > } > (userIdAndFactors, itemIdAndFactors) > } > {code} > 1. Unpersist itemFactors too early. itemIdAndFactors.count() will use > itemFactors. 
So itemFactors will be recomputed. > 2. Unpersist userInBlocks, userOutBlocks, itemOutBlocks, and blockRatings too > late. The final action - itemIdAndFactors.count() will not use these RDDs, so > these RDDs can be unpersisted before it to save memory. > By the way, itemIdAndFactors is persisted here but will never be unpersisted > util the application ends. It may hurts the performance, but I think it's > hard to fix. > This issue is reported by our tool CacheCheck, which is used to dynamically > detecting persist()/unpersist() api misuses. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29822) Cast error when there are spaces between signs and values
[ https://issues.apache.org/jira/browse/SPARK-29822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29822: --- Assignee: Kent Yao > Cast error when there are spaces between signs and values > - > > Key: SPARK-29822 > URL: https://issues.apache.org/jira/browse/SPARK-29822 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > > With the latest string to literal optimization, some interval strings can not > be cast when there are some spaces between signs and unit values. > How to reproduce, > {code:java} > select cast(v as interval) from values ('+ 1 second') t(v); > select cast(v as interval) from values ('- 1 second') t(v); > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29822) Cast error when there are spaces between signs and values
[ https://issues.apache.org/jira/browse/SPARK-29822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29822. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26449 [https://github.com/apache/spark/pull/26449] > Cast error when there are spaces between signs and values > - > > Key: SPARK-29822 > URL: https://issues.apache.org/jira/browse/SPARK-29822 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > With the latest string to literal optimization, some interval strings can not > be cast when there are some spaces between signs and unit values. > How to reproduce, > {code:java} > select cast(v as interval) from values ('+ 1 second') t(v); > select cast(v as interval) from values ('- 1 second') t(v); > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
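The failing inputs differ from the accepted ones only by whitespace between the sign and the unit value. One way to sketch the tolerant parsing (a hypothetical normalization pass, not the actual code from PR 26449) is to collapse that whitespace before handing the string to the unit parser:

```python
import re

# Tolerate whitespace between a leading sign and the rest of the interval
# string: '+ 1 second' -> '+1 second'. Strings without a detached sign
# pass through unchanged.
_SIGN_RE = re.compile(r"^\s*([+-])\s+")

def normalize_interval(s):
    return _SIGN_RE.sub(lambda m: m.group(1), s)

print(normalize_interval("+ 1 second"))  # '+1 second'
print(normalize_interval("- 1 second"))  # '-1 second'
```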
[jira] [Created] (SPARK-29848) PostgreSQL dialect: cast to bigint
ABHISHEK KUMAR GUPTA created SPARK-29848: Summary: PostgreSQL dialect: cast to bigint Key: SPARK-29848 URL: https://issues.apache.org/jira/browse/SPARK-29848 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: ABHISHEK KUMAR GUPTA Spark: 0: jdbc:hive2://10.18.19.208:23040/default> select CAST('0xcc' AS bigint); +---+ | CAST(0xcc AS BIGINT) | +---+ | NULL | +---+ Postgre SQL 22P02: invalid input syntax for integer: "0xcc" -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29845) PostgreSQL dialect: cast to decimal
[ https://issues.apache.org/jira/browse/SPARK-29845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971516#comment-16971516 ] Rakesh Raushan commented on SPARK-29845: I will work on this > PostgreSQL dialect: cast to decimal > --- > > Key: SPARK-29845 > URL: https://issues.apache.org/jira/browse/SPARK-29845 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: jobit mathew >Priority: Minor > > Make SparkSQL's cast to decimal behavior be consistent with PostgreSQL when > spark.sql.dialect is configured as PostgreSQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29847) PostgreSQL dialect: cast to varchar
[ https://issues.apache.org/jira/browse/SPARK-29847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971513#comment-16971513 ] Aman Omer commented on SPARK-29847: --- I am checking this one. > PostgreSQL dialect: cast to varchar > --- > > Key: SPARK-29847 > URL: https://issues.apache.org/jira/browse/SPARK-29847 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > In Spark > jdbc:hive2://10.18.19.208:23040/default> select cast('10.345bb' as > varchar(10)); > +---+ > | CAST(10.345bb AS STRING) | > +---+ > | 10.345*bb* | > +---+ > > In PostgreSQL > select cast('10.345bb' as varchar(10)); > varchar varchar1 *10.345* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29846) PostgreSQL dialect: cast to char
[ https://issues.apache.org/jira/browse/SPARK-29846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971505#comment-16971505 ] Ankit Raj Boudh commented on SPARK-29846: - I will raise PR for this. > PostgreSQL dialect: cast to char > > > Key: SPARK-29846 > URL: https://issues.apache.org/jira/browse/SPARK-29846 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: jobit mathew >Priority: Minor > > Make SparkSQL's cast to char behavior be consistent with PostgreSQL when > spark.sql.dialect is configured as PostgreSQL. > {code:java} > spark-sql> select cast ('10.22333' as > char(5)); > 10.22333 > Time taken: 0.062 seconds, Fetched 1 row(s) > spark-sql> > {code} > *postgresql* > select cast ('10.22333' as char(5)); > > || ||bpchar|| > |1|10.22| -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29847) PostgreSQL dialect: cast to varchar
ABHISHEK KUMAR GUPTA created SPARK-29847: Summary: PostgreSQL dialect: cast to varchar Key: SPARK-29847 URL: https://issues.apache.org/jira/browse/SPARK-29847 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: ABHISHEK KUMAR GUPTA In Spark jdbc:hive2://10.18.19.208:23040/default> select cast('10.345bb' as varchar(10)); +---+ | CAST(10.345bb AS STRING) | +---+ | 10.345*bb* | +---+ In PostgreSQL select cast('10.345bb' as varchar(10)); varchar varchar1 *10.345* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29846) PostgreSQL dialect: cast to char
[ https://issues.apache.org/jira/browse/SPARK-29846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jobit mathew updated SPARK-29846: - Description: Make SparkSQL's cast to char behavior be consistent with PostgreSQL when spark.sql.dialect is configured as PostgreSQL. {code:java} spark-sql> select cast ('10.22333' as char(5)); 10.22333 Time taken: 0.062 seconds, Fetched 1 row(s) spark-sql> {code} *postgresql* select cast ('10.22333' as char(5)); || ||bpchar|| |1|10.22| was: Make SparkSQL's cast to char behavior be consistent with PostgreSQL when spark.sql.dialect is configured as PostgreSQL. {code:java} spark-sql> select cast ('10.22333' as char(5)); 10.22333 Time taken: 0.062 seconds, Fetched 1 row(s) spark-sql> *postgresql* select cast ('10.22333' as char(5)); bpchar 1 10.22 {code} > PostgreSQL dialect: cast to char > > > Key: SPARK-29846 > URL: https://issues.apache.org/jira/browse/SPARK-29846 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: jobit mathew >Priority: Minor > > Make SparkSQL's cast to char behavior be consistent with PostgreSQL when > spark.sql.dialect is configured as PostgreSQL. > {code:java} > spark-sql> select cast ('10.22333' as > char(5)); > 10.22333 > Time taken: 0.062 seconds, Fetched 1 row(s) > spark-sql> > {code} > *postgresql* > select cast ('10.22333' as char(5)); > > || ||bpchar|| > |1|10.22| -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29846) PostgreSQL dialect: cast to char
[ https://issues.apache.org/jira/browse/SPARK-29846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jobit mathew updated SPARK-29846: - Description: Make SparkSQL's cast to char behavior be consistent with PostgreSQL when spark.sql.dialect is configured as PostgreSQL. {code:java} spark-sql> select cast ('10.22333' as char(5)); 10.22333 Time taken: 0.062 seconds, Fetched 1 row(s) spark-sql> *postgresql* select cast ('10.22333' as char(5)); bpchar 1 10.22 {code} was: Make SparkSQL's cast to char behavior be consistent with PostgreSQL when spark.sql.dialect is configured as PostgreSQL. > PostgreSQL dialect: cast to char > > > Key: SPARK-29846 > URL: https://issues.apache.org/jira/browse/SPARK-29846 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: jobit mathew >Priority: Minor > > Make SparkSQL's cast to char behavior be consistent with PostgreSQL when > spark.sql.dialect is configured as PostgreSQL. > {code:java} > spark-sql> select cast ('10.22333' as > char(5)); > 10.22333 > Time taken: 0.062 seconds, Fetched 1 row(s) > spark-sql> > *postgresql* > select cast ('10.22333' as char(5)); > bpchar > 1 10.22 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29846) PostgreSQL dialect: cast to char
jobit mathew created SPARK-29846: Summary: PostgreSQL dialect: cast to char Key: SPARK-29846 URL: https://issues.apache.org/jira/browse/SPARK-29846 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: jobit mathew Make SparkSQL's cast-to-char behavior consistent with PostgreSQL's when spark.sql.dialect is configured as PostgreSQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
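The char(5) example above shows the gap this sub-task targets: PostgreSQL's bpchar truncates '10.22333' to five characters ('10.22'), while Spark (pre-dialect support) keeps the full string. A minimal Python sketch of the two semantics; the function names are mine, not from either system, and PostgreSQL details such as trailing-blank comparison rules are omitted:

```python
def pg_char_cast(value: str, n: int) -> str:
    """Model of PostgreSQL's cast to char(n) (bpchar): truncate to n
    characters and blank-pad values shorter than n."""
    return value[:n].ljust(n)

def spark_char_cast(value: str, n: int) -> str:
    """Model of Spark's legacy behavior: char(n) is treated as STRING,
    so the value is kept in full and n is ignored."""
    return value

# Mirrors the Jira example: PG yields '10.22', Spark yields '10.22333'
assert pg_char_cast('10.22333', 5) == '10.22'
assert spark_char_cast('10.22333', 5) == '10.22333'
```

The padding in `pg_char_cast` is why PostgreSQL's char(n) is called "blank-padded character"; truncation on cast is the part that differs visibly in the example above.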
[jira] [Created] (SPARK-29845) PostgreSQL dialect: cast to decimal
jobit mathew created SPARK-29845: Summary: PostgreSQL dialect: cast to decimal Key: SPARK-29845 URL: https://issues.apache.org/jira/browse/SPARK-29845 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: jobit mathew Make SparkSQL's cast-to-decimal behavior consistent with PostgreSQL's when spark.sql.dialect is configured as PostgreSQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29840) PostgreSQL dialect: cast to integer
[ https://issues.apache.org/jira/browse/SPARK-29840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971476#comment-16971476 ] Aman Omer commented on SPARK-29840: --- Working on this. > PostgreSQL dialect: cast to integer > --- > > Key: SPARK-29840 > URL: https://issues.apache.org/jira/browse/SPARK-29840 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: jobit mathew >Priority: Major > > Make SparkSQL's cast to integer behavior be consistent with PostgreSQL when > spark.sql.dialect is configured as PostgreSQL. > Example:*currently spark sql* > {code:java} > spark-sql> select CAST ('10C' AS INTEGER); > NULL > Time taken: 0.051 seconds, Fetched 1 row(s) > spark-sql> > {code} > *postgresql* > {code:java} > postgresql > select CAST ('10C' AS INTEGER); > Error(s), warning(s): > 22P02: invalid input syntax for integer: "10C" > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29842) PostgreSQL dialect: cast to double
[ https://issues.apache.org/jira/browse/SPARK-29842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971477#comment-16971477 ] Aman Omer commented on SPARK-29842: --- I will work on this > PostgreSQL dialect: cast to double > -- > > Key: SPARK-29842 > URL: https://issues.apache.org/jira/browse/SPARK-29842 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: jobit mathew >Priority: Major > > Make SparkSQL's cast to double behavior be consistent with PostgreSQL when > spark.sql.dialect is configured as PostgreSQL. > some examples > {code:java} > spark-sql> select CAST ('10.2' AS DOUBLE PRECISION); > Error in query: > extraneous input 'PRECISION' expecting ')'(line 1, pos 30) > == SQL == > select CAST ('10.2' AS DOUBLE PRECISION) > --^^^ > spark-sql> select CAST ('10.2' AS DOUBLE PRECISION); > Error in query: > extraneous input 'PRECISION' expecting ')'(line 1, pos 30) > == SQL == > select CAST ('10.2' AS DOUBLE PRECISION) > --^^^ > spark-sql> select CAST ('10.2' AS DOUBLE); > 10.2 > Time taken: 0.08 seconds, Fetched 1 row(s) > spark-sql> select CAST ('10.' AS DOUBLE); > 10. > Time taken: 0.08 seconds, Fetched 1 row(s) > spark-sql> select CAST ('ff' AS DOUBLE); > NULL > Time taken: 0.08 seconds, Fetched 1 row(s) > spark-sql> select CAST ('1' AS DOUBLE); > 1.1112E16 > Time taken: 0.067 seconds, Fetched 1 row(s) > spark-sql> > {code} > Postgresql > select CAST ('10.222' AS DOUBLE PRECISION); > select CAST ('1' AS DOUBLE PRECISION); > select CAST ('ff' AS DOUBLE PRECISION); > > > || ||float8|| > |1|10,222| > > || ||float8|| > |1|1,11E+16| > Error(s), warning(s): > 22P02: invalid input syntax for type double precision: "ff" > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29843) PostgreSQL dialect: cast to float
[ https://issues.apache.org/jira/browse/SPARK-29843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971478#comment-16971478 ] Aman Omer commented on SPARK-29843: --- I will work on this > PostgreSQL dialect: cast to float > - > > Key: SPARK-29843 > URL: https://issues.apache.org/jira/browse/SPARK-29843 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: jobit mathew >Priority: Major > > Make SparkSQL's cast to float behavior be consistent with PostgreSQL when > spark.sql.dialect is configured as PostgreSQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29844) Improper unpersist strategy in ml.recommendation.ALS.train
[ https://issues.apache.org/jira/browse/SPARK-29844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dong Wang updated SPARK-29844: -- Summary: Improper unpersist strategy in ml.recommendation.ALS.train (was: Wrong unpersist strategy in ml.recommendation.ALS.train)
> Improper unpersist strategy in ml.recommendation.ALS.train
> --
>
> Key: SPARK-29844
> URL: https://issues.apache.org/jira/browse/SPARK-29844
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 3.0.0
> Reporter: Dong Wang
> Priority: Major
>
> In ml.recommendation.ALS.train(), there are many intermediate RDDs. At the end of the method, these RDDs invoke unpersist(), but the timing of these unpersist() calls is not right, which causes recomputation and wastes memory.
> {code:scala}
> val userIdAndFactors = userInBlocks
>   .mapValues(_.srcIds)
>   .join(userFactors)
>   .mapPartitions({ items =>
>     items.flatMap { case (_, (ids, factors)) =>
>       ids.view.zip(factors)
>     }
>   // Preserve the partitioning because IDs are consistent with the partitioners in userInBlocks
>   // and userFactors.
>   }, preservesPartitioning = true)
>   .setName("userFactors")
>   .persist(finalRDDStorageLevel) // Missing unpersist, but hard to fix
> val itemIdAndFactors = itemInBlocks
>   .mapValues(_.srcIds)
>   .join(itemFactors)
>   .mapPartitions({ items =>
>     items.flatMap { case (_, (ids, factors)) =>
>       ids.view.zip(factors)
>     }
>   }, preservesPartitioning = true)
>   .setName("itemFactors")
>   .persist(finalRDDStorageLevel) // Missing unpersist, but hard to fix
> if (finalRDDStorageLevel != StorageLevel.NONE) {
>   userIdAndFactors.count()
>   itemFactors.unpersist() // Premature unpersist
>   itemIdAndFactors.count()
>   userInBlocks.unpersist() // Lagging unpersist
>   userOutBlocks.unpersist() // Lagging unpersist
>   itemInBlocks.unpersist()
>   itemOutBlocks.unpersist() // Lagging unpersist
>   blockRatings.unpersist() // Lagging unpersist
> }
> (userIdAndFactors, itemIdAndFactors)
> }
> {code}
> 1. itemFactors is unpersisted too early: itemIdAndFactors.count() still uses itemFactors, so itemFactors will be recomputed.
> 2. userInBlocks, userOutBlocks, itemOutBlocks, and blockRatings are unpersisted too late: the final action, itemIdAndFactors.count(), does not use these RDDs, so they could be unpersisted before it to save memory.
> By the way, itemIdAndFactors is persisted here but will never be unpersisted until the application ends. It may hurt performance, but I think it's hard to fix.
> This issue is reported by our tool CacheCheck, which is used to dynamically detect persist()/unpersist() API misuses.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29844) Wrong unpersist strategy in ml.recommendation.ALS.train
Dong Wang created SPARK-29844: - Summary: Wrong unpersist strategy in ml.recommendation.ALS.train Key: SPARK-29844 URL: https://issues.apache.org/jira/browse/SPARK-29844 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.0.0 Reporter: Dong Wang In ml.recommendation.ALS.train(), there are many intermediate RDDs. At the end of the method, these RDDs invoke unpersist(), but the timing of these unpersist() calls is not right, which causes recomputation and wastes memory.
{code:scala}
val userIdAndFactors = userInBlocks
  .mapValues(_.srcIds)
  .join(userFactors)
  .mapPartitions({ items =>
    items.flatMap { case (_, (ids, factors)) =>
      ids.view.zip(factors)
    }
  // Preserve the partitioning because IDs are consistent with the partitioners in userInBlocks
  // and userFactors.
  }, preservesPartitioning = true)
  .setName("userFactors")
  .persist(finalRDDStorageLevel) // Missing unpersist, but hard to fix
val itemIdAndFactors = itemInBlocks
  .mapValues(_.srcIds)
  .join(itemFactors)
  .mapPartitions({ items =>
    items.flatMap { case (_, (ids, factors)) =>
      ids.view.zip(factors)
    }
  }, preservesPartitioning = true)
  .setName("itemFactors")
  .persist(finalRDDStorageLevel) // Missing unpersist, but hard to fix
if (finalRDDStorageLevel != StorageLevel.NONE) {
  userIdAndFactors.count()
  itemFactors.unpersist() // Premature unpersist
  itemIdAndFactors.count()
  userInBlocks.unpersist() // Lagging unpersist
  userOutBlocks.unpersist() // Lagging unpersist
  itemInBlocks.unpersist()
  itemOutBlocks.unpersist() // Lagging unpersist
  blockRatings.unpersist() // Lagging unpersist
}
(userIdAndFactors, itemIdAndFactors)
}
{code}
1. itemFactors is unpersisted too early: itemIdAndFactors.count() still uses itemFactors, so itemFactors will be recomputed.
2. userInBlocks, userOutBlocks, itemOutBlocks, and blockRatings are unpersisted too late: the final action, itemIdAndFactors.count(), does not use these RDDs, so they could be unpersisted before it to save memory.
By the way, itemIdAndFactors is persisted here but will never be unpersisted until the application ends. It may hurt performance, but I think it's hard to fix. This issue is reported by our tool CacheCheck, which is used to dynamically detect persist()/unpersist() API misuses. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29843) PostgreSQL dialect: cast to float
jobit mathew created SPARK-29843: Summary: PostgreSQL dialect: cast to float Key: SPARK-29843 URL: https://issues.apache.org/jira/browse/SPARK-29843 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: jobit mathew Make SparkSQL's cast-to-float behavior consistent with PostgreSQL's when spark.sql.dialect is configured as PostgreSQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29842) PostgreSQL dialect: cast to double
jobit mathew created SPARK-29842: Summary: PostgreSQL dialect: cast to double Key: SPARK-29842 URL: https://issues.apache.org/jira/browse/SPARK-29842 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: jobit mathew Make SparkSQL's cast-to-double behavior consistent with PostgreSQL's when spark.sql.dialect is configured as PostgreSQL. Some examples:
{code:java}
spark-sql> select CAST ('10.2' AS DOUBLE PRECISION);
Error in query:
extraneous input 'PRECISION' expecting ')'(line 1, pos 30)
== SQL ==
select CAST ('10.2' AS DOUBLE PRECISION)
--^^^
spark-sql> select CAST ('10.2' AS DOUBLE);
10.2
Time taken: 0.08 seconds, Fetched 1 row(s)
spark-sql> select CAST ('10.' AS DOUBLE);
10.
Time taken: 0.08 seconds, Fetched 1 row(s)
spark-sql> select CAST ('ff' AS DOUBLE);
NULL
Time taken: 0.08 seconds, Fetched 1 row(s)
spark-sql> select CAST ('1' AS DOUBLE);
1.1112E16
Time taken: 0.067 seconds, Fetched 1 row(s)
spark-sql>
{code}
Postgresql
select CAST ('10.222' AS DOUBLE PRECISION);
select CAST ('1' AS DOUBLE PRECISION);
select CAST ('ff' AS DOUBLE PRECISION);
|| ||float8||
|1|10,222|
|| ||float8||
|1|1,11E+16|
Error(s), warning(s):
22P02: invalid input syntax for type double precision: "ff"
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29841) PostgreSQL dialect: cast to date
[ https://issues.apache.org/jira/browse/SPARK-29841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971470#comment-16971470 ] pavithra ramachandran commented on SPARK-29841: --- I will check.
> PostgreSQL dialect: cast to date
>
>
> Key: SPARK-29841
> URL: https://issues.apache.org/jira/browse/SPARK-29841
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: jobit mathew
> Priority: Minor
>
> Make SparkSQL's cast-to-date behavior consistent with PostgreSQL's when spark.sql.dialect is configured as PostgreSQL.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29841) PostgreSQL dialect: cast to date
jobit mathew created SPARK-29841: Summary: PostgreSQL dialect: cast to date Key: SPARK-29841 URL: https://issues.apache.org/jira/browse/SPARK-29841 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: jobit mathew Make SparkSQL's cast-to-date behavior consistent with PostgreSQL's when spark.sql.dialect is configured as PostgreSQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29840) PostgreSQL dialect: cast to integer
jobit mathew created SPARK-29840: Summary: PostgreSQL dialect: cast to integer Key: SPARK-29840 URL: https://issues.apache.org/jira/browse/SPARK-29840 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: jobit mathew Make SparkSQL's cast-to-integer behavior consistent with PostgreSQL's when spark.sql.dialect is configured as PostgreSQL. Example: *current Spark SQL*
{code:java}
spark-sql> select CAST ('10C' AS INTEGER);
NULL
Time taken: 0.051 seconds, Fetched 1 row(s)
spark-sql>
{code}
*postgresql*
{code:java}
postgresql
select CAST ('10C' AS INTEGER);
Error(s), warning(s):
22P02: invalid input syntax for integer: "10C"
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
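The core difference in the numeric-cast sub-tasks (integer above, and likewise double/float) is error handling: PostgreSQL rejects malformed input with SQLSTATE 22P02, while Spark's default cast silently yields NULL. A rough Python sketch with hypothetical helper names; PostgreSQL's real parser handles more forms (signs, whitespace variants) than `int()` does, so this is only a model of the strict-vs-lenient contrast:

```python
def pg_int_cast(value: str) -> int:
    # PostgreSQL-style strict cast: malformed input raises an error
    # (modeled on SQLSTATE 22P02, "invalid input syntax for integer")
    try:
        return int(value.strip())
    except ValueError:
        raise ValueError(f'invalid input syntax for integer: "{value}"')

def spark_int_cast(value: str):
    # Spark-style lenient cast: malformed input becomes NULL (None)
    try:
        return int(value.strip())
    except ValueError:
        return None

assert spark_int_cast('10C') is None   # Spark: NULL, as in the Jira example
assert pg_int_cast('42') == 42          # well-formed input casts normally
```

Under a PostgreSQL dialect flag, Spark would route through the strict variant instead of the lenient one; valid inputs behave identically either way.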
[jira] [Updated] (SPARK-29839) Supporting STORED AS in CREATE TABLE LIKE
[ https://issues.apache.org/jira/browse/SPARK-29839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-29839: --- Summary: Supporting STORED AS in CREATE TABLE LIKE (was: Support STORED AS in CREATE TABLE LIKE)
> Supporting STORED AS in CREATE TABLE LIKE
> -
>
> Key: SPARK-29839
> URL: https://issues.apache.org/jira/browse/SPARK-29839
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Lantao Jin
> Priority: Major
>
> In SPARK-29421, we can specify a different table provider for {{CREATE TABLE LIKE}} via {{USING provider}}.
> Hive supports specifying a new file format via the STORED AS syntax:
> {code}
> CREATE TABLE tbl(a int) STORED AS TEXTFILE;
> CREATE TABLE tbl2 LIKE tbl STORED AS PARQUET;
> {code}
> For Hive compatibility, we should also support {{STORED AS}} in {{CREATE TABLE LIKE}}.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29839) Support STORED AS in CREATE TABLE LIKE
Lantao Jin created SPARK-29839: -- Summary: Support STORED AS in CREATE TABLE LIKE Key: SPARK-29839 URL: https://issues.apache.org/jira/browse/SPARK-29839 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Lantao Jin In SPARK-29421, we can specify a different table provider for {{CREATE TABLE LIKE}} via {{USING provider}}. Hive supports specifying a new file format via the STORED AS syntax:
{code}
CREATE TABLE tbl(a int) STORED AS TEXTFILE;
CREATE TABLE tbl2 LIKE tbl STORED AS PARQUET;
{code}
For Hive compatibility, we should also support {{STORED AS}} in {{CREATE TABLE LIKE}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29775) Support truncate multiple tables
[ https://issues.apache.org/jira/browse/SPARK-29775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971448#comment-16971448 ] Rakesh Raushan commented on SPARK-29775: I will work on this.
> Support truncate multiple tables
>
>
> Key: SPARK-29775
> URL: https://issues.apache.org/jira/browse/SPARK-29775
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.4.4
> Reporter: jobit mathew
> Priority: Minor
>
> Spark SQL supports truncating a single table, e.g. TRUNCATE TABLE t1;
> PostgreSQL, however, supports truncating multiple tables in one statement, e.g. TRUNCATE bigtable, fattable;
> so Spark could also support truncating multiple tables.
> [https://www.postgresql.org/docs/12/sql-truncate.html]
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29838) PostgreSQL dialect: cast to timestamp
[ https://issues.apache.org/jira/browse/SPARK-29838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971433#comment-16971433 ] Aman Omer commented on SPARK-29838: --- Already working on this. Thanks [~jobitmathew] > PostgreSQL dialect: cast to timestamp > - > > Key: SPARK-29838 > URL: https://issues.apache.org/jira/browse/SPARK-29838 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: jobit mathew >Priority: Major > > Make SparkSQL's cast to timestamp behavior be consistent with PostgreSQL when > spark.sql.dialect is configured as PostgreSQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29838) PostgreSQL dialect: cast to timestamp
jobit mathew created SPARK-29838: Summary: PostgreSQL dialect: cast to timestamp Key: SPARK-29838 URL: https://issues.apache.org/jira/browse/SPARK-29838 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: jobit mathew Make SparkSQL's cast-to-timestamp behavior consistent with PostgreSQL's when spark.sql.dialect is configured as PostgreSQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29837) PostgreSQL dialect: cast to boolean
[ https://issues.apache.org/jira/browse/SPARK-29837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-29837: - Parent: SPARK-29836 Issue Type: Sub-task (was: Task) > PostgreSQL dialect: cast to boolean > --- > > Key: SPARK-29837 > URL: https://issues.apache.org/jira/browse/SPARK-29837 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Priority: Major > > Make SparkSQL's *cast to boolean* behavior be consistent with PostgreSQL when > spark.sql.dialect is configured as PostgreSQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29837) PostgreSQL dialect: cast to boolean
wuyi created SPARK-29837: Summary: PostgreSQL dialect: cast to boolean Key: SPARK-29837 URL: https://issues.apache.org/jira/browse/SPARK-29837 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.0.0 Reporter: wuyi Make SparkSQL's *cast to boolean* behavior consistent with PostgreSQL's when spark.sql.dialect is configured as PostgreSQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
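For the boolean sub-task, the gap is again strict versus lenient parsing: PostgreSQL accepts a fixed set of literals ('t'/'true'/'yes'/'on'/'1' and their false counterparts, case-insensitively; it also accepts unambiguous prefixes, which this sketch omits) and errors on anything else, while Spark's default cast maps unrecognized strings to NULL. A hypothetical Python model, not code from either project:

```python
PG_TRUE = {'t', 'true', 'y', 'yes', 'on', '1'}
PG_FALSE = {'f', 'false', 'n', 'no', 'off', '0'}

def pg_boolean_cast(value: str) -> bool:
    # PostgreSQL-style strict cast: unrecognized input raises an error
    s = value.strip().lower()
    if s in PG_TRUE:
        return True
    if s in PG_FALSE:
        return False
    raise ValueError(f'invalid input syntax for type boolean: "{value}"')

def spark_boolean_cast(value: str):
    # Spark-style lenient cast: unrecognized input becomes NULL (None)
    s = value.strip().lower()
    if s in ('t', 'true', 'y', 'yes', '1'):
        return True
    if s in ('f', 'false', 'n', 'no', '0'):
        return False
    return None

assert pg_boolean_cast('Yes') is True
assert spark_boolean_cast('maybe') is None  # Spark: NULL; PostgreSQL: error
```

Note that 'on'/'off' are accepted by PostgreSQL but not by Spark's cast, so the dialect work involves both the error behavior and the accepted literal set.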
[jira] [Created] (SPARK-29836) PostgreSQL dialect: cast
wuyi created SPARK-29836: Summary: PostgreSQL dialect: cast Key: SPARK-29836 URL: https://issues.apache.org/jira/browse/SPARK-29836 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: wuyi SparkSQL and PostgreSQL differ in a lot of their default cast behavior between types. We should make SparkSQL's cast behavior consistent with PostgreSQL's when spark.sql.dialect is configured as PostgreSQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org