[jira] [Commented] (SPARK-31095) Upgrade netty-all to 4.1.47.Final
[ https://issues.apache.org/jira/browse/SPARK-31095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147241#comment-17147241 ] Xiaochen Ouyang commented on SPARK-31095: - Hello [~dongjoon], can the netty-all upgrade resolve the CVE-2020-9480 security vulnerability mentioned on the Spark official website? Thanks! > Upgrade netty-all to 4.1.47.Final > - > > Key: SPARK-31095 > URL: https://issues.apache.org/jira/browse/SPARK-31095 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.5, 3.0.0, 3.1.0 >Reporter: Vishwas Vijaya Kumar >Assignee: Dongjoon Hyun >Priority: Major > Labels: security > Fix For: 2.4.6, 3.0.0 > > > Upgrade version of io.netty_netty-all to 4.1.44.Final > [CVE-2019-20445|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-20445] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32112) Easier way to repartition/coalesce DataFrames based on the number of parallel tasks that Spark can process at the same time
[ https://issues.apache.org/jira/browse/SPARK-32112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Noritaka Sekiyama updated SPARK-32112: -- Description: Repartitioning/coalescing is very important for optimizing a Spark application's performance; however, many users struggle with determining the number of partitions. This issue is to add an easier way to repartition/coalesce DataFrames based on the number of parallel tasks that Spark can process at the same time. It will help Spark users determine the optimal number of partitions. Expected use-cases: - repartition with the calculated parallel tasks Notes: - `SparkContext.maxNumConcurrentTasks` might help, but it cannot be accessed by Spark apps. - `SparkContext.getExecutorMemoryStatus` might help to calculate the number of available slots to process tasks. was: Repartitioning/coalescing is very important for optimizing a Spark application's performance; however, many users struggle with determining the number of partitions. This issue is to add an easier way to repartition/coalesce DataFrames based on the number of parallel tasks that Spark can process at the same time. It will help Spark users determine the optimal number of partitions. Expected use-cases: - repartition with the calculated parallel tasks There is `SparkContext.maxNumConcurrentTasks`, but it cannot be accessed by Spark apps. > Easier way to repartition/coalesce DataFrames based on the number of parallel > tasks that Spark can process at the same time > --- > > Key: SPARK-32112 > URL: https://issues.apache.org/jira/browse/SPARK-32112 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Noritaka Sekiyama >Priority: Major > > Repartitioning/coalescing is very important for optimizing a Spark application's > performance; however, many users struggle with determining the > number of partitions. 
> This issue is to add an easier way to repartition/coalesce DataFrames based > on the number of parallel tasks that Spark can process at the same time. > It will help Spark users determine the optimal number of partitions. > Expected use-cases: > - repartition with the calculated parallel tasks > Notes: > - `SparkContext.maxNumConcurrentTasks` might help, but it cannot be accessed > by Spark apps. > - `SparkContext.getExecutorMemoryStatus` might help to calculate the number > of available slots to process tasks.
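The heuristic this issue asks for amounts to: estimate the cluster's concurrent task slots (executors × cores per executor), then size partitions at a small multiple of that. A minimal sketch under those assumptions — the function name and its `factor` parameter are hypothetical, and in a real application the inputs would come from the cluster manager or `sc.defaultParallelism`:

```python
def target_num_partitions(num_executors, cores_per_executor, factor=2):
    """Rough partition count: a small multiple of the cluster's task slots.

    factor > 1 leaves headroom so a few straggler tasks do not idle whole cores.
    """
    slots = num_executors * cores_per_executor  # tasks Spark can run at once
    return slots * factor

# e.g. 10 executors x 4 cores each, 2 "waves" of tasks:
# df.repartition(target_num_partitions(10, 4))  # -> 80 partitions
```

This is only the arithmetic the ticket wants Spark to expose directly; the hard part the ticket identifies is that the slot count is not visible to applications today.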
[jira] [Updated] (SPARK-32060) Huber loss Convergence
[ https://issues.apache.org/jira/browse/SPARK-32060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-32060: - Attachment: huber.xlsx > Huber loss Convergence > -- > > Key: SPARK-32060 > URL: https://issues.apache.org/jira/browse/SPARK-32060 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Minor > Attachments: huber.xlsx > > > |performance test in https://issues.apache.org/jira/browse/SPARK-31783, > the Huber loss seems to start diverging after 50 iterations. > {code:java} > for (size <- Seq(1, 4, 16, 64); iter <- Seq(10, 50, 100)) { > Thread.sleep(1) > val hlir = new > LinearRegression().setLoss("huber").setSolver("l-bfgs").setMaxIter(iter).setTol(0) > val start = System.currentTimeMillis > val model = hlir.setBlockSize(size).fit(df) > val end = System.currentTimeMillis > println((model.uid, size, iter, end - start, > model.summary.objectiveHistory.last, model.summary.totalIterations, > model.coefficients.toString.take(100))) > }{code}| > |result:| > |blockSize=1| > |(linReg_887d29a0b42b,1,10,34222,12.600287516874573,11,[-1.128806276706593,8.677674008637235,9.388511222747894,8.55780534824698,34.241366265505654,26.96490)| > |(linReg_fa87d52d3e2f,1,50,134017,1.7265674039265724,51,[-1.2409375311919224,-0.36565818648554393,1.0271741000977583,-0.5264376930209739,-1.544463380879014,)| > |(linReg_b2a07f6fa653,1,100,259137,0.7519335552972538,101,[-0.3821288691282684,0.22040814987367136,0.07747613675383101,0.16130205219214436,1.2347926613828966,)| > blockSize=4| > |(linReg_779f6890aee9,4,10,7241,12.600287516879131,11,[-1.128806276706101,8.677674008649985,9.38851122275203,8.557805348259139,34.241366265511715,26.96490)| > |(linReg_0e6d961e054f,4,50,11691,1.726567383577527,51,[-1.2409376473684588,-0.3656580427637058,1.0271741488856692,-0.5264377459728347,-1.5444635623477996,)| > 
|(linReg_1e12fafab7d2,4,100,17966,0.796858465032771,101,[-0.014663920062692357,-0.057216366204118345,0.1764582527782608,0.12141286532514688,1.58266258533765)| > blockSize=16| > |(linReg_5ad195c843bb,16,10,7338,12.600287516896273,11,[-1.1288062767576779,8.677674008672964,9.388511222753797,8.557805348281347,34.24136626552257,26.9649)| > |(linReg_686fe7849c42,16,50,12093,1.7265673762478049,51,[-1.2409376965631724,-0.3656579898205299,1.0271741857198382,-0.5264377659307408,-1.5444636325154564,)| > |(linReg_cc934209aac1,16,100,18253,0.7844992170383625,101,[-0.4230952901291041,0.08770018558785676,0.2719402480140563,0.08602481376955884,0.8763149744964053,-)| > blockSize=64| > |(linReg_2de48672cf40,64,10,7956,12.600287516883563,11,[-1.1288062767198885,8.677674008655007,9.388511222751507,8.557805348264019,34.24136626551386,26.9649)| > |(linReg_a4ed072bdf00,64,50,14423,1.7265674032944005,51,[-1.240937585330031,-0.36565823041213286,1.02717419529322,-0.5264376482700692,-1.5444634018412484,0.)| > |(linReg_ed9bf8e6db3d,64,100,22680,0.7508904951409897,101,[-0.39923222418441695,0.2591603128603928,0.025707538173424214,0.06178131424518882,1.3651702157456522)|
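The thread above is about the convergence of the Huber objective that `setLoss("huber")` selects: quadratic for small residuals, linear beyond a threshold, which is what changes its behavior relative to plain least squares late in optimization. A standalone sketch of the per-sample loss (pure Python, not Spark's implementation; the 1.35 default here mirrors what I believe is `LinearRegression`'s `epsilon` default):

```python
def huber_loss(residual, delta=1.35):
    """Huber loss: quadratic for |r| <= delta, linear (slope delta) beyond.

    The two branches meet smoothly at |r| == delta, so the loss and its
    gradient are continuous there.
    """
    r = abs(residual)
    if r <= delta:
        return 0.5 * r * r           # least-squares regime
    return delta * (r - 0.5 * delta)  # linear tail, robust to outliers
```

Because the tail is linear, the gradient magnitude saturates at `delta` for large residuals, which can make L-BFGS progress differently than on a purely quadratic objective.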
[jira] [Updated] (SPARK-32060) Huber loss Convergence
[ https://issues.apache.org/jira/browse/SPARK-32060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-32060: - Attachment: (was: huber.xlsx)
[jira] [Updated] (SPARK-32060) Huber loss Convergence
[ https://issues.apache.org/jira/browse/SPARK-32060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-32060: - Description: |performance test in https://issues.apache.org/jira/browse/SPARK-31783, the Huber loss seems to start diverging after 70 iterations. {code:java} for (size <- Seq(1, 4, 16, 64); iter <- Seq(10, 50, 100)) { Thread.sleep(1) val hlir = new LinearRegression().setLoss("huber").setSolver("l-bfgs").setMaxIter(iter).setTol(0) val start = System.currentTimeMillis val model = hlir.setBlockSize(size).fit(df) val end = System.currentTimeMillis println((model.uid, size, iter, end - start, model.summary.objectiveHistory.last, model.summary.totalIterations, model.coefficients.toString.take(100))) }{code}|
[jira] [Commented] (SPARK-32060) Huber loss Convergence
[ https://issues.apache.org/jira/browse/SPARK-32060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147261#comment-17147261 ] zhengruifeng commented on SPARK-32060: -- {code:java} import org.apache.spark.ml.regression._ import org.apache.spark.storage.StorageLevel val df = spark.read.option("numFeatures", "2000").format("libsvm").load("/data1/Datasets/epsilon/epsilon_normalized.t") df.persist(StorageLevel.MEMORY_AND_DISK) df.count val lir = new LinearRegression().setMaxIter(200).setSolver("l-bfgs").setLoss("huber") val results = Seq(1, 4, 16, 64, 256, 1024, 4096).map { size => val start = System.currentTimeMillis; val model = lir.setBlockSize(size).fit(df); val end = System.currentTimeMillis; (size, model, end - start) } {code} model coef: {code:java} scala> results.map(_._2.coefficients).foreach(coef => println(coef.toString.take(200))) [-0.1609083025667508,-0.1504208122473649,0.7857316265190127,0.1905294278240982,0.48613646504894936,-0.026194861709278365,0.590635887747112,0.03185142111622796,8.347531055523673,0.05032008235983659,0.0 [-0.14168611353422972,-0.09988761525554064,0.5465392380563737,0.1948729061499901,0.4763355879043651,-0.3012279914216939,0.6313906259537879,0.09533675545276975,10.461020810672274,0.15677230833505942,-0 [0.0129107378236514,-0.023733643262643805,0.7206248421409548,0.1281202961920889,0.6331850100541732,-0.07297545577093478,0.7943888663518902,0.1345404102446435,10.426743282094897,0.022989137878464405,0. [0.030744371107965504,-0.18953315635218193,0.7474602191912736,0.1759290649344934,0.48334851886329333,-0.18612454543317197,0.623576899875435,0.10960148194302292,9.305819813630439,0.07680152463656026,-0 [0.06489015002773292,-0.2013517907421197,0.7090030134636589,0.05515361023479412,0.3904484093136326,0.11987256805921637,0.550217950324033,0.0557189628809737,7.24524505892832,-0.09041629158543917,0.0809 [-0.18300047132898184,-0.21732260127922864,0.8444018472270687,0.10275527109275327,0.07750772677176482,0.2282620884662859,0.5299055708518087,0.07284146396600312,7.7820378386877245,-0.014623101293592242 [-0.09575146808314546,-0.2307269364289983,0.8121553524047764,0.14527766692142594,0.4327749717709629,-0.024082387632074886,0.6239466285761414,0.03986689640912914,7.6761329131634435,-0.0369776197065 {code} objectiveHistory is also attached
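The Scala snippet in the comment above is a simple timing harness: persist the data once, then fit once per candidate block size and record wall-clock time. The same pattern, stripped of Spark, is just a timed loop over configurations (the `time_fits` name and the stand-in `fit` callable are illustrative):

```python
import time

def time_fits(fit, block_sizes):
    """Time fit(size) once per candidate block size, like the Scala loop above."""
    results = []
    for size in block_sizes:
        start = time.perf_counter()
        model = fit(size)  # stand-in for lir.setBlockSize(size).fit(df)
        elapsed_ms = int((time.perf_counter() - start) * 1000)
        results.append((size, model, elapsed_ms))
    return results
```

Persisting the input before the loop matters in the real Spark version: without it, each `fit` would re-read and re-parse the libsvm file, and the per-block-size timings would mostly measure I/O.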
[jira] [Updated] (SPARK-32060) Huber loss Convergence
[ https://issues.apache.org/jira/browse/SPARK-32060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-32060: - Attachment: huber.xlsx
[jira] [Comment Edited] (SPARK-32060) Huber loss Convergence
[ https://issues.apache.org/jira/browse/SPARK-32060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147261#comment-17147261 ] zhengruifeng edited comment on SPARK-32060 at 6/28/20, 8:34 AM
[jira] [Comment Edited] (SPARK-32060) Huber loss Convergence
[ https://issues.apache.org/jira/browse/SPARK-32060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147261#comment-17147261 ] zhengruifeng edited comment on SPARK-32060 at 6/28/20, 8:34 AM
[jira] [Commented] (SPARK-32108) Silent mode of spark-sql is broken
[ https://issues.apache.org/jira/browse/SPARK-32108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147264#comment-17147264 ] Lantao Jin commented on SPARK-32108: [~maxgekk] I think it works. The INFO logs are only printed while spark-sql is starting. > Silent mode of spark-sql is broken > -- > > Key: SPARK-32108 > URL: https://issues.apache.org/jira/browse/SPARK-32108 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > 1. I downloaded the recent release Spark 3.0 from > http://spark.apache.org/downloads.html > 2. Run bin/spark-sql -S; it prints a lot of INFO > {code} > ➜ ~ ./spark-3.0/bin/spark-sql -S > 20/06/26 20:43:38 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > log4j:WARN No appenders could be found for logger > (org.apache.hadoop.hive.conf.HiveConf). > log4j:WARN Please initialize the log4j system properly. > log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more > info. > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > 20/06/26 20:43:39 INFO SharedState: spark.sql.warehouse.dir is not set, but > hive.metastore.warehouse.dir is set. Setting spark.sql.warehouse.dir to the > value of hive.metastore.warehouse.dir ('/user/hive/warehouse'). > 20/06/26 20:43:39 INFO SharedState: Warehouse path is '/user/hive/warehouse'. 
> 20/06/26 20:43:39 INFO SessionState: Created HDFS directory: > /tmp/hive/maximgekk/a47e882c-86a3-42b9-b43f-9dab0dd8492a > 20/06/26 20:43:39 INFO SessionState: Created local directory: > /var/folders/p3/dfs6mf655d7fnjrsjvldh0tcgn/T/maximgekk/a47e882c-86a3-42b9-b43f-9dab0dd8492a > 20/06/26 20:43:39 INFO SessionState: Created HDFS directory: > /tmp/hive/maximgekk/a47e882c-86a3-42b9-b43f-9dab0dd8492a/_tmp_space.db > 20/06/26 20:43:39 INFO SparkContext: Running Spark version 3.0.0 > 20/06/26 20:43:39 INFO ResourceUtils: > == > 20/06/26 20:43:39 INFO ResourceUtils: Resources for spark.driver: > 20/06/26 20:43:39 INFO ResourceUtils: > == > 20/06/26 20:43:39 INFO SparkContext: Submitted application: > SparkSQL::192.168.1.78 > 20/06/26 20:43:39 INFO SecurityManager: Changing view acls to: maximgekk > 20/06/26 20:43:39 INFO SecurityManager: Changing modify acls to: maximgekk > 20/06/26 20:43:39 INFO SecurityManager: Changing view acls groups to: > 20/06/26 20:43:39 INFO SecurityManager: Changing modify acls groups to: > 20/06/26 20:43:39 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(maximgekk); > groups with view permissions: Set(); users with modify permissions: > Set(maximgekk); groups with modify permissions: Set() > 20/06/26 20:43:39 INFO Utils: Successfully started service 'sparkDriver' on > port 59414. 
> 20/06/26 20:43:39 INFO SparkEnv: Registering MapOutputTracker > 20/06/26 20:43:39 INFO SparkEnv: Registering BlockManagerMaster > 20/06/26 20:43:39 INFO BlockManagerMasterEndpoint: Using > org.apache.spark.storage.DefaultTopologyMapper for getting topology > information > 20/06/26 20:43:39 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint > up > 20/06/26 20:43:39 INFO SparkEnv: Registering BlockManagerMasterHeartbeat > 20/06/26 20:43:39 INFO DiskBlockManager: Created local directory at > /private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tcgn/T/blockmgr-c1d041ad-dd46-4d11-bbd0-e8ba27d3bf69 > 20/06/26 20:43:39 INFO MemoryStore: MemoryStore started with capacity 408.9 > MiB > 20/06/26 20:43:39 INFO SparkEnv: Registering OutputCommitCoordinator > 20/06/26 20:43:40 INFO Utils: Successfully started service 'SparkUI' on port > 4040. > 20/06/26 20:43:40 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at > http://192.168.1.78:4040 > 20/06/26 20:43:40 INFO Executor: Starting executor ID driver on host > 192.168.1.78 > 20/06/26 20:43:40 INFO Utils: Successfully started service > 'org.apache.spark.network.netty.NettyBlockTransferService' on port 59415. > 20/06/26 20:43:40 INFO NettyBlockTransferService: Server created on > 192.168.1.78:59415 > 20/06/26 20:43:40 INFO BlockManager: Using > org.apache.spark.storage.RandomBlockReplicationPolicy for block replication > policy > 20/06/26 20:43:40 INFO BlockManagerMaster: Registering BlockManager > BlockManagerId(driver, 192.168.1.78, 59415, None) > 20/06/26 20:43:40 INFO BlockManagerMasterEndpoint: Registering block manager > 192.168.1.78:59415 with 408.9 MiB RAM, BlockManagerId(driver, 192.168.1.78, > 59415, None) > 20/06/26 20:43:40 INFO BlockManagerMaster: Registered BlockManager > BlockManagerId(driver, 192.168.1.
[jira] [Commented] (SPARK-31851) Redesign PySpark documentation
[ https://issues.apache.org/jira/browse/SPARK-31851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147271#comment-17147271 ] Manish Khobragade commented on SPARK-31851: --- I would also like to help with this. > Redesign PySpark documentation > -- > > Key: SPARK-31851 > URL: https://issues.apache.org/jira/browse/SPARK-31851 > Project: Spark > Issue Type: Umbrella > Components: ML, PySpark, Spark Core, SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > > Currently, PySpark documentation > (https://spark.apache.org/docs/latest/api/python/index.html) is quite > poorly written compared to other projects. > See, for example, Koalas https://koalas.readthedocs.io/en/latest/ as an > example. > PySpark is becoming more and more important in Spark, and we should improve this > documentation so people can easily follow it. > Reference: > - https://koalas.readthedocs.io/en/latest/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32117) Thread spark-listener-group-streams is cpu costing
Lantao Jin created SPARK-32117: -- Summary: Thread spark-listener-group-streams is cpu costing Key: SPARK-32117 URL: https://issues.apache.org/jira/browse/SPARK-32117 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Lantao Jin In a busy driver (OLAP), the thread spark-listener-group-streams is costing CPU even in a non-streaming application.
[jira] [Commented] (SPARK-32117) Thread spark-listener-group-streams is cpu costing
[ https://issues.apache.org/jira/browse/SPARK-32117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147278#comment-17147278 ] Lantao Jin commented on SPARK-32117: I think it might be fixed by SPARK-29423 > Thread spark-listener-group-streams is cpu costing > -- > > Key: SPARK-32117 > URL: https://issues.apache.org/jira/browse/SPARK-32117 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > In a busy driver (OLAP), the thread spark-listener-group-streams is costing CPU > even in a non-streaming application.
[jira] [Resolved] (SPARK-32117) Thread spark-listener-group-streams is cpu costing
[ https://issues.apache.org/jira/browse/SPARK-32117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin resolved SPARK-32117. Resolution: Won't Fix > Thread spark-listener-group-streams is cpu costing > -- > > Key: SPARK-32117 > URL: https://issues.apache.org/jira/browse/SPARK-32117 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > In a busy driver (OLAP), the thread spark-listener-group-streams is costing CPU > even in a non-streaming application.
[jira] [Updated] (SPARK-32060) Huber loss Convergence
[ https://issues.apache.org/jira/browse/SPARK-32060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-32060: - Attachment: huber.xlsx > Huber loss Convergence > -- > > Key: SPARK-32060 > URL: https://issues.apache.org/jira/browse/SPARK-32060 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Minor > Attachments: huber.xlsx > > > |performance test in https://issues.apache.org/jira/browse/SPARK-31783, > Huber loss seems to start diverging after 70 iterations. > {code:java} > for (size <- Seq(1, 4, 16, 64); iter <- Seq(10, 50, 100)) { > Thread.sleep(1) > val hlir = new > LinearRegression().setLoss("huber").setSolver("l-bfgs").setMaxIter(iter).setTol(0) > val start = System.currentTimeMillis > val model = hlir.setBlockSize(size).fit(df) > val end = System.currentTimeMillis > println((model.uid, size, iter, end - start, > model.summary.objectiveHistory.last, model.summary.totalIterations, > model.coefficients.toString.take(100))) > }{code}| > |result:| > |blockSize=1| > |(linReg_887d29a0b42b,1,10,34222,12.600287516874573,11,[-1.128806276706593,8.677674008637235,9.388511222747894,8.55780534824698,34.241366265505654,26.96490)| > |(linReg_fa87d52d3e2f,1,50,134017,1.7265674039265724,51,[-1.2409375311919224,-0.36565818648554393,1.0271741000977583,-0.5264376930209739,-1.544463380879014,)| > |(linReg_b2a07f6fa653,1,100,259137,0.7519335552972538,101,[-0.3821288691282684,0.22040814987367136,0.07747613675383101,0.16130205219214436,1.2347926613828966,)| > blockSize=4| > |(linReg_779f6890aee9,4,10,7241,12.600287516879131,11,[-1.128806276706101,8.677674008649985,9.38851122275203,8.557805348259139,34.241366265511715,26.96490)| > |(linReg_0e6d961e054f,4,50,11691,1.726567383577527,51,[-1.2409376473684588,-0.3656580427637058,1.0271741488856692,-0.5264377459728347,-1.5444635623477996,)| > |(linReg_1e12fafab7d2,4,100,17966,0.796858465032771,101,[-0.014663920062692357,-0.057216366204118345,0.1764582527782608,0.12141286532514688,1.58266258533765)| > blockSize=16| > |(linReg_5ad195c843bb,16,10,7338,12.600287516896273,11,[-1.1288062767576779,8.677674008672964,9.388511222753797,8.557805348281347,34.24136626552257,26.9649)| > |(linReg_686fe7849c42,16,50,12093,1.7265673762478049,51,[-1.2409376965631724,-0.3656579898205299,1.0271741857198382,-0.5264377659307408,-1.5444636325154564,)| > |(linReg_cc934209aac1,16,100,18253,0.7844992170383625,101,[-0.4230952901291041,0.08770018558785676,0.2719402480140563,0.08602481376955884,0.8763149744964053,-)| > blockSize=64| > |(linReg_2de48672cf40,64,10,7956,12.600287516883563,11,[-1.1288062767198885,8.677674008655007,9.388511222751507,8.557805348264019,34.24136626551386,26.9649)| > |(linReg_a4ed072bdf00,64,50,14423,1.7265674032944005,51,[-1.240937585330031,-0.36565823041213286,1.02717419529322,-0.5264376482700692,-1.5444634018412484,0.)| > |(linReg_ed9bf8e6db3d,64,100,22680,0.7508904951409897,101,[-0.39923222418441695,0.2591603128603928,0.025707538173424214,0.06178131424518882,1.3651702157456522)|
[jira] [Updated] (SPARK-32060) Huber loss Convergence
[ https://issues.apache.org/jira/browse/SPARK-32060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-32060: - Attachment: (was: huber.xlsx)
[jira] [Updated] (SPARK-32060) Huber loss Convergence
[ https://issues.apache.org/jira/browse/SPARK-32060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-32060: - Attachment: image-2020-06-28-18-05-28-867.png
[jira] [Commented] (SPARK-32060) Huber loss Convergence
[ https://issues.apache.org/jira/browse/SPARK-32060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147295#comment-17147295 ] zhengruifeng commented on SPARK-32060: -- According to the convergence curves of different blockSize, the objective value starts to diverge at iter=70, but finally converges to 0.67~0.68 at iter=200; As to the solutions, the coefficients look different. refer to [https://en.wikipedia.org/wiki/Least_absolute_deviations:] *L1-Loss is robust, but not stable (possibly multiple solutions); L2-Loss is not very robust, but is stable (always one solution)* Huber is a mix of both L1-Loss and L2-Loss: at each iteration, some instances are used with L1-Loss, while others with L2-Loss. So I personally think Huber is between L1-Loss and L2-Loss, and there may be multiple solutions in Huber regression. ping [~weichenxu123] > Huber loss Convergence > -- > > Key: SPARK-32060 > URL: https://issues.apache.org/jira/browse/SPARK-32060 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Minor > Attachments: huber.xlsx, image-2020-06-28-18-05-28-867.png
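For context on the L1/L2 mix described in the comment above, here is the textbook Huber objective in a few lines of plain Scala (the threshold is Spark's `epsilon` parameter, default 1.35; this is the standard formula, not Spark's internal implementation):

```scala
// Huber loss for a single residual r: quadratic (L2-like) near zero,
// linear (L1-like) in the tails, joined smoothly at |r| = epsilon.
def huberLoss(r: Double, epsilon: Double = 1.35): Double =
  if (math.abs(r) <= epsilon) 0.5 * r * r
  else epsilon * (math.abs(r) - 0.5 * epsilon)
```

Because the tails are linear, the objective is convex but not strictly convex away from zero, which fits the observation that different blockSize runs can land on different coefficient vectors with nearly the same objective value.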
[jira] [Commented] (SPARK-32018) Fix UnsafeRow set overflowed decimal
[ https://issues.apache.org/jira/browse/SPARK-32018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147313#comment-17147313 ] angerszhu commented on SPARK-32018: --- [~allisonwang-db] Can you show a test case to reproduce this? > Fix UnsafeRow set overflowed decimal > > > Key: SPARK-32018 > URL: https://issues.apache.org/jira/browse/SPARK-32018 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6, 3.0.0 >Reporter: Allison Wang >Priority: Major > > There is a bug that writing an overflowed decimal into UnsafeRow is fine but > reading it out will throw ArithmeticException. This exception is thrown when > calling {{getDecimal}} in UnsafeRow with input decimal's precision greater > than the input precision. Setting the value of the overflowed decimal to null > when writing into UnsafeRow should fix this issue.
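A rough illustration of the fix described above (plain Scala, not Spark's actual UnsafeRowWriter/Decimal code; the function name is made up): a value with more digits than the declared precision cannot be represented, so the writer should record null instead of letting a later `getDecimal` call throw.

```scala
// Illustrative only: mimics "set overflowed decimal to null on write".
// A Some result is stored as-is; None stands for writing a null slot.
def writeDecimal(value: BigDecimal, precision: Int): Option[BigDecimal] =
  if (value.precision <= precision) Some(value) // fits the declared precision
  else None                                     // overflow: store null instead
```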
[jira] [Commented] (SPARK-32109) SQL hash function handling of nulls makes collision too likely
[ https://issues.apache.org/jira/browse/SPARK-32109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147328#comment-17147328 ] Chen Zhang commented on SPARK-32109: The logic in the source code can be represented by the following pseudocode. {code:scala} def computeHash(value: Any, hashSeed: Long): Long = { value match { case null => hashSeed case b: Boolean => hashInt(if (b) 1 else 0, hashSeed) // Murmur3Hash case i: Int => hashInt(i, hashSeed) ... } } val seed = 42L var hash = seed var i = 0 val len = columns.length while (i < len) { hash = computeHash(columns(i).value, hash) i += 1 } hash {code} I can solve this problem by modifying the following code. (eval function and doGenCode function in org.apache.spark.sql.catalyst.expressions.HashExpression class) {code:scala} override def eval(input: InternalRow = null): Any = { var hash = seed var i = 0 val len = children.length while (i < len) { //hash = computeHash(children(i).eval(input), children(i).dataType, hash) hash = (31 * hash) + computeHash(children(i).eval(input), children(i).dataType, hash) i += 1 } hash } {code} But I don't think it's necessary to modify the code, and if we do, it will affect the existing data distribution. > SQL hash function handling of nulls makes collision too likely > -- > > Key: SPARK-32109 > URL: https://issues.apache.org/jira/browse/SPARK-32109 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: koert kuipers >Priority: Minor > > this ticket is about org.apache.spark.sql.functions.hash and sparks handling > of nulls when hashing sequences. > {code:java} > scala> spark.sql("SELECT hash('bar', null)").show() > +---+ > |hash(bar, NULL)| > +---+ > |-1808790533| > +---+ > scala> spark.sql("SELECT hash(null, 'bar')").show() > +---+ > |hash(NULL, bar)| > +---+ > |-1808790533| > +---+ > {code} > these are different sequences. e.g. 
these could be positions 0 and 1 in a > dataframe which are different columns with entirely different meanings. the > hashes should not be the same. > another example: > {code:java} > scala> Seq(("john", null), (null, "john")).toDF("name", > "alias").withColumn("hash", hash(col("name"), col("alias"))).show > +----+-----+---------+ > |name|alias| hash| > +----+-----+---------+ > |john| null|487839701| > |null| john|487839701| > +----+-----+---------+ {code} > instead of ignoring nulls, each null should apply a transform to the hash so that > the order of elements, including nulls, matters for the outcome. >
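One way to make null positions matter, close in spirit to the `31 * hash` suggestion in the comment above, is to fold every slot into the accumulator and use a fixed sentinel for null (plain Scala sketch, not Spark's Murmur3 implementation; the function name is illustrative):

```scala
// Illustrative order-sensitive combiner. Unlike the current behavior,
// a null contributes a sentinel at its position, so Seq("bar", null)
// and Seq(null, "bar") hash differently.
def orderSensitiveHash(values: Seq[Any], seed: Int = 42): Int =
  values.foldLeft(seed) { (acc, v) =>
    val h = if (v == null) 0 else v.hashCode // sentinel 0 for null
    31 * acc + h
  }
```

With this combiner the null no longer passes the accumulator through unchanged, which is exactly why the current implementation collides on `("bar", null)` and `(null, "bar")`.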
[jira] [Created] (SPARK-32118) Use fine-grained read write lock for each database in HiveExternalCatalog
Lantao Jin created SPARK-32118: -- Summary: Use fine-grained read write lock for each database in HiveExternalCatalog Key: SPARK-32118 URL: https://issues.apache.org/jira/browse/SPARK-32118 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0, 3.1.0 Reporter: Lantao Jin In HiveExternalCatalog, all metastore operations are synchronized on the same object lock. In a heavily loaded Spark Thrift Server or Spark driver, users' queries may be blocked by any long-running operation. For example, if a user is accessing a table which contains a massive number of partitions, the operation loadDynamicPartitions() holds the object lock for a long time, and all other queries block waiting for the lock.
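A minimal sketch of the fine-grained locking idea (plain Scala over java.util.concurrent, not HiveExternalCatalog's actual code; the object and method names here are made up): keep one read-write lock per database name, so a long loadDynamicPartitions() on one database no longer blocks queries against the others.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.locks.ReentrantReadWriteLock

// Illustrative per-database lock registry (names are hypothetical).
object DatabaseLocks {
  private val locks = new ConcurrentHashMap[String, ReentrantReadWriteLock]()

  // Lazily and atomically create one read-write lock per database name.
  private def lockFor(db: String): ReentrantReadWriteLock =
    locks.computeIfAbsent(db, _ => new ReentrantReadWriteLock())

  // Read-only metastore calls on different databases proceed concurrently.
  def withReadLock[T](db: String)(body: => T): T = {
    val l = lockFor(db).readLock(); l.lock()
    try body finally l.unlock()
  }

  // A writer (e.g. a long partition load) only blocks its own database.
  def withWriteLock[T](db: String)(body: => T): T = {
    val l = lockFor(db).writeLock(); l.lock()
    try body finally l.unlock()
  }
}
```

Under this scheme, a `withWriteLock("db1") { ... }` holding the lock for minutes would not delay a `withReadLock("db2") { ... }` on another database.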
[jira] [Assigned] (SPARK-32118) Use fine-grained read write lock for each database in HiveExternalCatalog
[ https://issues.apache.org/jira/browse/SPARK-32118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32118: Assignee: Apache Spark
[jira] [Commented] (SPARK-32118) Use fine-grained read write lock for each database in HiveExternalCatalog
[ https://issues.apache.org/jira/browse/SPARK-32118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147353#comment-17147353 ] Apache Spark commented on SPARK-32118: -- User 'LantaoJin' has created a pull request for this issue: https://github.com/apache/spark/pull/28938
[jira] [Assigned] (SPARK-32118) Use fine-grained read write lock for each database in HiveExternalCatalog
[ https://issues.apache.org/jira/browse/SPARK-32118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32118: Assignee: (was: Apache Spark)
[jira] [Comment Edited] (SPARK-32109) SQL hash function handling of nulls makes collision too likely
[ https://issues.apache.org/jira/browse/SPARK-32109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147358#comment-17147358 ] koert kuipers edited comment on SPARK-32109 at 6/28/20, 2:58 PM: - the issue is that row here isnt really a sequence. it represent an object. if you have say an object Person(name: String, nickname: String) you would not want Person("john", null) and Person(null, "john") to have same hashCode. see for example the suggested hashcode implementations in effective java by joshua bloch. they do something similar to what you suggest to solve this problem. so unfortunately i think our current implementation is flawed :( p.s. even for pure sequences i do not think this implementation as it is right now is acceptable. but that is less of a worry than the object represenation of row. was (Author: koert): the issue is that Row here isnt really a sequence. it represent an object. if you have say an object Person(name: String, nickname: String) you would not want Person("john", null) and Person(null, "john") to have same hashCode. see for example the suggested hashcode implementations in effective java by joshua bloch. they do something similar to what you suggest to solve this problem. so unfortunately i think our current implementation is flawed :( PS even for pure sequences i do not think this implementation as it is right now is acceptable. but that is less of a worry than the object represenation of row. > SQL hash function handling of nulls makes collision too likely > -- > > Key: SPARK-32109 > URL: https://issues.apache.org/jira/browse/SPARK-32109 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: koert kuipers >Priority: Minor > > this ticket is about org.apache.spark.sql.functions.hash and sparks handling > of nulls when hashing sequences. 
> {code:java} > scala> spark.sql("SELECT hash('bar', null)").show() > +---+ > |hash(bar, NULL)| > +---+ > |-1808790533| > +---+ > scala> spark.sql("SELECT hash(null, 'bar')").show() > +---+ > |hash(NULL, bar)| > +---+ > |-1808790533| > +---+ > {code} > these are different sequences, e.g. these could be positions 0 and 1 in a > dataframe which are different columns with entirely different meanings. the > hashes should not be the same. > another example: > {code:java} > scala> Seq(("john", null), (null, "john")).toDF("name", > "alias").withColumn("hash", hash(col("name"), col("alias"))).show > ++-+-+ > |name|alias| hash| > ++-+-+ > |john| null|487839701| > |null| john|487839701| > ++-+-+ {code} > instead of ignoring nulls, each null should apply a transform to the hash so that > the order of elements, including the nulls, matters for the outcome. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32109) SQL hash function handling of nulls makes collision too likely
[ https://issues.apache.org/jira/browse/SPARK-32109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147358#comment-17147358 ] koert kuipers commented on SPARK-32109: --- the issue is that Row here isn't really a sequence; it represents an object. if you have, say, an object Person(name: String, nickname: String), you would not want Person("john", null) and Person(null, "john") to have the same hashCode. see for example the suggested hashCode implementations in Effective Java by Joshua Bloch; they do something similar to what you suggest to solve this problem. so unfortunately i think our current implementation is flawed :( PS even for pure sequences i do not think the implementation as it stands is acceptable, but that is less of a worry than the object representation of Row. > SQL hash function handling of nulls makes collision too likely > -- > > Key: SPARK-32109 > URL: https://issues.apache.org/jira/browse/SPARK-32109 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: koert kuipers >Priority: Minor > > this ticket is about org.apache.spark.sql.functions.hash and Spark's handling > of nulls when hashing sequences. > {code:java} > scala> spark.sql("SELECT hash('bar', null)").show() > +---+ > |hash(bar, NULL)| > +---+ > |-1808790533| > +---+ > scala> spark.sql("SELECT hash(null, 'bar')").show() > +---+ > |hash(NULL, bar)| > +---+ > |-1808790533| > +---+ > {code} > these are different sequences, e.g. these could be positions 0 and 1 in a > dataframe which are different columns with entirely different meanings. the > hashes should not be the same. 
> another example: > {code:java} > scala> Seq(("john", null), (null, "john")).toDF("name", > "alias").withColumn("hash", hash(col("name"), col("alias"))).show > ++-+-+ > |name|alias| hash| > ++-+-+ > |john| null|487839701| > |null| john|487839701| > ++-+-+ {code} > instead of ignoring nulls, each null should apply a transform to the hash so that > the order of elements, including the nulls, matters for the outcome. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
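The collision described above disappears when nulls are folded into the hash positionally rather than skipped. As a point of comparison (this is the standard JDK recipe in the Effective Java style, not Spark's hash function), `java.util.Objects.hash` lets a null element contribute 0 but still multiplies the accumulator by 31 for its position, so swapping a value and a null changes the result:

```java
import java.util.Objects;

// Order-sensitive hashing in the Effective Java style: every element,
// null or not, occupies a position, because the accumulator is multiplied
// by 31 before the element's hash (0 for null) is added.
public class NullAwareHash {
    public static void main(String[] args) {
        int johnNull = Objects.hash("john", null); // like Person("john", null)
        int nullJohn = Objects.hash(null, "john"); // like Person(null, "john")
        // Unlike Spark's hash(), which skips nulls, these two differ.
        System.out.println(johnNull != nullJohn); // true
    }
}
```

Any null-marking transform with this positional property would fix both examples in the ticket.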
[jira] [Created] (SPARK-32119) ExecutorPlugin doesn't work with Standalone Cluster
Kousuke Saruta created SPARK-32119: -- Summary: ExecutorPlugin doesn't work with Standalone Cluster Key: SPARK-32119 URL: https://issues.apache.org/jira/browse/SPARK-32119 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.1.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta ExecutorPlugin doesn't work with Standalone Cluster (and possibly with cluster managers other than YARN) when a jar containing plugins, and files used by the plugins, are added via the --jars and --files options of spark-submit. This is because jars and files added by --jars and --files are not loaded on Executor initialization. I confirmed it works with YARN because jars/files are distributed via the distributed cache. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32119) ExecutorPlugin doesn't work with Standalone Cluster
[ https://issues.apache.org/jira/browse/SPARK-32119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-32119: --- Description: ExecutorPlugin can't work with Standalone Cluster (maybe with other cluster manager too except YARN. ) when a jar which contains plugins and files used by the plugins are added by --jars and --files option with spark-submit. This is because jars and files added by --jars and --files are not loaded on Executor initialization. I confirmed it works with YARN because jars/files are distributed as distributed cache. was: ExecutorPlugin can't work with Standalone Cluster (maybe with other cluster manager too except YARN. ) when a jar which contains plugins and files used by the plugins are added by --jars and --files option with spark-submit. This is because jars and files added by --jars and --files are not loaded on Executor initialization. I confirmed it works **with YARN because jars/files are distributed as distributed cache. > ExecutorPlugin doesn't work with Standalone Cluster > --- > > Key: SPARK-32119 > URL: https://issues.apache.org/jira/browse/SPARK-32119 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > ExecutorPlugin can't work with Standalone Cluster (maybe with other cluster > manager too except YARN. ) > when a jar which contains plugins and files used by the plugins are added by > --jars and --files option with spark-submit. > This is because jars and files added by --jars and --files are not loaded on > Executor initialization. > I confirmed it works with YARN because jars/files are distributed as > distributed cache. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32115) Incorrect results for SUBSTRING when overflow
[ https://issues.apache.org/jira/browse/SPARK-32115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147413#comment-17147413 ] Dongjoon Hyun commented on SPARK-32115: --- Thank you, @Yuanjian Li . > Incorrect results for SUBSTRING when overflow > - > > Key: SPARK-32115 > URL: https://issues.apache.org/jira/browse/SPARK-32115 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuanjian Li >Priority: Major > > SQL query SELECT SUBSTRING("abc", -1207959552, -1207959552) incorrectly > returns "abc" against expected output of "". > This is a result of integer overflow in addition > [https://github.com/apache/spark/blob/8c44d744631516a5cdaf63406e69a9dd11e5b878/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L345] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32115) Incorrect results for SUBSTRING when overflow
[ https://issues.apache.org/jira/browse/SPARK-32115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32115: -- Affects Version/s: 2.4.6 > Incorrect results for SUBSTRING when overflow > - > > Key: SPARK-32115 > URL: https://issues.apache.org/jira/browse/SPARK-32115 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6, 3.0.0 >Reporter: Yuanjian Li >Priority: Major > > SQL query SELECT SUBSTRING("abc", -1207959552, -1207959552) incorrectly > returns "abc" against expected output of "". > This is a result of integer overflow in addition > [https://github.com/apache/spark/blob/8c44d744631516a5cdaf63406e69a9dd11e5b878/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L345] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32119) ExecutorPlugin doesn't work with Standalone Cluster
[ https://issues.apache.org/jira/browse/SPARK-32119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147414#comment-17147414 ] Apache Spark commented on SPARK-32119: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/28939 > ExecutorPlugin doesn't work with Standalone Cluster > --- > > Key: SPARK-32119 > URL: https://issues.apache.org/jira/browse/SPARK-32119 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > ExecutorPlugin can't work with Standalone Cluster (maybe with other cluster > manager too except YARN. ) > when a jar which contains plugins and files used by the plugins are added by > --jars and --files option with spark-submit. > This is because jars and files added by --jars and --files are not loaded on > Executor initialization. > I confirmed it works with YARN because jars/files are distributed as > distributed cache. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32115) Incorrect results for SUBSTRING when overflow
[ https://issues.apache.org/jira/browse/SPARK-32115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32115: -- Affects Version/s: 2.3.4 > Incorrect results for SUBSTRING when overflow > - > > Key: SPARK-32115 > URL: https://issues.apache.org/jira/browse/SPARK-32115 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.6, 3.0.0 >Reporter: Yuanjian Li >Priority: Major > > SQL query SELECT SUBSTRING("abc", -1207959552, -1207959552) incorrectly > returns "abc" against expected output of "". > This is a result of integer overflow in addition > [https://github.com/apache/spark/blob/8c44d744631516a5cdaf63406e69a9dd11e5b878/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L345] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32119) ExecutorPlugin doesn't work with Standalone Cluster
[ https://issues.apache.org/jira/browse/SPARK-32119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32119: Assignee: Kousuke Saruta (was: Apache Spark) > ExecutorPlugin doesn't work with Standalone Cluster > --- > > Key: SPARK-32119 > URL: https://issues.apache.org/jira/browse/SPARK-32119 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > ExecutorPlugin can't work with Standalone Cluster (maybe with other cluster > manager too except YARN. ) > when a jar which contains plugins and files used by the plugins are added by > --jars and --files option with spark-submit. > This is because jars and files added by --jars and --files are not loaded on > Executor initialization. > I confirmed it works with YARN because jars/files are distributed as > distributed cache. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32115) Incorrect results for SUBSTRING when overflow
[ https://issues.apache.org/jira/browse/SPARK-32115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32115: -- Affects Version/s: 2.2.3 > Incorrect results for SUBSTRING when overflow > - > > Key: SPARK-32115 > URL: https://issues.apache.org/jira/browse/SPARK-32115 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.3, 2.3.4, 2.4.6, 3.0.0 >Reporter: Yuanjian Li >Priority: Major > > SQL query SELECT SUBSTRING("abc", -1207959552, -1207959552) incorrectly > returns "abc" against expected output of "". > This is a result of integer overflow in addition > [https://github.com/apache/spark/blob/8c44d744631516a5cdaf63406e69a9dd11e5b878/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L345] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32119) ExecutorPlugin doesn't work with Standalone Cluster
[ https://issues.apache.org/jira/browse/SPARK-32119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32119: Assignee: Apache Spark (was: Kousuke Saruta) > ExecutorPlugin doesn't work with Standalone Cluster > --- > > Key: SPARK-32119 > URL: https://issues.apache.org/jira/browse/SPARK-32119 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark >Priority: Major > > ExecutorPlugin can't work with Standalone Cluster (maybe with other cluster > manager too except YARN. ) > when a jar which contains plugins and files used by the plugins are added by > --jars and --files option with spark-submit. > This is because jars and files added by --jars and --files are not loaded on > Executor initialization. > I confirmed it works with YARN because jars/files are distributed as > distributed cache. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32120) Single GPU is allocated multiple times
Enrico Minack created SPARK-32120: - Summary: Single GPU is allocated multiple times Key: SPARK-32120 URL: https://issues.apache.org/jira/browse/SPARK-32120 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 3.0.0 Reporter: Enrico Minack Running Spark in a {{local-cluster[2,1,1024]}} with one GPU per worker, task and executor, and two GPUs provided through a GPU discovery script, the same GPU is allocated to both executors. Discovery script output: {code} {"name": "gpu", "addresses": ["0", "1"]} {code} Spark local cluster setup through {{spark-shell}}: {code} ./spark-3.0.0-bin-hadoop2.7/bin/spark-shell --master "local-cluster[2,1,1024]" --conf spark.worker.resource.gpu.discoveryScript=/tmp/gpu.json --conf spark.worker.resource.gpu.amount=1 --conf spark.task.resource.gpu.amount=1 --conf spark.executor.resource.gpu.amount=1 {code} Executor of this cluster: Code run in the Spark shell: {code} scala> import org.apache.spark.TaskContext import org.apache.spark.TaskContext scala> def fn(it: Iterator[java.lang.Long]): Iterator[(String, (String, Array[String]))] = { TaskContext.get().resources().mapValues(v => (v.name, v.addresses)).iterator } fn: (it: Iterator[Long])Iterator[(String, (String, Array[String]))] scala> spark.range(0,2,1,2).mapPartitions(fn).collect res0: Array[(String, (String, Array[String]))] = Array((gpu,(gpu,Array(1))), (gpu,(gpu,Array(1)))) {code} The result shows that each task got GPU {{1}}. The executor page shows that each task has been run on different executors: The expected behaviour would have been to have GPU {{0}} assigned to one executor and GPU {{1}} to the other executor. Consequently, each partition / task should then see a different GPU. With Spark 3.0.0-preview2 the allocation was as expected (identical code and Spark shell setup): {code} res0: Array[(String, (String, Array[String]))] = Array((gpu,(gpu,Array(0))), (gpu,(gpu,Array(1)))) {code} Happy to contribute a patch if this is an accepted bug. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32120) Single GPU is allocated multiple times
[ https://issues.apache.org/jira/browse/SPARK-32120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enrico Minack updated SPARK-32120: -- Attachment: screenshot-1.png > Single GPU is allocated multiple times > -- > > Key: SPARK-32120 > URL: https://issues.apache.org/jira/browse/SPARK-32120 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 3.0.0 >Reporter: Enrico Minack >Priority: Major > Attachments: screenshot-1.png > > > Running spark in a {{local-cluster[2,1,1024]}} with one GPU per worker, task > and executor and two GPUs provided through a GPU discovery script, the same > GPU is allocated to both executors. > Discovery script output: > {code} > {"name": "gpu", "addresses": ["0", "1"]} > {code} > Spark local cluster setup through `spark-shell`: > {code} > ./spark-3.0.0-bin-hadoop2.7/bin/spark-shell --master > "local-cluster[2,1,1024]" --conf > spark.worker.resource.gpu.discoveryScript=/tmp/gpu.json --conf > spark.worker.resource.gpu.amount=1 --conf spark.task.resource.gpu.amount=1 > --conf spark.executor.resource.gpu.amount=1 > {code} > Executor of this cluster: > Code run in the Spark shell: > {code} > scala> import org.apache.spark.TaskContext > import org.apache.spark.TaskContext > scala> def fn(it: Iterator[java.lang.Long]): Iterator[(String, (String, > Array[String]))] = { TaskContext.get().resources().mapValues(v => (v.name, > v.addresses)).iterator } > fn: (it: Iterator[Long])Iterator[(String, (String, Array[String]))] > scala> spark.range(0,2,1,2).mapPartitions(fn).collect > res0: Array[(String, (String, Array[String]))] = Array((gpu,(gpu,Array(1))), > (gpu,(gpu,Array(1 > {code} > The result shows that each task got GPU {{1}}. The executor page shows that > each task has been run on different executors: > The expected behaviour would have been to have GPU `0` assigned to one > executor and GPU {{1}} to the other executor. Consequently, each partition / > task should then see a different GPU. 
> With Spark 3.0.0-preview2 the allocation was as expected (identical code and > Spark shell setup): > {code} > res0: Array[(String, (String, Array[String]))] = Array((gpu,(gpu,Array(0))), > (gpu,(gpu,Array(1 > {code} > Happy to contribute a patch if this is an accepted bug. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32115) Incorrect results for SUBSTRING when overflow
[ https://issues.apache.org/jira/browse/SPARK-32115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32115: -- Affects Version/s: 2.1.3 > Incorrect results for SUBSTRING when overflow > - > > Key: SPARK-32115 > URL: https://issues.apache.org/jira/browse/SPARK-32115 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0 >Reporter: Yuanjian Li >Priority: Major > > SQL query SELECT SUBSTRING("abc", -1207959552, -1207959552) incorrectly > returns "abc" against expected output of "". > This is a result of integer overflow in addition > [https://github.com/apache/spark/blob/8c44d744631516a5cdaf63406e69a9dd11e5b878/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L345] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32115) Incorrect results for SUBSTRING when overflow
[ https://issues.apache.org/jira/browse/SPARK-32115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32115: -- Labels: correctness (was: ) > Incorrect results for SUBSTRING when overflow > - > > Key: SPARK-32115 > URL: https://issues.apache.org/jira/browse/SPARK-32115 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0 >Reporter: Yuanjian Li >Priority: Major > Labels: correctness > > SQL query SELECT SUBSTRING("abc", -1207959552, -1207959552) incorrectly > returns "abc" against expected output of "". > This is a result of integer overflow in addition > [https://github.com/apache/spark/blob/8c44d744631516a5cdaf63406e69a9dd11e5b878/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L345] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32115) Incorrect results for SUBSTRING when overflow
[ https://issues.apache.org/jira/browse/SPARK-32115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32115: -- Affects Version/s: 2.0.2 > Incorrect results for SUBSTRING when overflow > - > > Key: SPARK-32115 > URL: https://issues.apache.org/jira/browse/SPARK-32115 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0 >Reporter: Yuanjian Li >Priority: Major > > SQL query SELECT SUBSTRING("abc", -1207959552, -1207959552) incorrectly > returns "abc" against expected output of "". > This is a result of integer overflow in addition > [https://github.com/apache/spark/blob/8c44d744631516a5cdaf63406e69a9dd11e5b878/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L345] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32120) Single GPU is allocated multiple times
[ https://issues.apache.org/jira/browse/SPARK-32120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enrico Minack updated SPARK-32120: -- Attachment: screenshot-2.png > Single GPU is allocated multiple times > -- > > Key: SPARK-32120 > URL: https://issues.apache.org/jira/browse/SPARK-32120 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 3.0.0 >Reporter: Enrico Minack >Priority: Major > Attachments: screenshot-1.png, screenshot-2.png > > > Running spark in a {{local-cluster[2,1,1024]}} with one GPU per worker, task > and executor and two GPUs provided through a GPU discovery script, the same > GPU is allocated to both executors. > Discovery script output: > {code} > {"name": "gpu", "addresses": ["0", "1"]} > {code} > Spark local cluster setup through `spark-shell`: > {code} > ./spark-3.0.0-bin-hadoop2.7/bin/spark-shell --master > "local-cluster[2,1,1024]" --conf > spark.worker.resource.gpu.discoveryScript=/tmp/gpu.json --conf > spark.worker.resource.gpu.amount=1 --conf spark.task.resource.gpu.amount=1 > --conf spark.executor.resource.gpu.amount=1 > {code} > Executor of this cluster: > Code run in the Spark shell: > {code} > scala> import org.apache.spark.TaskContext > import org.apache.spark.TaskContext > scala> def fn(it: Iterator[java.lang.Long]): Iterator[(String, (String, > Array[String]))] = { TaskContext.get().resources().mapValues(v => (v.name, > v.addresses)).iterator } > fn: (it: Iterator[Long])Iterator[(String, (String, Array[String]))] > scala> spark.range(0,2,1,2).mapPartitions(fn).collect > res0: Array[(String, (String, Array[String]))] = Array((gpu,(gpu,Array(1))), > (gpu,(gpu,Array(1 > {code} > The result shows that each task got GPU {{1}}. The executor page shows that > each task has been run on different executors: > The expected behaviour would have been to have GPU `0` assigned to one > executor and GPU {{1}} to the other executor. 
Consequently, each partition / > task should then see a different GPU. > With Spark 3.0.0-preview2 the allocation was as expected (identical code and > Spark shell setup): > {code} > res0: Array[(String, (String, Array[String]))] = Array((gpu,(gpu,Array(0))), > (gpu,(gpu,Array(1 > {code} > Happy to contribute a patch if this is an accepted bug. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32120) Single GPU is allocated multiple times
[ https://issues.apache.org/jira/browse/SPARK-32120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enrico Minack updated SPARK-32120: -- Attachment: screenshot-3.png > Single GPU is allocated multiple times > -- > > Key: SPARK-32120 > URL: https://issues.apache.org/jira/browse/SPARK-32120 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 3.0.0 >Reporter: Enrico Minack >Priority: Major > Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png > > > Running spark in a {{local-cluster[2,1,1024]}} with one GPU per worker, task > and executor and two GPUs provided through a GPU discovery script, the same > GPU is allocated to both executors. > Discovery script output: > {code} > {"name": "gpu", "addresses": ["0", "1"]} > {code} > Spark local cluster setup through `spark-shell`: > {code} > ./spark-3.0.0-bin-hadoop2.7/bin/spark-shell --master > "local-cluster[2,1,1024]" --conf > spark.worker.resource.gpu.discoveryScript=/tmp/gpu.json --conf > spark.worker.resource.gpu.amount=1 --conf spark.task.resource.gpu.amount=1 > --conf spark.executor.resource.gpu.amount=1 > {code} > Executor of this cluster: > Code run in the Spark shell: > {code} > scala> import org.apache.spark.TaskContext > import org.apache.spark.TaskContext > scala> def fn(it: Iterator[java.lang.Long]): Iterator[(String, (String, > Array[String]))] = { TaskContext.get().resources().mapValues(v => (v.name, > v.addresses)).iterator } > fn: (it: Iterator[Long])Iterator[(String, (String, Array[String]))] > scala> spark.range(0,2,1,2).mapPartitions(fn).collect > res0: Array[(String, (String, Array[String]))] = Array((gpu,(gpu,Array(1))), > (gpu,(gpu,Array(1 > {code} > The result shows that each task got GPU {{1}}. The executor page shows that > each task has been run on different executors: > The expected behaviour would have been to have GPU `0` assigned to one > executor and GPU {{1}} to the other executor. 
Consequently, each partition / > task should then see a different GPU. > With Spark 3.0.0-preview2 the allocation was as expected (identical code and > Spark shell setup): > {code} > res0: Array[(String, (String, Array[String]))] = Array((gpu,(gpu,Array(0))), > (gpu,(gpu,Array(1 > {code} > Happy to contribute a patch if this is an accepted bug. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32115) Incorrect results for SUBSTRING when overflow
[ https://issues.apache.org/jira/browse/SPARK-32115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32115: -- Affects Version/s: 1.6.3 > Incorrect results for SUBSTRING when overflow > - > > Key: SPARK-32115 > URL: https://issues.apache.org/jira/browse/SPARK-32115 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0 >Reporter: Yuanjian Li >Priority: Major > Labels: correctness > > SQL query SELECT SUBSTRING("abc", -1207959552, -1207959552) incorrectly > returns "abc" against expected output of "". > This is a result of integer overflow in addition > [https://github.com/apache/spark/blob/8c44d744631516a5cdaf63406e69a9dd11e5b878/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L345] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32115) Incorrect results for SUBSTRING when overflow
[ https://issues.apache.org/jira/browse/SPARK-32115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147416#comment-17147416 ] Dongjoon Hyun commented on SPARK-32115: --- I also verified that this is a long-standing bug from 1.6.3 through 3.0.0, while Apache Hive 2.3.7 has no problem. {code} hive> SELECT SUBSTRING("abc", -1207959552, -1207959552); OK Time taken: 4.291 seconds, Fetched: 1 row(s) {code} > Incorrect results for SUBSTRING when overflow > - > > Key: SPARK-32115 > URL: https://issues.apache.org/jira/browse/SPARK-32115 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0 >Reporter: Yuanjian Li >Priority: Major > Labels: correctness > > SQL query SELECT SUBSTRING("abc", -1207959552, -1207959552) incorrectly > returns "abc" against expected output of "". > This is a result of integer overflow in addition > [https://github.com/apache/spark/blob/8c44d744631516a5cdaf63406e69a9dd11e5b878/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L345] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32115) Incorrect results for SUBSTRING when overflow
[ https://issues.apache.org/jira/browse/SPARK-32115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32115: -- Priority: Blocker (was: Major) > Incorrect results for SUBSTRING when overflow > - > > Key: SPARK-32115 > URL: https://issues.apache.org/jira/browse/SPARK-32115 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0 >Reporter: Yuanjian Li >Priority: Blocker > Labels: correctness > > SQL query SELECT SUBSTRING("abc", -1207959552, -1207959552) incorrectly > returns "abc" against expected output of "". > This is a result of integer overflow in addition > [https://github.com/apache/spark/blob/8c44d744631516a5cdaf63406e69a9dd11e5b878/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L345] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32115) Incorrect results for SUBSTRING when overflow
[ https://issues.apache.org/jira/browse/SPARK-32115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32115: -- Target Version/s: 2.4.7, 3.0.1 > Incorrect results for SUBSTRING when overflow > - > > Key: SPARK-32115 > URL: https://issues.apache.org/jira/browse/SPARK-32115 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0 >Reporter: Yuanjian Li >Priority: Blocker > Labels: correctness > > SQL query SELECT SUBSTRING("abc", -1207959552, -1207959552) incorrectly > returns "abc" against expected output of "". > This is a result of integer overflow in addition > [https://github.com/apache/spark/blob/8c44d744631516a5cdaf63406e69a9dd11e5b878/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L345] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32115) Incorrect results for SUBSTRING when overflow
[ https://issues.apache.org/jira/browse/SPARK-32115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147417#comment-17147417 ] Dongjoon Hyun commented on SPARK-32115: --- Although this might be a rare case, I'm raising this as a Blocker because it is a correctness issue. > Incorrect results for SUBSTRING when overflow > - > > Key: SPARK-32115 > URL: https://issues.apache.org/jira/browse/SPARK-32115 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0 >Reporter: Yuanjian Li >Priority: Major > Labels: correctness > > SQL query SELECT SUBSTRING("abc", -1207959552, -1207959552) incorrectly > returns "abc" against expected output of "". > This is a result of integer overflow in addition > [https://github.com/apache/spark/blob/8c44d744631516a5cdaf63406e69a9dd11e5b878/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L345] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
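[Editor's note] The overflow described in SPARK-32115 can be reproduced outside Spark. A minimal Python sketch simulating Java's wrapping 32-bit integer addition (the constants come from the report; `int32_add` is a hypothetical helper, not Spark code):

```python
def int32_add(a, b):
    """Simulate Java's wrapping 32-bit (two's-complement) integer addition."""
    return ((a + b + 2**31) % 2**32) - 2**31

# For SUBSTRING("abc", -1207959552, -1207959552), computing pos + length in
# 32-bit arithmetic wraps around to a large positive value, so the computed
# end index lands far past the string instead of before its start, and the
# substring bounds get clamped to the whole string.
pos, length = -1207959552, -1207959552
wrapped_end = int32_add(pos, length)  # wraps to 1879048192
true_end = pos + length               # Python ints don't wrap: -2415919104
print(wrapped_end, true_end)
```

Performing the bound computation in 64-bit arithmetic (or clamping before adding) avoids the wrap; that is presumably the shape of the fix, not a claim about the merged patch.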
[jira] [Updated] (SPARK-32120) Single GPU is allocated multiple times
[ https://issues.apache.org/jira/browse/SPARK-32120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enrico Minack updated SPARK-32120: -- Description: I am running Spark in a {{local-cluster[2,1,1024]}} with one GPU per worker, task and executor, and two GPUs provided through a GPU discovery script. The same GPU is allocated to both executors. Discovery script output: {code:java} {"name": "gpu", "addresses": ["0", "1"]} {code} Spark local cluster setup through {{spark-shell}}: {code:java} ./spark-3.0.0-bin-hadoop2.7/bin/spark-shell --master "local-cluster[2,1,1024]" --conf spark.worker.resource.gpu.discoveryScript=/tmp/gpu.json --conf spark.worker.resource.gpu.amount=1 --conf spark.task.resource.gpu.amount=1 --conf spark.executor.resource.gpu.amount=1 {code} Executor page of this cluster: !screenshot-2.png! You can see that both executors have the same GPU allocated: {{[1]}} Code run in the Spark shell: {code:java} scala> import org.apache.spark.TaskContext import org.apache.spark.TaskContext scala> def fn(it: Iterator[java.lang.Long]): Iterator[(String, (String, Array[String]))] = { TaskContext.get().resources().mapValues(v => (v.name, v.addresses)).iterator } fn: (it: Iterator[Long])Iterator[(String, (String, Array[String]))] scala> spark.range(0,2,1,2).mapPartitions(fn).collect res0: Array[(String, (String, Array[String]))] = Array((gpu,(gpu,Array(1))), (gpu,(gpu,Array(1 {code} The result shows that each task got GPU {{1}}. The executor page shows that each task has been run on different executors (see above screenshot). The expected behaviour would have been to have GPU {{0}} assigned to one executor and GPU {{1}} to the other executor. Consequently, each partition / task should then see a different GPU. With Spark 3.0.0-preview2 the allocation was as expected (identical code and Spark shell setup): {code:java} res0: Array[(String, (String, Array[String]))] = Array((gpu,(gpu,Array(0))), (gpu,(gpu,Array(1 {code} !screenshot-3.png! 
Happy to contribute a patch if this is an accepted bug. was: Running spark in a {{local-cluster[2,1,1024]}} with one GPU per worker, task and executor and two GPUs provided through a GPU discovery script, the same GPU is allocated to both executors. Discovery script output: {code} {"name": "gpu", "addresses": ["0", "1"]} {code} Spark local cluster setup through `spark-shell`: {code} ./spark-3.0.0-bin-hadoop2.7/bin/spark-shell --master "local-cluster[2,1,1024]" --conf spark.worker.resource.gpu.discoveryScript=/tmp/gpu.json --conf spark.worker.resource.gpu.amount=1 --conf spark.task.resource.gpu.amount=1 --conf spark.executor.resource.gpu.amount=1 {code} Executor of this cluster: Code run in the Spark shell: {code} scala> import org.apache.spark.TaskContext import org.apache.spark.TaskContext scala> def fn(it: Iterator[java.lang.Long]): Iterator[(String, (String, Array[String]))] = { TaskContext.get().resources().mapValues(v => (v.name, v.addresses)).iterator } fn: (it: Iterator[Long])Iterator[(String, (String, Array[String]))] scala> spark.range(0,2,1,2).mapPartitions(fn).collect res0: Array[(String, (String, Array[String]))] = Array((gpu,(gpu,Array(1))), (gpu,(gpu,Array(1 {code} The result shows that each task got GPU {{1}}. The executor page shows that each task has been run on different executors: The expected behaviour would have been to have GPU `0` assigned to one executor and GPU {{1}} to the other executor. Consequently, each partition / task should then see a different GPU. With Spark 3.0.0-preview2 the allocation was as expected (identical code and Spark shell setup): {code} res0: Array[(String, (String, Array[String]))] = Array((gpu,(gpu,Array(0))), (gpu,(gpu,Array(1 {code} Happy to contribute a patch if this is an accepted bug. 
> Single GPU is allocated multiple times > -- > > Key: SPARK-32120 > URL: https://issues.apache.org/jira/browse/SPARK-32120 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 3.0.0 >Reporter: Enrico Minack >Priority: Major > Attachments: screenshot-2.png, screenshot-3.png > > > I am running Spark in a {{local-cluster[2,1,1024]}} with one GPU per worker, > task and executor, and two GPUs provided through a GPU discovery script. The > same GPU is allocated to both executors. > Discovery script output: > {code:java} > {"name": "gpu", "addresses": ["0", "1"]} > {code} > Spark local cluster setup through {{spark-shell}}: > {code:java} > ./spark-3.0.0-bin-hadoop2.7/bin/spark-shell --master > "local-cluster[2,1,1024]" --conf > spark.worker.resource.gpu.discoveryScript=/tmp/gpu.json --conf > spark.worker.resource.gpu.amount=1 --conf spark.task.resource.gpu.amount=1 > --conf spark.executor.resource.gpu.amount=1 > {code} > Executor page of this cluster: >
[jira] [Updated] (SPARK-32120) Single GPU is allocated multiple times
[ https://issues.apache.org/jira/browse/SPARK-32120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enrico Minack updated SPARK-32120: -- Attachment: (was: screenshot-1.png) > Single GPU is allocated multiple times > -- > > Key: SPARK-32120 > URL: https://issues.apache.org/jira/browse/SPARK-32120 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 3.0.0 >Reporter: Enrico Minack >Priority: Major > Attachments: screenshot-2.png, screenshot-3.png > > > I am running Spark in a {{local-cluster[2,1,1024]}} with one GPU per worker, > task and executor, and two GPUs provided through a GPU discovery script. The > same GPU is allocated to both executors. > Discovery script output: > {code:java} > {"name": "gpu", "addresses": ["0", "1"]} > {code} > Spark local cluster setup through {{spark-shell}}: > {code:java} > ./spark-3.0.0-bin-hadoop2.7/bin/spark-shell --master > "local-cluster[2,1,1024]" --conf > spark.worker.resource.gpu.discoveryScript=/tmp/gpu.json --conf > spark.worker.resource.gpu.amount=1 --conf spark.task.resource.gpu.amount=1 > --conf spark.executor.resource.gpu.amount=1 > {code} > Executor page of this cluster: > !screenshot-2.png! > You can see that both executors have the same GPU allocated: {{[1]}} > Code run in the Spark shell: > {code:java} > scala> import org.apache.spark.TaskContext > import org.apache.spark.TaskContext > scala> def fn(it: Iterator[java.lang.Long]): Iterator[(String, (String, > Array[String]))] = { TaskContext.get().resources().mapValues(v => (v.name, > v.addresses)).iterator } > fn: (it: Iterator[Long])Iterator[(String, (String, Array[String]))] > scala> spark.range(0,2,1,2).mapPartitions(fn).collect > res0: Array[(String, (String, Array[String]))] = Array((gpu,(gpu,Array(1))), > (gpu,(gpu,Array(1 > {code} > The result shows that each task got GPU {{1}}. The executor page shows that > each task has been run on different executors (see above screenshot). 
> The expected behaviour would have been to have GPU {{0}} assigned to one > executor and GPU {{1}} to the other executor. Consequently, each partition / > task should then see a different GPU. > With Spark 3.0.0-preview2 the allocation was as expected (identical code and > Spark shell setup): > {code:java} > res0: Array[(String, (String, Array[String]))] = Array((gpu,(gpu,Array(0))), > (gpu,(gpu,Array(1 > {code} > !screenshot-3.png! > Happy to contribute a patch if this is an accepted bug. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
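[Editor's note] For context on the setup above: a GPU discovery script such as the {{/tmp/gpu.json}} referenced in the report is simply an executable that prints one JSON resource descriptor to stdout, which Spark's worker parses. A minimal Python sketch (the address list is the one from the report):

```python
import json

# Emit the resource descriptor that Spark reads from the discovery script's
# stdout: a single JSON object with the resource name and its addresses.
descriptor = {"name": "gpu", "addresses": ["0", "1"]}
print(json.dumps(descriptor))
```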
[jira] [Created] (SPARK-32121) ExternalShuffleBlockResolverSuite failed on windows
Cheng Pan created SPARK-32121: - Summary: ExternalShuffleBlockResolverSuite failed on windows Key: SPARK-32121 URL: https://issues.apache.org/jira/browse/SPARK-32121 Project: Spark Issue Type: Test Components: Tests Affects Versions: 3.0.0, 3.0.1 Environment: Windows 10 Reporter: Cheng Pan The method {code}ExecutorDiskUtils.createNormalizedInternedPathname{code} should consider the Windows file separator. {code} [ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.132 s <<< FAILURE! - in org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite [ERROR] testNormalizeAndInternPathname(org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite) Time elapsed: 0 s <<< FAILURE! org.junit.ComparisonFailure: expected: but was: at org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite.assertPathsMatch(ExternalShuffleBlockResolverSuite.java:160) at org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite.testNormalizeAndInternPathname(ExternalShuffleBlockResolverSuite.java:149) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
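[Editor's note] The kind of separator-agnostic normalization the failing suite expects can be sketched as follows; `normalize_pathname` is a hypothetical stand-in for `ExecutorDiskUtils.createNormalizedInternedPathname`, not the actual implementation:

```python
import re

def normalize_pathname(parent, sub, filename):
    # Hypothetical sketch: join the components, then collapse runs of either
    # separator style into '/', so Windows backslash paths compare equal to
    # the '/'-normalized paths the test assertions use.
    joined = "/".join([parent, sub, filename])
    return re.sub(r"[\\/]+", "/", joined)

print(normalize_pathname("/foo", "bar", "baz"))       # /foo/bar/baz
print(normalize_pathname("C:\\foo\\", "bar", "baz"))  # C:/foo/bar/baz
```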
[jira] [Resolved] (SPARK-32115) Incorrect results for SUBSTRING when overflow
[ https://issues.apache.org/jira/browse/SPARK-32115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-32115. --- Fix Version/s: 3.1.0 2.4.7 3.0.1 Resolution: Fixed Issue resolved by pull request 28937 [https://github.com/apache/spark/pull/28937] > Incorrect results for SUBSTRING when overflow > - > > Key: SPARK-32115 > URL: https://issues.apache.org/jira/browse/SPARK-32115 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Blocker > Labels: correctness > Fix For: 3.0.1, 2.4.7, 3.1.0 > > > SQL query SELECT SUBSTRING("abc", -1207959552, -1207959552) incorrectly > returns "abc" against expected output of "". > This is a result of integer overflow in addition > [https://github.com/apache/spark/blob/8c44d744631516a5cdaf63406e69a9dd11e5b878/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L345] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32115) Incorrect results for SUBSTRING when overflow
[ https://issues.apache.org/jira/browse/SPARK-32115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-32115: - Assignee: Yuanjian Li > Incorrect results for SUBSTRING when overflow > - > > Key: SPARK-32115 > URL: https://issues.apache.org/jira/browse/SPARK-32115 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Blocker > Labels: correctness > > SQL query SELECT SUBSTRING("abc", -1207959552, -1207959552) incorrectly > returns "abc" against expected output of "". > This is a result of integer overflow in addition > [https://github.com/apache/spark/blob/8c44d744631516a5cdaf63406e69a9dd11e5b878/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L345] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32119) ExecutorPlugin doesn't work with Standalone Cluster
[ https://issues.apache.org/jira/browse/SPARK-32119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147422#comment-17147422 ] Dongjoon Hyun commented on SPARK-32119: --- Hi, [~sarutak]. This sounds like a bug for `Standalone Cluster`. Can we switch this to `BUG` instead of `Improvement` for 3.1.0? > ExecutorPlugin doesn't work with Standalone Cluster > --- > > Key: SPARK-32119 > URL: https://issues.apache.org/jira/browse/SPARK-32119 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > ExecutorPlugin can't work with Standalone Cluster (maybe with other cluster > manager too except YARN. ) > when a jar which contains plugins and files used by the plugins are added by > --jars and --files option with spark-submit. > This is because jars and files added by --jars and --files are not loaded on > Executor initialization. > I confirmed it works with YARN because jars/files are distributed as > distributed cache. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32121) ExternalShuffleBlockResolverSuite failed on Windows
[ https://issues.apache.org/jira/browse/SPARK-32121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Pan updated SPARK-32121: -- Summary: ExternalShuffleBlockResolverSuite failed on Windows (was: ExternalShuffleBlockResolverSuite failed on windows) > ExternalShuffleBlockResolverSuite failed on Windows > --- > > Key: SPARK-32121 > URL: https://issues.apache.org/jira/browse/SPARK-32121 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0, 3.0.1 > Environment: Windows 10 >Reporter: Cheng Pan >Priority: Minor > > The method {code}ExecutorDiskUtils.createNormalizedInternedPathname{code} > should consider the Windows file separator. > {code} > [ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.132 > s <<< FAILURE! - in > org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite > [ERROR] > testNormalizeAndInternPathname(org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite) > Time elapsed: 0 s <<< FAILURE! > org.junit.ComparisonFailure: expected: but > was: > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite.assertPathsMatch(ExternalShuffleBlockResolverSuite.java:160) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite.testNormalizeAndInternPathname(ExternalShuffleBlockResolverSuite.java:149) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32119) ExecutorPlugin doesn't work with Standalone Cluster
[ https://issues.apache.org/jira/browse/SPARK-32119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-32119: --- Issue Type: Bug (was: Improvement) > ExecutorPlugin doesn't work with Standalone Cluster > --- > > Key: SPARK-32119 > URL: https://issues.apache.org/jira/browse/SPARK-32119 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > ExecutorPlugin can't work with Standalone Cluster (maybe with other cluster > manager too except YARN. ) > when a jar which contains plugins and files used by the plugins are added by > --jars and --files option with spark-submit. > This is because jars and files added by --jars and --files are not loaded on > Executor initialization. > I confirmed it works with YARN because jars/files are distributed as > distributed cache. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32119) ExecutorPlugin doesn't work with Standalone Cluster
[ https://issues.apache.org/jira/browse/SPARK-32119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-32119: --- Affects Version/s: 3.0.1 > ExecutorPlugin doesn't work with Standalone Cluster > --- > > Key: SPARK-32119 > URL: https://issues.apache.org/jira/browse/SPARK-32119 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > ExecutorPlugin can't work with Standalone Cluster (maybe with other cluster > manager too except YARN. ) > when a jar which contains plugins and files used by the plugins are added by > --jars and --files option with spark-submit. > This is because jars and files added by --jars and --files are not loaded on > Executor initialization. > I confirmed it works with YARN because jars/files are distributed as > distributed cache. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32119) ExecutorPlugin doesn't work with Standalone Cluster
[ https://issues.apache.org/jira/browse/SPARK-32119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147427#comment-17147427 ] Kousuke Saruta commented on SPARK-32119: Sorry, it's just a mistake. I've modified it. > ExecutorPlugin doesn't work with Standalone Cluster > --- > > Key: SPARK-32119 > URL: https://issues.apache.org/jira/browse/SPARK-32119 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > ExecutorPlugin can't work with Standalone Cluster (maybe with other cluster > manager too except YARN. ) > when a jar which contains plugins and files used by the plugins are added by > --jars and --files option with spark-submit. > This is because jars and files added by --jars and --files are not loaded on > Executor initialization. > I confirmed it works with YARN because jars/files are distributed as > distributed cache. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32121) ExternalShuffleBlockResolverSuite failed on Windows
[ https://issues.apache.org/jira/browse/SPARK-32121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147424#comment-17147424 ] Apache Spark commented on SPARK-32121: -- User 'pan3793' has created a pull request for this issue: https://github.com/apache/spark/pull/28940 > ExternalShuffleBlockResolverSuite failed on Windows > --- > > Key: SPARK-32121 > URL: https://issues.apache.org/jira/browse/SPARK-32121 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0, 3.0.1 > Environment: Windows 10 >Reporter: Cheng Pan >Priority: Minor > > The method {code}ExecutorDiskUtils.createNormalizedInternedPathname{code} > should consider the Windows file separator. > {code} > [ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.132 > s <<< FAILURE! - in > org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite > [ERROR] > testNormalizeAndInternPathname(org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite) > Time elapsed: 0 s <<< FAILURE! > org.junit.ComparisonFailure: expected: but > was: > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite.assertPathsMatch(ExternalShuffleBlockResolverSuite.java:160) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite.testNormalizeAndInternPathname(ExternalShuffleBlockResolverSuite.java:149) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32121) ExternalShuffleBlockResolverSuite failed on Windows
[ https://issues.apache.org/jira/browse/SPARK-32121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32121: Assignee: (was: Apache Spark) > ExternalShuffleBlockResolverSuite failed on Windows > --- > > Key: SPARK-32121 > URL: https://issues.apache.org/jira/browse/SPARK-32121 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0, 3.0.1 > Environment: Windows 10 >Reporter: Cheng Pan >Priority: Minor > > The method {code}ExecutorDiskUtils.createNormalizedInternedPathname{code} > should consider the Windows file separator. > {code} > [ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.132 > s <<< FAILURE! - in > org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite > [ERROR] > testNormalizeAndInternPathname(org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite) > Time elapsed: 0 s <<< FAILURE! > org.junit.ComparisonFailure: expected: but > was: > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite.assertPathsMatch(ExternalShuffleBlockResolverSuite.java:160) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite.testNormalizeAndInternPathname(ExternalShuffleBlockResolverSuite.java:149) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32121) ExternalShuffleBlockResolverSuite failed on Windows
[ https://issues.apache.org/jira/browse/SPARK-32121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32121: Assignee: Apache Spark > ExternalShuffleBlockResolverSuite failed on Windows > --- > > Key: SPARK-32121 > URL: https://issues.apache.org/jira/browse/SPARK-32121 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0, 3.0.1 > Environment: Windows 10 >Reporter: Cheng Pan >Assignee: Apache Spark >Priority: Minor > > The method {code}ExecutorDiskUtils.createNormalizedInternedPathname{code} > should consider the Windows file separator. > {code} > [ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.132 > s <<< FAILURE! - in > org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite > [ERROR] > testNormalizeAndInternPathname(org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite) > Time elapsed: 0 s <<< FAILURE! > org.junit.ComparisonFailure: expected: but > was: > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite.assertPathsMatch(ExternalShuffleBlockResolverSuite.java:160) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite.testNormalizeAndInternPathname(ExternalShuffleBlockResolverSuite.java:149) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25341) Support rolling back a shuffle map stage and re-generate the shuffle files
[ https://issues.apache.org/jira/browse/SPARK-25341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147434#comment-17147434 ] Apache Spark commented on SPARK-25341: -- User 'warrenzhu25' has created a pull request for this issue: https://github.com/apache/spark/pull/28941 > Support rolling back a shuffle map stage and re-generate the shuffle files > -- > > Key: SPARK-25341 > URL: https://issues.apache.org/jira/browse/SPARK-25341 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Yuanjian Li >Priority: Major > Fix For: 3.0.0 > > > This is a follow up of https://issues.apache.org/jira/browse/SPARK-23243 > To completely fix that problem, Spark needs to be able to rollback a shuffle > map stage and rerun all the map tasks. > According to https://github.com/apache/spark/pull/9214 , Spark doesn't > support it currently, as in shuffle writing "first write wins". > Since overwriting shuffle files is hard, we can extend the shuffle id to > include a "shuffle generation number". Then the reduce task can specify which > generation of shuffle it wants to read. > https://github.com/apache/spark/pull/6648 seems in the right direction. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25341) Support rolling back a shuffle map stage and re-generate the shuffle files
[ https://issues.apache.org/jira/browse/SPARK-25341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147435#comment-17147435 ] Apache Spark commented on SPARK-25341: -- User 'warrenzhu25' has created a pull request for this issue: https://github.com/apache/spark/pull/28941 > Support rolling back a shuffle map stage and re-generate the shuffle files > -- > > Key: SPARK-25341 > URL: https://issues.apache.org/jira/browse/SPARK-25341 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Yuanjian Li >Priority: Major > Fix For: 3.0.0 > > > This is a follow up of https://issues.apache.org/jira/browse/SPARK-23243 > To completely fix that problem, Spark needs to be able to rollback a shuffle > map stage and rerun all the map tasks. > According to https://github.com/apache/spark/pull/9214 , Spark doesn't > support it currently, as in shuffle writing "first write wins". > Since overwriting shuffle files is hard, we can extend the shuffle id to > include a "shuffle generation number". Then the reduce task can specify which > generation of shuffle it wants to read. > https://github.com/apache/spark/pull/6648 seems in the right direction. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
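[Editor's note] The "shuffle generation number" idea described in the ticket can be sketched abstractly. This is a conceptual model only, with all names hypothetical; it is not Spark's map-output tracker:

```python
# Conceptual sketch: key map outputs by (shuffle_id, generation), so a
# rollback bumps the generation and rerun tasks write fresh outputs instead
# of racing against "first write wins" on a single set of shuffle files.
class ShuffleRegistry:
    def __init__(self):
        self.generation = {}  # shuffle_id -> current generation
        self.outputs = {}     # (shuffle_id, generation) -> {map_id: status}

    def register(self, shuffle_id, map_id, status):
        gen = self.generation.setdefault(shuffle_id, 0)
        self.outputs.setdefault((shuffle_id, gen), {})[map_id] = status

    def rollback(self, shuffle_id):
        # Invalidate the current generation; reruns write to the next one,
        # and reducers request the generation they were scheduled against.
        self.generation[shuffle_id] = self.generation.get(shuffle_id, 0) + 1

    def fetch(self, shuffle_id, generation):
        return self.outputs.get((shuffle_id, generation), {})

reg = ShuffleRegistry()
reg.register(0, 0, "file-a")
reg.rollback(0)
reg.register(0, 0, "file-a-regenerated")
print(reg.fetch(0, 0), reg.fetch(0, 1))
```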
[jira] [Created] (SPARK-32122) Exception while writing dataframe with enum fields
Sai kiran Krishna murthy created SPARK-32122: Summary: Exception while writing dataframe with enum fields Key: SPARK-32122 URL: https://issues.apache.org/jira/browse/SPARK-32122 Project: Spark Issue Type: Question Components: SQL Affects Versions: 2.4.3 Reporter: Sai kiran Krishna murthy I have an avro schema with one field which is an enum and I am trying to enforce this schema when I am writing my dataframe, the code looks something like this {code:java} case class Name1(id:String,count:Int,val_type:String) val schema = """{ | "type" : "record", | "name" : "name1", | "namespace" : "com.data", | "fields" : [ | { |"name" : "id", |"type" : "string" | }, | { |"name" : "count", |"type" : "int" | }, | { |"name" : "val_type", |"type" : { | "type" : "enum", | "name" : "ValType", | "symbols" : [ "s1", "s2" ] |} | } | ] |}""".stripMargin val df = Seq( Name1("1",2,"s1"), Name1("1",3,"s2"), Name1("1",4,"s2"), Name1("11",2,"s1")).toDF() df.write.format("avro").option("avroSchema",schema).save("data/tes2/") {code} This code fails with the following exception, {noformat} 2020-06-28 23:28:10 ERROR Utils:91 - Aborting task org.apache.avro.AvroRuntimeException: Not a union: "string" at org.apache.avro.Schema.getTypes(Schema.java:299) at org.apache.spark.sql.avro.AvroSerializer.org$apache$spark$sql$avro$AvroSerializer$$resolveNullableType(AvroSerializer.scala:229) at org.apache.spark.sql.avro.AvroSerializer$$anonfun$3.apply(AvroSerializer.scala:209) at org.apache.spark.sql.avro.AvroSerializer$$anonfun$3.apply(AvroSerializer.scala:208) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:392) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:296) at org.apache.spark.sql.avro.AvroSerializer.newStructConverter(AvroSerializer.scala:208) at 
org.apache.spark.sql.avro.AvroSerializer.<init>(AvroSerializer.scala:51) at org.apache.spark.sql.avro.AvroOutputWriter.serializer$lzycompute(AvroOutputWriter.scala:42) at org.apache.spark.sql.avro.AvroOutputWriter.serializer(AvroOutputWriter.scala:42) at org.apache.spark.sql.avro.AvroOutputWriter.write(AvroOutputWriter.scala:64) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:137) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:245) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:121) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 2020-06-28 23:28:10 ERROR Utils:91 - Aborting task{noformat} I understand this is because the type of val_type is `String` in the case class. 
Can you please advise how I can solve this problem without having to change the underlying avro schema? Thanks!
[jira] [Updated] (SPARK-32123) [Python] Setting `spark.sql.session.timeZone` only partially respected
[ https://issues.apache.org/jira/browse/SPARK-32123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Toby Harradine updated SPARK-32123: --- Affects Version/s: (was: 2.3.1) 3.0.0 > [Python] Setting `spark.sql.session.timeZone` only partially respected > -- > > Key: SPARK-32123 > URL: https://issues.apache.org/jira/browse/SPARK-32123 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Toby Harradine >Priority: Major > Labels: bulk-closed > > The setting `spark.sql.session.timeZone` is respected by PySpark when > converting from and to Pandas, as described > [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics]. > However, when timestamps are converted directly to Python's `datetime` > objects, it's ignored and the system's timezone is used. > This can be checked by the following code snippet > {code:java} > import pyspark.sql > spark = (pyspark > .sql > .SparkSession > .builder > .master('local[1]') > .config("spark.sql.session.timeZone", "UTC") > .getOrCreate() > ) > df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"]) > df = df.withColumn("ts", df["ts"].astype("timestamp")) > print(df.toPandas().iloc[0,0]) > print(df.collect()[0][0]) > {code} > Which for me prints (the exact result depends on the timezone of your system, > mine is Europe/Berlin) > {code:java} > 2018-06-01 01:00:00 > 2018-06-01 03:00:00 > {code} > Hence, the method `toPandas` respected the timezone setting (UTC), but the > method `collect` ignored it and converted the timestamp to my system's > timezone. > The cause for this behaviour is that the methods `toInternal` and > `fromInternal` of PySpark's `TimestampType` class don't take into account the > setting `spark.sql.session.timeZone` and use the system timezone. > If the maintainers agree that this should be fixed, I would try to come up > with a patch. 
> > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32123) [Python] Setting `spark.sql.session.timeZone` only partially respected
Toby Harradine created SPARK-32123: -- Summary: [Python] Setting `spark.sql.session.timeZone` only partially respected Key: SPARK-32123 URL: https://issues.apache.org/jira/browse/SPARK-32123 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.3.1 Reporter: Toby Harradine The setting `spark.sql.session.timeZone` is respected by PySpark when converting from and to Pandas, as described [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics]. However, when timestamps are converted directly to Python's `datetime` objects, it is ignored and the system's timezone is used. This can be checked with the following code snippet {code:java} import pyspark.sql spark = (pyspark .sql .SparkSession .builder .master('local[1]') .config("spark.sql.session.timeZone", "UTC") .getOrCreate() ) df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"]) df = df.withColumn("ts", df["ts"].astype("timestamp")) print(df.toPandas().iloc[0,0]) print(df.collect()[0][0]) {code} For me this prints (the exact result depends on your system's timezone; mine is Europe/Berlin) {code:java} 2018-06-01 01:00:00 2018-06-01 03:00:00 {code} Hence, the method `toPandas` respected the timezone setting (UTC), but the method `collect` ignored it and converted the timestamp to my system's timezone. The cause for this behaviour is that the methods `toInternal` and `fromInternal` of PySpark's `TimestampType` class don't take into account the setting `spark.sql.session.timeZone` and use the system timezone. If the maintainers agree that this should be fixed, I would try to come up with a patch. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
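The report pins the cause on `TimestampType.toInternal`/`fromInternal` falling back to the system timezone. The sketch below is a minimal, PySpark-free illustration of what a session-timezone-aware `fromInternal` could look like; the function name mirrors PySpark's, but this is not Spark's actual implementation. It relies only on the fact that Spark stores timestamps internally as microseconds since the Unix epoch.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def from_internal(ts_micros, session_tz=None):
    """Convert internal microseconds-since-epoch to a naive datetime.

    With session_tz set (e.g. "UTC"), the wall-clock fields reflect that
    zone; with None, Python falls back to the system timezone, which is
    the behaviour the report describes.
    """
    dt = datetime.fromtimestamp(ts_micros / 1_000_000, tz=timezone.utc)
    if session_tz is not None:
        dt = dt.astimezone(ZoneInfo(session_tz))
    return dt.replace(tzinfo=None)  # Spark returns naive datetimes


# 2018-06-01 01:00:00 UTC, expressed as microseconds since the epoch
micros = int(datetime(2018, 6, 1, 1, tzinfo=timezone.utc).timestamp() * 1_000_000)
print(from_internal(micros, "UTC"))            # 2018-06-01 01:00:00
print(from_internal(micros, "Europe/Berlin"))  # 2018-06-01 03:00:00
```

With the session zone honoured, `collect` would agree with `toPandas` regardless of where the driver runs.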
[jira] [Updated] (SPARK-32123) [Python] Setting `spark.sql.session.timeZone` only partially respected
[ https://issues.apache.org/jira/browse/SPARK-32123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Toby Harradine updated SPARK-32123: --- Description: Reopening SPARK-25244 as it is unresolved as of versions 2.4.6 and 3.0.0. The setting `spark.sql.session.timeZone` is respected by PySpark when converting from and to Pandas, as described [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics]. However, when timestamps are converted directly to Pythons `datetime` objects, its ignored and the systems timezone is used. This can be checked by the following code snippet {code:java} import pyspark.sql spark = (pyspark .sql .SparkSession .builder .master('local[1]') .config("spark.sql.session.timeZone", "UTC") .getOrCreate() ) df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"]) df = df.withColumn("ts", df["ts"].astype("timestamp")) print(df.toPandas().iloc[0,0]) print(df.collect()[0][0]) {code} Which for me prints (the exact result depends on the timezone of your system, mine is Europe/Berlin) {code:java} 2018-06-01 01:00:00 2018-06-01 03:00:00 {code} Hence, the method `toPandas` respected the timezone setting (UTC), but the method `collect` ignored it and converted the timestamp to my systems timezone. The cause for this behaviour is that the methods `toInternal` and `fromInternal` of PySparks `TimestampType` class don't take into account the setting `spark.sql.session.timeZone` and use the system timezone. If the maintainers agree that this should be fixed, I would try to come up with a patch. was: The setting `spark.sql.session.timeZone` is respected by PySpark when converting from and to Pandas, as described [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics]. However, when timestamps are converted directly to Pythons `datetime` objects, its ignored and the systems timezone is used. 
This can be checked by the following code snippet {code:java} import pyspark.sql spark = (pyspark .sql .SparkSession .builder .master('local[1]') .config("spark.sql.session.timeZone", "UTC") .getOrCreate() ) df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"]) df = df.withColumn("ts", df["ts"].astype("timestamp")) print(df.toPandas().iloc[0,0]) print(df.collect()[0][0]) {code} Which for me prints (the exact result depends on the timezone of your system, mine is Europe/Berlin) {code:java} 2018-06-01 01:00:00 2018-06-01 03:00:00 {code} Hence, the method `toPandas` respected the timezone setting (UTC), but the method `collect` ignored it and converted the timestamp to my systems timezone. The cause for this behaviour is that the methods `toInternal` and `fromInternal` of PySparks `TimestampType` class don't take into account the setting `spark.sql.session.timeZone` and use the system timezone. If the maintainers agree that this should be fixed, I would try to come up with a patch. > [Python] Setting `spark.sql.session.timeZone` only partially respected > -- > > Key: SPARK-32123 > URL: https://issues.apache.org/jira/browse/SPARK-32123 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Toby Harradine >Priority: Major > > Reopening SPARK-25244 as it is unresolved as of versions 2.4.6 and 3.0.0. > The setting `spark.sql.session.timeZone` is respected by PySpark when > converting from and to Pandas, as described > [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics]. > However, when timestamps are converted directly to Pythons `datetime` > objects, its ignored and the systems timezone is used. 
> This can be checked by the following code snippet > {code:java} > import pyspark.sql > spark = (pyspark > .sql > .SparkSession > .builder > .master('local[1]') > .config("spark.sql.session.timeZone", "UTC") > .getOrCreate() > ) > df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"]) > df = df.withColumn("ts", df["ts"].astype("timestamp")) > print(df.toPandas().iloc[0,0]) > print(df.collect()[0][0]) > {code} > Which for me prints (the exact result depends on the timezone of your system, > mine is Europe/Berlin) > {code:java} > 2018-06-01 01:00:00 > 2018-06-01 03:00:00 > {code} > Hence, the method `toPandas` respected the timezone setting (UTC), but the > method `collect` ignored it and converted the timestamp to my systems > timezone. > The cause for this behaviour is that the methods `toInternal` and > `fromInternal` of PySparks `TimestampType` class don't take int
[jira] [Updated] (SPARK-32123) [Python] Setting `spark.sql.session.timeZone` only partially respected
[ https://issues.apache.org/jira/browse/SPARK-32123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Toby Harradine updated SPARK-32123: --- Labels: (was: bulk-closed) > [Python] Setting `spark.sql.session.timeZone` only partially respected > -- > > Key: SPARK-32123 > URL: https://issues.apache.org/jira/browse/SPARK-32123 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Toby Harradine >Priority: Major > > The setting `spark.sql.session.timeZone` is respected by PySpark when > converting from and to Pandas, as described > [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics]. > However, when timestamps are converted directly to Pythons `datetime` > objects, its ignored and the systems timezone is used. > This can be checked by the following code snippet > {code:java} > import pyspark.sql > spark = (pyspark > .sql > .SparkSession > .builder > .master('local[1]') > .config("spark.sql.session.timeZone", "UTC") > .getOrCreate() > ) > df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"]) > df = df.withColumn("ts", df["ts"].astype("timestamp")) > print(df.toPandas().iloc[0,0]) > print(df.collect()[0][0]) > {code} > Which for me prints (the exact result depends on the timezone of your system, > mine is Europe/Berlin) > {code:java} > 2018-06-01 01:00:00 > 2018-06-01 03:00:00 > {code} > Hence, the method `toPandas` respected the timezone setting (UTC), but the > method `collect` ignored it and converted the timestamp to my systems > timezone. > The cause for this behaviour is that the methods `toInternal` and > `fromInternal` of PySparks `TimestampType` class don't take into account the > setting `spark.sql.session.timeZone` and use the system timezone. > If the maintainers agree that this should be fixed, I would try to come up > with a patch. 
> > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32123) [Python] Setting `spark.sql.session.timeZone` only partially respected
[ https://issues.apache.org/jira/browse/SPARK-32123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Toby Harradine updated SPARK-32123: --- Description: Reopening SPARK-25244 as it is unresolved as of versions 2.4.6 and 3.0.0. The setting `spark.sql.session.timeZone` is respected by PySpark when converting from and to Pandas, as described [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics]. However, when timestamps are converted directly to Pythons `datetime` objects, its ignored and the systems timezone is used. This can be checked by the following code snippet {code:java} import pyspark.sql spark = (pyspark .sql .SparkSession .builder .master('local[1]') .config("spark.sql.session.timeZone", "UTC") .getOrCreate() ) df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"]) df = df.withColumn("ts", df["ts"].astype("timestamp")) print(df.toPandas().iloc[0,0]) print(df.collect()[0][0]) {code} Which for me prints (the exact result depends on the timezone of your system, mine is Europe/Berlin) {code:java} 2018-06-01 01:00:00 2018-06-01 03:00:00 {code} Hence, the method `toPandas` respected the timezone setting (UTC), but the method `collect` ignored it and converted the timestamp to my systems timezone. The cause for this behaviour is that the methods `toInternal` and `fromInternal` of PySparks `TimestampType` class don't take into account the setting `spark.sql.session.timeZone` and use the system timezone. was: Reopening SPARK-25244 as it is unresolved as of versions 2.4.6 and 3.0.0. The setting `spark.sql.session.timeZone` is respected by PySpark when converting from and to Pandas, as described [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics]. However, when timestamps are converted directly to Pythons `datetime` objects, its ignored and the systems timezone is used. 
This can be checked by the following code snippet {code:java} import pyspark.sql spark = (pyspark .sql .SparkSession .builder .master('local[1]') .config("spark.sql.session.timeZone", "UTC") .getOrCreate() ) df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"]) df = df.withColumn("ts", df["ts"].astype("timestamp")) print(df.toPandas().iloc[0,0]) print(df.collect()[0][0]) {code} Which for me prints (the exact result depends on the timezone of your system, mine is Europe/Berlin) {code:java} 2018-06-01 01:00:00 2018-06-01 03:00:00 {code} Hence, the method `toPandas` respected the timezone setting (UTC), but the method `collect` ignored it and converted the timestamp to my systems timezone. The cause for this behaviour is that the methods `toInternal` and `fromInternal` of PySparks `TimestampType` class don't take into account the setting `spark.sql.session.timeZone` and use the system timezone. If the maintainers agree that this should be fixed, I would try to come up with a patch. > [Python] Setting `spark.sql.session.timeZone` only partially respected > -- > > Key: SPARK-32123 > URL: https://issues.apache.org/jira/browse/SPARK-32123 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Toby Harradine >Priority: Major > > Reopening SPARK-25244 as it is unresolved as of versions 2.4.6 and 3.0.0. > The setting `spark.sql.session.timeZone` is respected by PySpark when > converting from and to Pandas, as described > [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics]. > However, when timestamps are converted directly to Pythons `datetime` > objects, its ignored and the systems timezone is used. 
> This can be checked by the following code snippet > {code:java} > import pyspark.sql > spark = (pyspark > .sql > .SparkSession > .builder > .master('local[1]') > .config("spark.sql.session.timeZone", "UTC") > .getOrCreate() > ) > df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"]) > df = df.withColumn("ts", df["ts"].astype("timestamp")) > print(df.toPandas().iloc[0,0]) > print(df.collect()[0][0]) > {code} > Which for me prints (the exact result depends on the timezone of your system, > mine is Europe/Berlin) > {code:java} > 2018-06-01 01:00:00 > 2018-06-01 03:00:00 > {code} > Hence, the method `toPandas` respected the timezone setting (UTC), but the > method `collect` ignored it and converted the timestamp to my systems > timezone. > The cause for this behaviour is that the methods `toInternal` and > `fromInternal` of PySparks `TimestampType` class don't take into account the >
[jira] [Updated] (SPARK-32123) [Python] Setting `spark.sql.session.timeZone` only partially respected
[ https://issues.apache.org/jira/browse/SPARK-32123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Toby Harradine updated SPARK-32123: --- Description: Reopening SPARK-25244 as it is unresolved as of versions 2.4.6 and 3.0.0. The setting {{spark.sql.session.timeZone}} is respected by PySpark when converting from and to Pandas, as described [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics]. However, when timestamps are converted directly to Pythons {{datetime}} objects, its ignored and the systems timezone is used. This can be checked by the following code snippet {code:java} import pyspark.sql spark = (pyspark .sql .SparkSession .builder .master('local[1]') .config("spark.sql.session.timeZone", "UTC") .getOrCreate() ) df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"]) df = df.withColumn("ts", df["ts"].astype("timestamp")) print(df.toPandas().iloc[0,0]) print(df.collect()[0][0]) {code} Which for me prints (the exact result depends on the timezone of your system, mine is Europe/Berlin) {code:java} 2018-06-01 01:00:00 2018-06-01 03:00:00 {code} Hence, the method {{toPandas}} respected the timezone setting (UTC), but the method {{collect}} ignored it and converted the timestamp to my systems timezone. The cause for this behaviour is that the methods {{toInternal}} and {{fromInternal}} of PySparks {{TimestampType}} class don't take into account the setting {{spark.sql.session.timeZone}} and use the system timezone. was: Reopening SPARK-25244 as it is unresolved as of versions 2.4.6 and 3.0.0. The setting `spark.sql.session.timeZone` is respected by PySpark when converting from and to Pandas, as described [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics]. However, when timestamps are converted directly to Pythons `datetime` objects, its ignored and the systems timezone is used. 
This can be checked by the following code snippet {code:java} import pyspark.sql spark = (pyspark .sql .SparkSession .builder .master('local[1]') .config("spark.sql.session.timeZone", "UTC") .getOrCreate() ) df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"]) df = df.withColumn("ts", df["ts"].astype("timestamp")) print(df.toPandas().iloc[0,0]) print(df.collect()[0][0]) {code} Which for me prints (the exact result depends on the timezone of your system, mine is Europe/Berlin) {code:java} 2018-06-01 01:00:00 2018-06-01 03:00:00 {code} Hence, the method `toPandas` respected the timezone setting (UTC), but the method `collect` ignored it and converted the timestamp to my systems timezone. The cause for this behaviour is that the methods `toInternal` and `fromInternal` of PySparks `TimestampType` class don't take into account the setting `spark.sql.session.timeZone` and use the system timezone. > [Python] Setting `spark.sql.session.timeZone` only partially respected > -- > > Key: SPARK-32123 > URL: https://issues.apache.org/jira/browse/SPARK-32123 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Toby Harradine >Priority: Major > > Reopening SPARK-25244 as it is unresolved as of versions 2.4.6 and 3.0.0. > The setting {{spark.sql.session.timeZone}} is respected by PySpark when > converting from and to Pandas, as described > [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics]. > However, when timestamps are converted directly to Pythons {{datetime}} > objects, its ignored and the systems timezone is used. 
> This can be checked by the following code snippet > {code:java} > import pyspark.sql > spark = (pyspark > .sql > .SparkSession > .builder > .master('local[1]') > .config("spark.sql.session.timeZone", "UTC") > .getOrCreate() > ) > df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"]) > df = df.withColumn("ts", df["ts"].astype("timestamp")) > print(df.toPandas().iloc[0,0]) > print(df.collect()[0][0]) > {code} > Which for me prints (the exact result depends on the timezone of your system, > mine is Europe/Berlin) > {code:java} > 2018-06-01 01:00:00 > 2018-06-01 03:00:00 > {code} > Hence, the method {{toPandas}} respected the timezone setting (UTC), but the > method {{collect}} ignored it and converted the timestamp to my systems > timezone. > The cause for this behaviour is that the methods {{toInternal}} and > {{fromInternal}} of PySparks {{TimestampType}} class don't take into account > the setting {{spark.sql.session.timeZone}} and use the system tim
[jira] [Commented] (SPARK-25244) [Python] Setting `spark.sql.session.timeZone` only partially respected
[ https://issues.apache.org/jira/browse/SPARK-25244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147464#comment-17147464 ] Toby Harradine commented on SPARK-25244: Thanks for letting me know. I've just created SPARK-32123 which marks affected version as 3.0.0. > [Python] Setting `spark.sql.session.timeZone` only partially respected > -- > > Key: SPARK-25244 > URL: https://issues.apache.org/jira/browse/SPARK-25244 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Anton Daitche >Priority: Major > Labels: bulk-closed > > The setting `spark.sql.session.timeZone` is respected by PySpark when > converting from and to Pandas, as described > [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics]. > However, when timestamps are converted directly to Pythons `datetime` > objects, its ignored and the systems timezone is used. > This can be checked by the following code snippet > {code:java} > import pyspark.sql > spark = (pyspark > .sql > .SparkSession > .builder > .master('local[1]') > .config("spark.sql.session.timeZone", "UTC") > .getOrCreate() > ) > df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"]) > df = df.withColumn("ts", df["ts"].astype("timestamp")) > print(df.toPandas().iloc[0,0]) > print(df.collect()[0][0]) > {code} > Which for me prints (the exact result depends on the timezone of your system, > mine is Europe/Berlin) > {code:java} > 2018-06-01 01:00:00 > 2018-06-01 03:00:00 > {code} > Hence, the method `toPandas` respected the timezone setting (UTC), but the > method `collect` ignored it and converted the timestamp to my systems > timezone. > The cause for this behaviour is that the methods `toInternal` and > `fromInternal` of PySparks `TimestampType` class don't take into account the > setting `spark.sql.session.timeZone` and use the system timezone. 
> If the maintainers agree that this should be fixed, I would try to come up > with a patch. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25244) [Python] Setting `spark.sql.session.timeZone` only partially respected
[ https://issues.apache.org/jira/browse/SPARK-25244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147464#comment-17147464 ] Toby Harradine edited comment on SPARK-25244 at 6/28/20, 10:44 PM: --- Thanks for letting me know. I've just created SPARK-32123 which marks affected version as 3.0.0. Reproduction steps and analysis is the same as it is here. was (Author: toby.harradine): Thanks for letting me know. I've just created SPARK-32123 which marks affected version as 3.0.0. > [Python] Setting `spark.sql.session.timeZone` only partially respected > -- > > Key: SPARK-25244 > URL: https://issues.apache.org/jira/browse/SPARK-25244 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Anton Daitche >Priority: Major > Labels: bulk-closed > > The setting `spark.sql.session.timeZone` is respected by PySpark when > converting from and to Pandas, as described > [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics]. > However, when timestamps are converted directly to Pythons `datetime` > objects, its ignored and the systems timezone is used. > This can be checked by the following code snippet > {code:java} > import pyspark.sql > spark = (pyspark > .sql > .SparkSession > .builder > .master('local[1]') > .config("spark.sql.session.timeZone", "UTC") > .getOrCreate() > ) > df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"]) > df = df.withColumn("ts", df["ts"].astype("timestamp")) > print(df.toPandas().iloc[0,0]) > print(df.collect()[0][0]) > {code} > Which for me prints (the exact result depends on the timezone of your system, > mine is Europe/Berlin) > {code:java} > 2018-06-01 01:00:00 > 2018-06-01 03:00:00 > {code} > Hence, the method `toPandas` respected the timezone setting (UTC), but the > method `collect` ignored it and converted the timestamp to my systems > timezone. 
> The cause for this behaviour is that the methods `toInternal` and > `fromInternal` of PySparks `TimestampType` class don't take into account the > setting `spark.sql.session.timeZone` and use the system timezone. > If the maintainers agree that this should be fixed, I would try to come up > with a patch. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32124) [SHS] Failed to parse FetchFailed TaskEndReason from event log produced by Spark 2.4
Zhongwei Zhu created SPARK-32124: Summary: [SHS] Failed to parse FetchFailed TaskEndReason from event log produced by Spark 2.4 Key: SPARK-32124 URL: https://issues.apache.org/jira/browse/SPARK-32124 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Zhongwei Zhu When reading an event log produced by Spark 2.4.4, parsing TaskEndReason failed due to the missing field "Map Index". -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
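The fix this report implies is backward-compatible parsing: treat "Map Index" as optional when replaying 2.4-era logs, since that field was only added in 3.0. A hedged sketch of the idea (the parser and the sibling field names are illustrative, not Spark's exact JsonProtocol code):

```python
import json


def parse_fetch_failed(event_json):
    """Parse a FetchFailed-style JSON fragment, tolerating old logs."""
    data = json.loads(event_json)
    return {
        "map_id": data["Map ID"],
        # Spark 2.4 logs predate "Map Index"; default instead of raising
        "map_index": data.get("Map Index", -1),
        "reduce_id": data["Reduce ID"],
    }


legacy = '{"Map ID": 3, "Reduce ID": 7}'                    # 2.4-style log
current = '{"Map ID": 3, "Map Index": 1, "Reduce ID": 7}'   # 3.0-style log
print(parse_fetch_failed(legacy)["map_index"])   # -1
print(parse_fetch_failed(current)["map_index"])  # 1
```

Using `dict.get` with a sentinel default is the standard way to keep a reader forward-readable against logs written by an older schema.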
[jira] [Commented] (SPARK-32124) [SHS] Failed to parse FetchFailed TaskEndReason from event log produced by Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-32124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147465#comment-17147465 ] Apache Spark commented on SPARK-32124: -- User 'warrenzhu25' has created a pull request for this issue: https://github.com/apache/spark/pull/28941 > [SHS] Failed to parse FetchFailed TaskEndReason from event log produced by > Spark 2.4 > > > Key: SPARK-32124 > URL: https://issues.apache.org/jira/browse/SPARK-32124 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Zhongwei Zhu >Priority: Minor > > When reading an event log produced by Spark 2.4.4, parsing TaskEndReason failed > due to the missing field "Map Index". > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32124) [SHS] Failed to parse FetchFailed TaskEndReason from event log produced by Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-32124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32124: Assignee: (was: Apache Spark) > [SHS] Failed to parse FetchFailed TaskEndReason from event log produced by > Spark 2.4 > > > Key: SPARK-32124 > URL: https://issues.apache.org/jira/browse/SPARK-32124 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Zhongwei Zhu >Priority: Minor > > When reading an event log produced by Spark 2.4.4, parsing TaskEndReason failed > due to the missing field "Map Index". > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32124) [SHS] Failed to parse FetchFailed TaskEndReason from event log produced by Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-32124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32124: Assignee: Apache Spark > [SHS] Failed to parse FetchFailed TaskEndReason from event log produced by > Spark 2.4 > > > Key: SPARK-32124 > URL: https://issues.apache.org/jira/browse/SPARK-32124 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Zhongwei Zhu >Assignee: Apache Spark >Priority: Minor > > When reading an event log produced by Spark 2.4.4, parsing TaskEndReason failed > due to the missing field "Map Index". > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32125) [UI] Support get taskList by status in Web UI and SHS Rest API
Zhongwei Zhu created SPARK-32125: Summary: [UI] Support get taskList by status in Web UI and SHS Rest API Key: SPARK-32125 URL: https://issues.apache.org/jira/browse/SPARK-32125 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 3.0.0 Reporter: Zhongwei Zhu Support fetching taskList by status as below: /applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskList?status=failed -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
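The proposed `.../taskList?status=failed` query amounts to a server-side filter over a stage attempt's tasks. A small illustrative sketch of that filter, independent of Spark's actual REST handler (the endpoint path comes from the report; the task records below are made up):

```python
def filter_tasks(tasks, status=None):
    """Return tasks matching the requested status, case-insensitively.

    With status=None every task is returned, mirroring the existing
    unfiltered taskList behaviour.
    """
    if status is None:
        return tasks
    return [t for t in tasks if t["status"].lower() == status.lower()]


tasks = [
    {"taskId": 0, "status": "SUCCESS"},
    {"taskId": 1, "status": "FAILED"},
    {"taskId": 2, "status": "FAILED"},
]
print([t["taskId"] for t in filter_tasks(tasks, "failed")])  # [1, 2]
```

Filtering on the server keeps the response small for large stages, instead of shipping every task to the client and filtering there.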
[jira] [Commented] (SPARK-32125) [UI] Support get taskList by status in Web UI and SHS Rest API
[ https://issues.apache.org/jira/browse/SPARK-32125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147469#comment-17147469 ] Apache Spark commented on SPARK-32125: -- User 'warrenzhu25' has created a pull request for this issue: https://github.com/apache/spark/pull/28942 > [UI] Support get taskList by status in Web UI and SHS Rest API > -- > > Key: SPARK-32125 > URL: https://issues.apache.org/jira/browse/SPARK-32125 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Zhongwei Zhu >Priority: Minor > > Support fetching taskList by status as below: > /applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskList?status=failed -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32125) [UI] Support get taskList by status in Web UI and SHS Rest API
[ https://issues.apache.org/jira/browse/SPARK-32125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32125: Assignee: (was: Apache Spark) > [UI] Support get taskList by status in Web UI and SHS Rest API > -- > > Key: SPARK-32125 > URL: https://issues.apache.org/jira/browse/SPARK-32125 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Zhongwei Zhu >Priority: Minor > > Support fetching taskList by status as below: > /applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskList?status=failed -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32125) [UI] Support get taskList by status in Web UI and SHS Rest API
[ https://issues.apache.org/jira/browse/SPARK-32125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32125: Assignee: Apache Spark > [UI] Support get taskList by status in Web UI and SHS Rest API > -- > > Key: SPARK-32125 > URL: https://issues.apache.org/jira/browse/SPARK-32125 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Zhongwei Zhu >Assignee: Apache Spark >Priority: Minor > > Support fetching taskList by status as below: > /applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskList?status=failed -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32096) Support top-N sort for Spark SQL rank window function
[ https://issues.apache.org/jira/browse/SPARK-32096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147504#comment-17147504 ] L. C. Hsieh commented on SPARK-32096: - Does a filter of the window rank (e.g. rank <= 100) mean top-100 sort? Such a filter keeps the rows with rank <= 100 for each window partition. Each physical partition could contain many window partitions, and the filter predicate needs to be applied to each window partition. > Support top-N sort for Spark SQL rank window function > - > > Key: SPARK-32096 > URL: https://issues.apache.org/jira/browse/SPARK-32096 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 > Environment: Any environment that supports Spark. >Reporter: Zikun >Priority: Major > > In Spark SQL, there are two types of sort execution, *_SortExec_* and > *_TakeOrderedAndProjectExec_* . > *_SortExec_* is a general sorting execution and it does not support top-N > sort. > *_TakeOrderedAndProjectExec_* is the execution for top-N sort in Spark. > Spark SQL rank window function needs to sort the data locally and it relies > on the execution plan *_SortExec_* to sort the data in each physical data > partition. When the filter of the window rank (e.g. rank <= 100) is specified > in a user's query, the filter can actually be pushed down to the SortExec, and > then SortExec can perform a top-N sort. > Right now SortExec does not support top-N sort and we need to extend the > capability of SortExec to support top-N sort. > Or if SortExec is not considered the right execution choice, we can create > a new execution plan called topNSortExec to do top-N sort in each local > partition if a filter on the window rank is specified. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
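The gain the proposal is after can be seen with a plain-Python analogue: a top-N sort keeps only the N smallest keys with a bounded heap instead of fully sorting a partition and then filtering `rank <= N`. This is only a sketch of the technique, not Spark's SortExec; the sample rows and sort key are made up.

```python
import heapq


def top_n(rows, n, key):
    # heapq.nsmallest is a bounded-heap selection: roughly
    # O(len(rows) * log n) versus O(len(rows) * log len(rows))
    # for a full sort followed by a filter.
    return heapq.nsmallest(n, rows, key=key)


rows = [("a", 5), ("b", 1), ("c", 3), ("d", 2)]
print(top_n(rows, 2, key=lambda r: r[1]))  # [('b', 1), ('d', 2)]
```

As the comment notes, in Spark this selection would have to be applied per window partition, not once per physical partition, since one physical partition can hold many window partitions.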
[jira] [Assigned] (SPARK-32124) [SHS] Failed to parse FetchFailed TaskEndReason from event log produced by Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-32124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-32124: - Assignee: Zhongwei Zhu
> [SHS] Failed to parse FetchFailed TaskEndReason from event log produced by Spark 2.4
>
> Key: SPARK-32124
> URL: https://issues.apache.org/jira/browse/SPARK-32124
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Zhongwei Zhu
> Assignee: Zhongwei Zhu
> Priority: Minor
>
> When reading an event log produced by Spark 2.4.4, parsing the TaskEndReason fails due to the missing field "Map Index".
[jira] [Resolved] (SPARK-32124) [SHS] Failed to parse FetchFailed TaskEndReason from event log produced by Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-32124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-32124. --- Fix Version/s: 3.1.0, 3.0.1 Resolution: Fixed Issue resolved by pull request 28941 [https://github.com/apache/spark/pull/28941]
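The SPARK-32124 failure is a backward-compatibility pattern: a newer reader requires a field ("Map Index") that older writers never emitted, so a strict lookup throws. A minimal Python sketch of the tolerant-parsing fix is below; the field names mirror Spark's event-log JSON, but the exact schema and the `-1` sentinel are illustrative assumptions, not the actual patch.

```python
import json

def parse_fetch_failed(event_json):
    """Parse a FetchFailed TaskEndReason fragment, defaulting "Map Index"
    when the event log was written by an older writer (e.g. Spark 2.4)
    that never emitted that field."""
    d = json.loads(event_json)
    return {
        "map_id": d["Map ID"],
        # A strict d["Map Index"] lookup raises KeyError on 2.4 logs;
        # fall back to a sentinel (assumed here to be -1) instead.
        "map_index": d.get("Map Index", -1),
        "reduce_id": d["Reduce ID"],
    }

old_log = '{"Map ID": 3, "Reduce ID": 7}'   # shape of a pre-3.0 event
parsed = parse_fetch_failed(old_log)
```

The same `dict.get`-with-default shape applies to any field added to a log format after old logs already exist.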
[jira] [Commented] (SPARK-32096) Support top-N sort for Spark SQL rank window function
[ https://issues.apache.org/jira/browse/SPARK-32096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147538#comment-17147538 ] Zikun commented on SPARK-32096: --- [~viirya] Yes, a filter of window rank <= 100 means a top-100 sort. And yes again for your second statement. The filter predicate needs to be applied on each window partition.
[jira] [Created] (SPARK-32126) Scope Session.active in IncrementalExecution
Dongjoon Hyun created SPARK-32126: - Summary: Scope Session.active in IncrementalExecution Key: SPARK-32126 URL: https://issues.apache.org/jira/browse/SPARK-32126 Project: Spark Issue Type: Bug Components: SQL, Structured Streaming Affects Versions: 3.0.0 Reporter: Dongjoon Hyun
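The ticket has no description, but its title suggests a familiar pattern: make the "active session" valid only for the duration of the incremental (streaming) planning, restoring the previous one afterwards. A generic Python sketch of that scoping pattern follows; `ActiveSession` and its methods are hypothetical names for illustration, not Spark's API.

```python
import threading
from contextlib import contextmanager

class ActiveSession:
    """Thread-local "active session" holder (illustrative stand-in)."""
    _local = threading.local()

    @classmethod
    def get(cls):
        return getattr(cls._local, "session", None)

    @classmethod
    @contextmanager
    def scoped(cls, session):
        """Install `session` as the active one only inside this block,
        restoring whatever was active before -- so planning code that
        reads ActiveSession.get() cannot leak state across sessions."""
        previous = cls.get()
        cls._local.session = session
        try:
            yield session
        finally:
            cls._local.session = previous

with ActiveSession.scoped("streaming-session"):
    inside = ActiveSession.get()   # the scoped session
after = ActiveSession.get()        # restored on exit
```

The try/finally restore is the essential part: without it, an exception during planning would leave a stale active session behind.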
[jira] [Updated] (SPARK-32126) Scope Session.active in IncrementalExecution
[ https://issues.apache.org/jira/browse/SPARK-32126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32126: -- Component/s: (was: SQL)
[jira] [Assigned] (SPARK-32126) Scope Session.active in IncrementalExecution
[ https://issues.apache.org/jira/browse/SPARK-32126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32126: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-32126) Scope Session.active in IncrementalExecution
[ https://issues.apache.org/jira/browse/SPARK-32126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147539#comment-17147539 ] Apache Spark commented on SPARK-32126: -- User 'xuanyuanking' has created a pull request for this issue: https://github.com/apache/spark/pull/28936
[jira] [Assigned] (SPARK-32126) Scope Session.active in IncrementalExecution
[ https://issues.apache.org/jira/browse/SPARK-32126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32126: Assignee: Apache Spark
[jira] [Resolved] (SPARK-32126) Scope Session.active in IncrementalExecution
[ https://issues.apache.org/jira/browse/SPARK-32126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-32126. --- Fix Version/s: 3.1.0, 3.0.1 Resolution: Fixed Issue resolved by pull request 28936 [https://github.com/apache/spark/pull/28936]
[jira] [Commented] (SPARK-32096) Support top-N sort for Spark SQL rank window function
[ https://issues.apache.org/jira/browse/SPARK-32096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147542#comment-17147542 ] L. C. Hsieh commented on SPARK-32096: - Then I think it is not simply a top-N sort... You need to do a top-N sort for each window partition in each physical partition.
[jira] [Assigned] (SPARK-32126) Scope Session.active in IncrementalExecution
[ https://issues.apache.org/jira/browse/SPARK-32126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-32126: - Assignee: Yuanjian Li
[jira] [Commented] (SPARK-32096) Support top-N sort for Spark SQL rank window function
[ https://issues.apache.org/jira/browse/SPARK-32096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147548#comment-17147548 ] Zikun commented on SPARK-32096: --- Yes, we need to do top-N sort for each window partition in each physical partition.
[jira] [Comment Edited] (SPARK-32096) Support top-N sort for Spark SQL rank window function
[ https://issues.apache.org/jira/browse/SPARK-32096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147548#comment-17147548 ] Zikun edited comment on SPARK-32096 at 6/29/20, 4:44 AM: - Yes, we need to do top-N sort for each window partition in each physical partition. And I think this is doable. We are working on a POC of this improvement. was (Author: xuzikun2003): Yes, we need to do top-N sort for each window partition in each physical partition.
[jira] [Resolved] (SPARK-32090) UserDefinedType.equal() does not have symmetry
[ https://issues.apache.org/jira/browse/SPARK-32090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-32090. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28923 [https://github.com/apache/spark/pull/28923]
> UserDefinedType.equal() does not have symmetry
> ---
>
> Key: SPARK-32090
> URL: https://issues.apache.org/jira/browse/SPARK-32090
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.1.1, 2.2.0, 2.3.0, 2.4.0, 3.0.0
> Reporter: wuyi
> Assignee: wuyi
> Priority: Major
> Fix For: 3.1.0
>
> ExampleSubTypeUDT.userClass is a subclass of ExampleBaseTypeUDT.userClass
> val udt1 = new ExampleBaseTypeUDT
> val udt2 = new ExampleSubTypeUDT
> println(udt1 == udt2) // true
> println(udt2 == udt1) // false
[jira] [Assigned] (SPARK-32090) UserDefinedType.equal() does not have symmetry
[ https://issues.apache.org/jira/browse/SPARK-32090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-32090: - Assignee: wuyi
[jira] [Updated] (SPARK-32127) Check rules for MERGE INTO should use MergeAction.condition other than MeregAction.children
[ https://issues.apache.org/jira/browse/SPARK-32127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianyin Xin updated SPARK-32127: Description: [SPARK-30924|https://issues.apache.org/jira/browse/SPARK-30924] adds some check rules for MERGE INTO, one of which ensures that the first MATCHED clause must have a condition. However, it uses {{MergeAction.children}} in the check, which is not accurate for this case and lets the query below pass the check:
{code:scala}
MERGE INTO testcat1.ns1.ns2.tbl AS target
xxx
WHEN MATCHED THEN UPDATE SET target.col2 = source.col2
WHEN MATCHED THEN DELETE
xxx
{code}
We should use {{MergeAction.condition}} instead.
was: [SPARK-30924|https://issues.apache.org/jira/browse/SPARK-30924] adds some check rules for MERGE INTO one of which ensures the first MATCHED clause must have a condition. However, it uses {MergeAction.children} in the checking which is not accurate for the case, and it lets the below case pass the check:
{code:scala}
MERGE INTO testcat1.ns1.ns2.tbl AS target
xxx
WHEN MATCHED THEN UPDATE SET target.col2 = source.col2
WHEN MATCHED THEN DELETE
xxx
{code}
We should use {MergeAction.condition} instead.
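Why `children` is the wrong thing to test can be shown with a small model: an unconditioned UPDATE action still has child expressions (its assignments), so "has children" is not "has a condition". The sketch below uses hypothetical names modeling the description; Spark's actual rule lives in Scala.

```python
class MergeAction:
    """Toy model: a MATCHED action with an optional condition and child
    expressions (e.g. an UPDATE's assignments)."""
    def __init__(self, condition=None, assignments=()):
        self.condition = condition
        self.children = ([condition] if condition is not None else []) \
            + list(assignments)

def check_matched_conditions(actions):
    """Only the last MATCHED clause may omit its condition."""
    for action in actions[:-1]:
        # The buggy rule effectively tested `action.children`; an
        # unconditioned UPDATE has assignment children, so it passed.
        if action.condition is None:
            raise ValueError("non-final MATCHED clause needs a condition")

update = MergeAction(condition=None,
                     assignments=["target.col2 = source.col2"])
delete = MergeAction(condition=None)
# `children` is non-empty even though there is no condition:
assert update.children and update.condition is None
try:
    check_matched_conditions([update, delete])
    rejected = False
except ValueError:
    rejected = True   # condition-based check catches the bad query
```

Testing `condition` directly, as the ticket proposes, rejects the example MERGE statement above while still allowing a lone unconditioned MATCHED clause.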
[jira] [Created] (SPARK-32127) Check rules for MERGE INTO should use MergeAction.condition other than MeregAction.children
Xianyin Xin created SPARK-32127: --- Summary: Check rules for MERGE INTO should use MergeAction.condition other than MeregAction.children Key: SPARK-32127 URL: https://issues.apache.org/jira/browse/SPARK-32127 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Xianyin Xin
[SPARK-30924|https://issues.apache.org/jira/browse/SPARK-30924] adds some check rules for MERGE INTO one of which ensures the first MATCHED clause must have a condition. However, it uses {MergeAction.children} in the checking which is not accurate for the case, and it lets the below case pass the check:
{code:scala}
MERGE INTO testcat1.ns1.ns2.tbl AS target
xxx
WHEN MATCHED THEN UPDATE SET target.col2 = source.col2
WHEN MATCHED THEN DELETE
xxx
{code}
We should use {MergeAction.condition} instead.
[jira] [Updated] (SPARK-32127) Check rules for MERGE INTO should use MergeAction.condition other than MergeAction.children
[ https://issues.apache.org/jira/browse/SPARK-32127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianyin Xin updated SPARK-32127: Summary: Check rules for MERGE INTO should use MergeAction.condition other than MergeAction.children (was: Check rules for MERGE INTO should use MergeAction.condition other than MeregAction.children)
[jira] [Assigned] (SPARK-32127) Check rules for MERGE INTO should use MergeAction.condition other than MergeAction.children
[ https://issues.apache.org/jira/browse/SPARK-32127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32127: Assignee: (was: Apache Spark)