[jira] [Comment Edited] (SPARK-24928) spark sql cross join running time too long
[ https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568893#comment-16568893 ]

Matthew Normyle edited comment on SPARK-24928 at 8/6/18 5:05 PM:

In CartesianRDD.compute, changing:

    for (x <- rdd1.iterator(currSplit.s1, context);
         y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)

to:

    val it1 = rdd1.iterator(currSplit.s1, context)
    val it2 = rdd2.iterator(currSplit.s2, context)
    for (x <- it1; y <- it2) yield (x, y)

Edit: Seems to speed up the computation.

I am brand new to Scala and Spark. Does anyone have any insight as to why this seemingly superficial change could make such a large difference?
> spark sql cross join running time too long
> ------------------------------------------
>
>                 Key: SPARK-24928
>                 URL: https://issues.apache.org/jira/browse/SPARK-24928
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer
>    Affects Versions: 1.6.2
>            Reporter: LIFULONG
>            Priority: Minor
>
> spark sql running time is too long while input left table and right table is
> small hdfs text format data,
> the sql is: select * from t1 cross join t2
> the line of t1 is 49, three column
> the line of t2 is 1, one column only
> running more than 30mins and then failed
>
>
> spark CartesianRDD also has the same problem, example test code is:
> val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b") //1 line 1 column
> val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b") //49 line 3 column
> val cartesian = new CartesianRDD(sc, twos, ones)
> cartesian.count()
> running more than 5 mins,while use CartesianRDD(sc, ones, twos) , it only use
> less than 10 seconds

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
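On the question in the comment above (why hoisting the two iterators into vals changes the running time), one possible explanation can be reproduced with plain Scala iterators, without Spark. This is only a sketch under assumptions: `freshIterator`, `inlineVersion` and `hoistedVersion` are hypothetical names, and `freshIterator` merely stands in for `rdd.iterator(split, context)`. A for-comprehension with two generators desugars to `flatMap`/`map`, so in the original form the inner expression is evaluated again for every `x`. In the hoisted form a single inner iterator is shared, and it is already exhausted once the first `x` has been processed.

```scala
object IteratorReuseSketch {
  // Hypothetical stand-in for rdd.iterator(split, context):
  // every call hands back a fresh iterator over the same data.
  def freshIterator(data: Seq[Int]): Iterator[Int] = data.iterator

  // Original shape: the inner generator expression is re-evaluated for each x,
  // so every outer element is paired with a fresh pass over `right`.
  def inlineVersion(left: Seq[Int], right: Seq[Int]): List[(Int, Int)] =
    (for (x <- freshIterator(left); y <- freshIterator(right)) yield (x, y)).toList

  // Modified shape: one inner iterator is shared by all outer elements.
  // It is drained while processing the first x, so later x values find it empty.
  def hoistedVersion(left: Seq[Int], right: Seq[Int]): List[(Int, Int)] = {
    val it1 = freshIterator(left)
    val it2 = freshIterator(right)
    (for (x <- it1; y <- it2) yield (x, y)).toList
  }

  def main(args: Array[String]): Unit = {
    val left = Seq(1, 2, 3)
    val right = Seq(10, 20)
    println(inlineVersion(left, right).size)  // 6: the full cartesian product
    println(hoistedVersion(left, right).size) // 2: only the pairs for the first x
  }
}
```

If the same thing happens inside CartesianRDD.compute, the modified version may be faster because it produces fewer pairs, so it would be worth comparing the value returned by `count()` before and after the change, not only the timing.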
[jira] [Commented] (SPARK-24928) spark sql cross join running time too long
[ https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568893#comment-16568893 ]

Matthew Normyle commented on SPARK-24928:

In CartesianRDD.compute, changing:

    for (x <- rdd1.iterator(currSplit.s1, context);
         y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)

to:

    val it1 = rdd1.iterator(currSplit.s1, context)
    val it2 = rdd2.iterator(currSplit.s2, context)
    for (x <- it1; y <- it2) yield (x, y)

Seems to resolve this issue.

I am brand new to Scala and Spark. Does anyone have any insight as to why this seemingly superficial change could make such a large difference?
[jira] [Commented] (SPARK-24928) spark sql cross join running time too long
[ https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567578#comment-16567578 ]

Matthew Normyle commented on SPARK-24928:

    val largeRDD = sc.parallelize(Seq.fill(1000)(Random.nextInt))
    val smallRDD = sc.parallelize(Seq.fill(1)(Random.nextInt))

    (1) largeRDD.cartesian(smallRDD).count()
    (2) smallRDD.cartesian(largeRDD).count()

Building from master, I can see that (1) consistently takes about twice as long as (2) on my machine.
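One way to reason about the asymmetry between (1) and (2) is to count how often the inner iterator is created when the nested for-comprehension from CartesianRDD.compute is run over plain Scala collections. The sketch below does not use Spark; `cartesianWithCount` and `innerIterator` are hypothetical names, and `innerIterator()` is only an assumed stand-in for `rdd2.iterator(split, context)`, which may recompute or re-read an uncached partition each time it is called.

```scala
object InnerRecomputeSketch {
  // Returns (number of pairs produced, number of times the inner iterator was created).
  def cartesianWithCount(outer: Seq[Int], inner: Seq[Int]): (Int, Int) = {
    var innerCreations = 0

    // Assumed stand-in for rdd2.iterator(split, context): each call starts
    // a new pass over the inner data and is counted.
    def innerIterator(): Iterator[Int] = {
      innerCreations += 1
      inner.iterator
    }

    // Same shape as the loop in CartesianRDD.compute: the inner generator
    // expression is evaluated once per outer element.
    val pairs = for (x <- outer.iterator; y <- innerIterator()) yield (x, y)
    val pairCount = pairs.size // consumes the lazy iterator
    (pairCount, innerCreations)
  }

  def main(args: Array[String]): Unit = {
    val large = Seq.fill(1000)(scala.util.Random.nextInt())
    val small = Seq.fill(1)(scala.util.Random.nextInt())
    println(cartesianWithCount(large, small)) // (1000,1000): large side outermost
    println(cartesianWithCount(small, large)) // (1000,1): small side outermost
  }
}
```

Both orders produce the same number of pairs, but with the large side outermost the inner side is opened once per outer element. If opening the inner side is expensive (for example, re-reading an HDFS split as in the original report), that could account for `CartesianRDD(sc, twos, ones)` being much slower than `CartesianRDD(sc, ones, twos)`.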