[jira] [Comment Edited] (SPARK-24928) spark sql cross join running time too long

2018-08-06 Thread Matthew Normyle (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568893#comment-16568893
 ] 

Matthew Normyle edited comment on SPARK-24928 at 8/6/18 5:05 PM:
-

In CartesianRDD.compute, changing:

for (x <- rdd1.iterator(currSplit.s1, context);
     y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)

to:

val it1 = rdd1.iterator(currSplit.s1, context)
val it2 = rdd2.iterator(currSplit.s2, context)

for (x <- it1; y <- it2) yield (x, y)

Edit: Seems to speed up the computation.

I am brand new to Scala and Spark. Does anyone have any insight as to why this 
seemingly superficial change could make such a large difference?
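
For reference, here is a minimal sketch with plain Scala Iterators (no Spark; makeIt1 and makeIt2 are illustrative stand-ins for the two partition iterators) showing that the two forms are not equivalent when the generators are one-shot Iterators:

// Plain Scala Iterators, no Spark; makeIt1/makeIt2 are illustrative stand-ins.
def makeIt1: Iterator[Int]    = Iterator(1, 2, 3)
def makeIt2: Iterator[String] = Iterator("a", "b")

// Inline form: desugars to makeIt1.flatMap(x => makeIt2.map(y => (x, y))),
// so the second generator expression is re-evaluated for every x and a fresh
// iterator is produced each time.
val inlinePairs = for (x <- makeIt1; y <- makeIt2) yield (x, y)
println(inlinePairs.toList)   // List((1,a), (1,b), (2,a), (2,b), (3,a), (3,b))

// Hoisted form: it2 is a single Iterator, and an Iterator can only be
// traversed once, so it is exhausted after the first x.
val it1 = makeIt1
val it2 = makeIt2
val hoistedPairs = for (x <- it1; y <- it2) yield (x, y)
println(hoistedPairs.toList)  // List((1,a), (1,b))

In CartesianRDD.compute the generators are the partition iterators of the two RDDs, so the same one-shot behaviour is at least relevant there, though whether it fully accounts for the timing difference is not confirmed here.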


was (Author: matthewnormyle):
In CartesianRDD.compute, changing:

for (x <- rdd1.iterator(currSplit.s1, context);
     y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)

to:

val it1 = rdd1.iterator(currSplit.s1, context)
val it2 = rdd2.iterator(currSplit.s2, context)

for (x <- it1; y <- it2) yield (x, y)

Seems to resolve this issue.

I am brand new to Scala and Spark. Does anyone have any insight as to why this 
seemingly superficial change could make such a large difference?

> spark sql cross join running time too long
> --
>
> Key: SPARK-24928
> URL: https://issues.apache.org/jira/browse/SPARK-24928
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
> Affects Versions: 1.6.2
> Reporter: LIFULONG
> Priority: Minor
>
> Spark SQL running time is too long even though the input left table and right 
> table are small HDFS text-format data.
> The SQL is: select * from t1 cross join t2
> t1 has 49 lines, three columns.
> t2 has 1 line, one column only.
> It runs for more than 30 minutes and then fails.
>  
> Spark's CartesianRDD also has the same problem; example test code is:
> val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b")  // 1 line, 1 column
> val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b")  // 49 lines, 3 columns
> val cartesian = new CartesianRDD(sc, twos, ones)
> cartesian.count()
> It runs for more than 5 minutes, while CartesianRDD(sc, ones, twos) takes 
> less than 10 seconds.
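
For anyone trying to reproduce the SQL part of the report, a minimal spark-shell sketch along these lines should be close (the tab delimiter and the column names are placeholders, not taken from the report):

// Hypothetical spark-shell (Spark 1.6.x) reproduction sketch; the delimiter
// and column names are placeholders, not taken from the report.
import sqlContext.implicits._   // sc and sqlContext are provided by spark-shell

// t1: small text file with three columns; t2: small text file with one column.
val t1 = sc.textFile("hdfs://host:port/data/cartesian_data/t1b")
  .map(_.split("\t"))
  .map(a => (a(0), a(1), a(2)))
  .toDF("c1", "c2", "c3")
val t2 = sc.textFile("hdfs://host:port/data/cartesian_data/t2b")
  .toDF("c4")

t1.registerTempTable("t1")
t2.registerTempTable("t2")

// The query from the report.
sqlContext.sql("select * from t1 cross join t2").count()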



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24928) spark sql cross join running time too long

2018-08-03 Thread Matthew Normyle (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568893#comment-16568893
 ] 

Matthew Normyle commented on SPARK-24928:
-

In CartesianRDD.compute, changing:

for (x <- rdd1.iterator(currSplit.s1, context);
     y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)

to:

val it1 = rdd1.iterator(currSplit.s1, context)
val it2 = rdd2.iterator(currSplit.s2, context)

for (x <- it1; y <- it2) yield (x, y)

Seems to resolve this issue.

I am brand new to Scala and Spark. Does anyone have any insight as to why this 
seemingly superficial change could make such a large difference?




[jira] [Commented] (SPARK-24928) spark sql cross join running time too long

2018-08-02 Thread Matthew Normyle (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567578#comment-16567578
 ] 

Matthew Normyle commented on SPARK-24928:
-

val largeRDD = sc.parallelize(Seq.fill(1000)(Random.nextInt))
val smallRDD = sc.parallelize(Seq.fill(1)(Random.nextInt))

*(1)* largeRDD.cartesian(smallRDD).count()

*(2)* smallRDD.cartesian(largeRDD).count()

 

Building from master, I can see that (1) consistently takes about twice as long 
as (2) on my machine.
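
For anyone who wants to reproduce the comparison, a rough wall-clock harness for spark-shell (the time helper below is just an illustrative utility, not part of Spark):

// Rough wall-clock comparison for spark-shell; `time` is an illustrative helper.
import scala.util.Random

def time[A](label: String)(body: => A): A = {
  val start = System.nanoTime()
  val result = body
  println(f"$label: ${(System.nanoTime() - start) / 1e9}%.3f s")
  result
}

val largeRDD = sc.parallelize(Seq.fill(1000)(Random.nextInt))
val smallRDD = sc.parallelize(Seq.fill(1)(Random.nextInt))

// Note: the first job in a session carries extra start-up cost, so it may be
// worth running each measurement more than once.
time("(1) large.cartesian(small)")(largeRDD.cartesian(smallRDD).count())
time("(2) small.cartesian(large)")(smallRDD.cartesian(largeRDD).count())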
