[jira] [Commented] (SPARK-24928) spark sql cross join running time too long

2018-08-13 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578446#comment-16578446
 ] 

Apache Spark commented on SPARK-24928:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/22008

> spark sql cross join running time too long
> --
>
> Key: SPARK-24928
> URL: https://issues.apache.org/jira/browse/SPARK-24928
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 1.6.2
>Reporter: LIFULONG
>Priority: Minor
>
> spark sql running time is too long while input left table and right table is 
> small hdfs text format data,
> the sql is:  select * from t1 cross join t2  
> the line of t1 is 49, three column
> the line of t2 is 1, one column only
> running more than 30mins and then failed
>  
>  
> spark CartesianRDD also has the same problem, example test code is:
> val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b")  //1 line 
> 1 column
>  val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b")  //49 
> line 3 column
>  val cartesian = new CartesianRDD(sc, twos, ones)
> cartesian.count()
> running more than 5 mins,while use CartesianRDD(sc, ones, twos) , it only use 
> less than 10 seconds



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24928) spark sql cross join running time too long

2018-08-13 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578447#comment-16578447
 ] 

Marco Gaido commented on SPARK-24928:
-

Actually this is a duplicate of SPARK-11982, which solved the issue for the SQL 
API. For the RDD API, please be careful choosing the right side of the 
cartesian. I am closing this as a duplicate. Feel free to reopen if you think 
anything else can be done.

> spark sql cross join running time too long
> --
>
> Key: SPARK-24928
> URL: https://issues.apache.org/jira/browse/SPARK-24928
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 1.6.2
>Reporter: LIFULONG
>Priority: Minor
>
> spark sql running time is too long while input left table and right table is 
> small hdfs text format data,
> the sql is:  select * from t1 cross join t2  
> the line of t1 is 49, three column
> the line of t2 is 1, one column only
> running more than 30mins and then failed
>  
>  
> spark CartesianRDD also has the same problem, example test code is:
> val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b")  //1 line 
> 1 column
>  val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b")  //49 
> line 3 column
>  val cartesian = new CartesianRDD(sc, twos, ones)
> cartesian.count()
> running more than 5 mins,while use CartesianRDD(sc, ones, twos) , it only use 
> less than 10 seconds



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24928) spark sql cross join running time too long

2018-08-06 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570437#comment-16570437
 ] 

Apache Spark commented on SPARK-24928:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/22008

> spark sql cross join running time too long
> --
>
> Key: SPARK-24928
> URL: https://issues.apache.org/jira/browse/SPARK-24928
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 1.6.2
>Reporter: LIFULONG
>Priority: Minor
>
> spark sql running time is too long while input left table and right table is 
> small hdfs text format data,
> the sql is:  select * from t1 cross join t2  
> the line of t1 is 49, three column
> the line of t2 is 1, one column only
> running more than 30mins and then failed
>  
>  
> spark CartesianRDD also has the same problem, example test code is:
> val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b")  //1 line 
> 1 column
>  val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b")  //49 
> line 3 column
>  val cartesian = new CartesianRDD(sc, twos, ones)
> cartesian.count()
> running more than 5 mins,while use CartesianRDD(sc, ones, twos) , it only use 
> less than 10 seconds



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24928) spark sql cross join running time too long

2018-08-06 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570289#comment-16570289
 ] 

Marco Gaido commented on SPARK-24928:
-

[~matthewnormyle] the fix you are proposing doesn't solve the problem, but it 
returns a wrong result. The root cause of the issue here is that the for is a 
nested loop. So if the outer iterator is the small one, we build much less 
iterators than otherwise. I think that in the RDD case there is few we can do, 
while for the SQL case we can probably add an optimizer rule using the 
statistics (if they are computed).

> spark sql cross join running time too long
> --
>
> Key: SPARK-24928
> URL: https://issues.apache.org/jira/browse/SPARK-24928
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 1.6.2
>Reporter: LIFULONG
>Priority: Minor
>
> spark sql running time is too long while input left table and right table is 
> small hdfs text format data,
> the sql is:  select * from t1 cross join t2  
> the line of t1 is 49, three column
> the line of t2 is 1, one column only
> running more than 30mins and then failed
>  
>  
> spark CartesianRDD also has the same problem, example test code is:
> val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b")  //1 line 
> 1 column
>  val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b")  //49 
> line 3 column
>  val cartesian = new CartesianRDD(sc, twos, ones)
> cartesian.count()
> running more than 5 mins,while use CartesianRDD(sc, ones, twos) , it only use 
> less than 10 seconds



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24928) spark sql cross join running time too long

2018-08-03 Thread Matthew Normyle (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568893#comment-16568893
 ] 

Matthew Normyle commented on SPARK-24928:
-

In CartesianRDD.compute, changing:

{color:#cc7832}for {color}(x <- 
rdd1.iterator(currSplit.{color:#9876aa}s1{color}{color:#cc7832}, 
{color}context){color:#cc7832};
{color} y <- rdd2.iterator(currSplit.{color:#9876aa}s2{color}{color:#cc7832}, 
{color}context)) {color:#cc7832}yield {color}(x{color:#cc7832}, {color}y)

{color:#33} to:{color}

{color:#cc7832}val {color}it1 = 
rdd1.iterator(currSplit.{color:#9876aa}s1{color}{color:#cc7832}, {color}context)
{color:#cc7832}val {color}it2 = 
rdd2.iterator(currSplit.{color:#9876aa}s2{color}{color:#cc7832}, {color}context)

{color:#cc7832}for {color}(x <- it1{color:#cc7832}; {color}y <- it2) 
{color:#cc7832}yield {color}(x{color:#cc7832}, {color}y)

Seems to resolve this issue.

I am brand new to Scala and Spark. Does anyone have any insight as to why this 
seemingly superficial change could make such a large difference?

> spark sql cross join running time too long
> --
>
> Key: SPARK-24928
> URL: https://issues.apache.org/jira/browse/SPARK-24928
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 1.6.2
>Reporter: LIFULONG
>Priority: Minor
>
> spark sql running time is too long while input left table and right table is 
> small hdfs text format data,
> the sql is:  select * from t1 cross join t2  
> the line of t1 is 49, three column
> the line of t2 is 1, one column only
> running more than 30mins and then failed
>  
>  
> spark CartesianRDD also has the same problem, example test code is:
> val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b")  //1 line 
> 1 column
>  val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b")  //49 
> line 3 column
>  val cartesian = new CartesianRDD(sc, twos, ones)
> cartesian.count()
> running more than 5 mins,while use CartesianRDD(sc, ones, twos) , it only use 
> less than 10 seconds



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24928) spark sql cross join running time too long

2018-08-03 Thread LIFULONG (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567986#comment-16567986
 ] 

LIFULONG commented on SPARK-24928:
--

for (x <- rdd1.iterator(currSplit.s1, context);
 y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)

 

code from CartesianRDD.compute() method,looks like it will load right rdd from 
text for each record in left rdd.

> spark sql cross join running time too long
> --
>
> Key: SPARK-24928
> URL: https://issues.apache.org/jira/browse/SPARK-24928
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 1.6.2
>Reporter: LIFULONG
>Priority: Minor
>
> spark sql running time is too long while input left table and right table is 
> small hdfs text format data,
> the sql is:  select * from t1 cross join t2  
> the line of t1 is 49, three column
> the line of t2 is 1, one column only
> running more than 30mins and then failed
>  
>  
> spark CartesianRDD also has the same problem, example test code is:
> val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b")  //1 line 
> 1 column
>  val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b")  //49 
> line 3 column
>  val cartesian = new CartesianRDD(sc, twos, ones)
> cartesian.count()
> running more than 5 mins,while use CartesianRDD(sc, ones, twos) , it only use 
> less than 10 seconds



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24928) spark sql cross join running time too long

2018-08-02 Thread Matthew Normyle (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567578#comment-16567578
 ] 

Matthew Normyle commented on SPARK-24928:
-

{color:#cc7832}val {color}largeRDD = 
sc.parallelize({color:#9876aa}Seq{color}.fill({color:#6897bb}1000{color})(Random.nextInt))
{color:#cc7832}val {color}smallRDD = 
sc.parallelize({color:#9876aa}Seq{color}.fill({color:#6897bb}1{color})(Random.nextInt))

*(1)* largeRDD.cartesian(smallRDD).count()

*(2)* smallRDD.cartesian(largeRDD).count()

 

Building from master, I can see that (1) consistently takes about twice as long 
as (2) on my machine.

> spark sql cross join running time too long
> --
>
> Key: SPARK-24928
> URL: https://issues.apache.org/jira/browse/SPARK-24928
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 1.6.2
>Reporter: LIFULONG
>Priority: Minor
>
> spark sql running time is too long while input left table and right table is 
> small hdfs text format data,
> the sql is:  select * from t1 cross join t2  
> the line of t1 is 49, three column
> the line of t2 is 1, one column only
> running more than 30mins and then failed
>  
>  
> spark CartesianRDD also has the same problem, example test code is:
> val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b")  //1 line 
> 1 column
>  val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b")  //49 
> line 3 column
>  val cartesian = new CartesianRDD(sc, twos, ones)
> cartesian.count()
> running more than 5 mins,while use CartesianRDD(sc, ones, twos) , it only use 
> less than 10 seconds



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24928) spark sql cross join running time too long

2018-07-26 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16558287#comment-16558287
 ] 

Marco Gaido commented on SPARK-24928:
-

The affected version is pretty old, can you check a newer version?

> spark sql cross join running time too long
> --
>
> Key: SPARK-24928
> URL: https://issues.apache.org/jira/browse/SPARK-24928
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 1.6.2
>Reporter: LIFULONG
>Priority: Minor
>
> spark sql running time is too long while input left table and right table is 
> small hdfs text format data,
> the sql is:  select * from t1 cross join t2  
> the line of t1 is 49, three column
> the line of t2 is 1, one column only
> running more than 30mins and then failed
>  
>  
> spark CartesianRDD also has the same problem, example test code is:
> val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b")  //1 line 
> 1 column
>  val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b")  //49 
> line 3 column
>  val cartesian = new CartesianRDD(sc, twos, ones)
> cartesian.count()
> running more than 5 mins,while use CartesianRDD(sc, ones, twos) , it only use 
> less than 10 seconds



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org