[jira] [Commented] (SPARK-18857) SparkSQL ThriftServer hangs while extracting huge data volumes in incremental collect mode

2017-01-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15817924#comment-15817924
 ] 

Sean Owen commented on SPARK-18857:
---

I think it's probably OK, since it's a significant problem and we have a 
targeted, tested fix here.

> SparkSQL ThriftServer hangs while extracting huge data volumes in incremental 
> collect mode
> --
>
> Key: SPARK-18857
> URL: https://issues.apache.org/jira/browse/SPARK-18857
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: vishal agrawal
>Assignee: Dongjoon Hyun
> Fix For: 2.2.0
>
> Attachments: GC-spark-1.6.3, GC-spark-2.0.2
>
>
> We are trying to run a SQL query on our Spark cluster and extract around 
> 200 million records through the SparkSQL ThriftServer interface. This query 
> works fine on Spark 1.6.3; on Spark 2.0.2, however, the Thrift server hangs 
> after fetching data from a few partitions (we are using incremental collect 
> mode with 400 partitions). Per the documentation, the maximum memory taken 
> up by the Thrift server should be what the largest data partition requires. 
> But we observed that the Thrift server does not release the old partitions' 
> memory when GC occurs, even after it has moved on to fetching the next 
> partition's data; this is not the case with 1.6.3.
> On further investigation we found that SparkExecuteStatementOperation.scala 
> was modified by "[SPARK-16563][SQL] fix spark sql thrift server FetchResults 
> bug", where the result set iterator was duplicated to keep a reference to 
> the first set:
> +  val (itra, itrb) = iter.duplicate
> +  iterHeader = itra
> +  iter = itrb
> We suspect this is what keeps the memory from being cleared on GC. To 
> confirm it, we created an iterator in our test class and fetched the data 
> twice: once without duplicating, and a second time after creating a 
> duplicate. The first run completed and fetched the entire data set, while 
> in the second run the driver hung after fetching data from a few partitions.
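For context on why the duplicated iterator pins memory: 
scala.collection.Iterator.duplicate backs both copies with a shared buffer, 
and every element consumed by the leading copy is queued until the lagging 
copy consumes it too. Since iterHeader is retained but never advanced, every 
row served through iter stays queued and can never be garbage-collected. A 
minimal sketch of the retention, with illustrative sizes and names (this is 
not Spark code):

  // Iterator.duplicate buffers elements consumed by one copy until the
  // other copy catches up; if the lagging copy never advances, the buffer
  // grows without bound.
  val rows: Iterator[Array[Byte]] =
    Iterator.fill(10000)(new Array[Byte](1024 * 1024)) // ~10 GB in total

  val (header, body) = rows.duplicate // shared internal buffer
  // `header` plays the role of `iterHeader`: retained, never advanced.
  while (body.hasNext) {
    body.next() // each consumed row stays queued for `header`,
                // so none of them can be reclaimed by GC
  }

Without the duplicate, each array becomes unreachable as soon as body.next() 
returns, which matches the reporter's observation that the non-duplicated run 
completes while the duplicated one hangs.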






[jira] [Commented] (SPARK-18857) SparkSQL ThriftServer hangs while extracting huge data volumes in incremental collect mode

2017-01-10 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15817178#comment-15817178
 ] 

Dongjoon Hyun commented on SPARK-18857:
---

Or, could you cherry-pick that, please?
When I tried to cherry-pick onto branch-2.0 and branch-2.1, there was no problem.




[jira] [Commented] (SPARK-18857) SparkSQL ThriftServer hangs while extracting huge data volumes in incremental collect mode

2017-01-10 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15817170#comment-15817170
 ] 

Dongjoon Hyun commented on SPARK-18857:
---

Hi, [~srowen].
This is a bug that exists in 2.0.2 and 2.1.x.
I'll create a backport for this issue.




[jira] [Commented] (SPARK-18857) SparkSQL ThriftServer hangs while extracting huge data volumes in incremental collect mode

2017-01-02 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15793668#comment-15793668
 ] 

Dongjoon Hyun commented on SPARK-18857:
---

Thank you for testing and confirming!




[jira] [Commented] (SPARK-18857) SparkSQL ThriftServer hangs while extracting huge data volumes in incremental collect mode

2017-01-02 Thread vishal agrawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15792731#comment-15792731
 ] 

vishal agrawal commented on SPARK-18857:


Thanks. It's working fine now for our scenario.




[jira] [Commented] (SPARK-18857) SparkSQL ThriftServer hangs while extracting huge data volumes in incremental collect mode

2017-01-01 Thread vishal agrawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15792152#comment-15792152
 ] 

vishal agrawal commented on SPARK-18857:


Thanks. We will test it and confirm.




[jira] [Commented] (SPARK-18857) SparkSQL ThriftServer hangs while extracting huge data volumes in incremental collect mode

2017-01-01 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15791397#comment-15791397
 ] 

Dongjoon Hyun commented on SPARK-18857:
---

Hi, [~vishalagrwal].
Could you test your case with https://github.com/apache/spark/pull/16440 ?
Although I tried to address the iterator issue you mentioned, this is 
fundamentally a memory issue, so I'm not sure whether other parts still 
consume a lot of memory in your case.
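The direction of the fix can be sketched roughly as follows. This is a 
simplified, hypothetical shape (ResultFetcher and its members are 
illustrative names, not the actual SparkExecuteStatementOperation API): 
instead of retaining a duplicated header iterator to support FETCH_FIRST, 
rebuild the iterator from the result when a rewind is requested, so rows 
that have already been served can be garbage-collected.

  import scala.collection.JavaConverters._
  import org.apache.spark.sql.{DataFrame, Row}

  // Hypothetical sketch: recompute the iterator on FETCH_FIRST instead of
  // keeping a never-advanced duplicate alive for the whole fetch.
  class ResultFetcher(df: DataFrame, incremental: Boolean) {
    private var iter: Iterator[Row] = fresh()

    private def fresh(): Iterator[Row] =
      if (incremental) df.toLocalIterator().asScala // one partition at a time
      else df.collect().iterator                    // fully materialized

    def rewind(): Unit = { iter = fresh() } // FETCH_FIRST: re-derive, keep nothing

    def fetchNext(n: Int): Seq[Row] = { // FETCH_NEXT: serve up to n rows
      val buf = scala.collection.mutable.ArrayBuffer.empty[Row]
      while (buf.size < n && iter.hasNext) buf += iter.next()
      buf
    }
  }

The trade-off is that a rewind in incremental mode re-runs the collection, 
trading CPU for the unbounded buffering that the duplicate caused.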




[jira] [Commented] (SPARK-18857) SparkSQL ThriftServer hangs while extracting huge data volumes in incremental collect mode

2016-12-31 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15789403#comment-15789403
 ] 

Sean Owen commented on SPARK-18857:
---

CC [~alicegugu]




[jira] [Commented] (SPARK-18857) SparkSQL ThriftServer hangs while extracting huge data volumes in incremental collect mode

2016-12-30 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15787622#comment-15787622
 ] 

Dongjoon Hyun commented on SPARK-18857:
---

Hi, [~vishalagrwal].
I agree with you. This is an important problem.
At least, I made a PR as a first attempt. In any case, I hope this will be 
resolved soon.




[jira] [Commented] (SPARK-18857) SparkSQL ThriftServer hangs while extracting huge data volumes in incremental collect mode

2016-12-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15787618#comment-15787618
 ] 

Apache Spark commented on SPARK-18857:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/16440




[jira] [Commented] (SPARK-18857) SparkSQL ThriftServer hangs while extracting huge data volumes in incremental collect mode

2016-12-30 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15787428#comment-15787428
 ] 

Dongjoon Hyun commented on SPARK-18857:
---

Thank you for testing and sharing that information!




[jira] [Commented] (SPARK-18857) SparkSQL ThriftServer hangs while extracting huge data volumes in incremental collect mode

2016-12-26 Thread vishal agrawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15778011#comment-15778011
 ] 

vishal agrawal commented on SPARK-18857:


We have built Spark from the 2.0.2 source code, reverting 
SparkExecuteStatementOperation.scala to its pre-SPARK-16563 version. This 
version works fine without causing any Thrift server issues.
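For reference, the pre-SPARK-16563 shape of that code consumed the result 
iterator directly, with no duplicated header copy; roughly (simplified from 
the 1.6-era source, not a verbatim quote):

  // Pre-SPARK-16563 (simplified): no iter.duplicate, so each row becomes
  // unreachable once it has been served and can be reclaimed by GC.
  iter = if (useIncrementalCollect) {
    result.rdd.toLocalIterator // streams one partition at a time
  } else {
    result.collect().iterator // materializes everything up front
  }

The cost of reverting is losing the FETCH_FIRST support that SPARK-16563 added.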




[jira] [Commented] (SPARK-18857) SparkSQL ThriftServer hangs while extracting huge data volumes in incremental collect mode

2016-12-18 Thread vishal agrawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15760203#comment-15760203
 ] 

vishal agrawal commented on SPARK-18857:


We are unable to use incremental collect in a Spark version before 2.0.2 due 
to the bug SPARK-18009.

We will have to take 2.0.2, change this particular class, and build from 
source.
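As an aside, for anyone reproducing this: incremental collect mode is 
switched on with the spark.sql.thriftServer.incrementalCollect property 
(undocumented in these releases), typically set in spark-defaults.conf 
before starting the Thrift server:

  spark.sql.thriftServer.incrementalCollect  true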




[jira] [Commented] (SPARK-18857) SparkSQL ThriftServer hangs while extracting huge data volumes in incremental collect mode

2016-12-16 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15754166#comment-15754166
 ] 

Dongjoon Hyun commented on SPARK-18857:
---

Thank you for reporting, [~vishalagrwal].
Then, on the Spark side, could you test Spark 2.0.0 from before SPARK-16563?
