[jira] [Commented] (SPARK-17666) take() or isEmpty() on dataset leaks s3a connections

2016-09-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15528038#comment-15528038
 ] 

Apache Spark commented on SPARK-17666:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/15271

> take() or isEmpty() on dataset leaks s3a connections
> 
>
> Key: SPARK-17666
> URL: https://issues.apache.org/jira/browse/SPARK-17666
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: ubuntu/centos, java 7, java 8, spark 2.0, java api
>Reporter: Igor Berman
>Priority: Critical
> Fix For: 2.1.0
>
>
> I'm experiencing problems with s3a when working with Parquet files through the Dataset API.
> The symptom is tasks failing with:
> {code}
> Caused by: org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
>   at org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:232)
>   at org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:199)
>   at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
> {code}
> Checking the CoarseGrainedExecutorBackend process with lsof showed many sockets stuck in the CLOSE_WAIT state.
> Reproduction:
> {code}
> package com.test;
>
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
>
> public class ConnectionLeakTest {
>   public static void main(String[] args) {
>     SparkConf sparkConf = new SparkConf();
>     sparkConf.setMaster("local[*]");
>     sparkConf.setAppName("Test");
>     sparkConf.set("spark.local.dir", "/tmp/spark");
>     sparkConf.set("spark.sql.shuffle.partitions", "2");
>     SparkSession session = SparkSession.builder().config(sparkConf).getOrCreate();
>     // set your credentials for your bucket
>     for (int i = 0; i < 100; i++) {
>       Dataset<Row> df = session
>           .sqlContext()
>           .read()
>           .parquet("s3a://test/*"); // the path contains multiple snappy-compressed parquet files
>       if (df.rdd().isEmpty()) { // same problem with takeAsList().isEmpty()
>         System.out.println("Yes");
>       } else {
>         System.out.println("No");
>       }
>     }
>     System.out.println("Done");
>   }
> }
> {code}
> While the program runs, you can find its pid with jps and run lsof -p <pid> | grep https; you'll see a steady growth of CLOSE_WAIT sockets.
> Our workaround is to use count() == 0 instead.
> In addition, we've seen that df.dropDuplicates("a","b").rdd().isEmpty() does not trigger the problem either.
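
For reference, a minimal sketch of the workaround described in the report (it assumes the same session and the same hypothetical s3a path as the reproduction above; the discussion below suggests why count() behaves differently):

{code}
// Workaround sketch: use count() == 0 instead of rdd().isEmpty() or take().
// count() drains every partition, so (per the discussion in this thread) the
// underlying s3a input streams are fully consumed and released.
Dataset<Row> df = session.sqlContext().read().parquet("s3a://test/*");
if (df.count() == 0) {
  System.out.println("Yes");
} else {
  System.out.println("No");
}
{code}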






[jira] [Commented] (SPARK-17666) take() or isEmpty() on dataset leaks s3a connections

2016-09-27 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15527994#comment-15527994
 ] 

Reynold Xin commented on SPARK-17666:
-

This still needs to be backported to branch-2.0.





[jira] [Commented] (SPARK-17666) take() or isEmpty() on dataset leaks s3a connections

2016-09-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15524108#comment-15524108
 ] 

Apache Spark commented on SPARK-17666:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/15245




[jira] [Commented] (SPARK-17666) take() or isEmpty() on dataset leaks s3a connections

2016-09-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15523967#comment-15523967
 ] 

Sean Owen commented on SPARK-17666:
---

Agreed, it might happen to help some cases, or even this one, but it can't help all cases. Does a finalize() help at all? Only if the iterator were somehow regularly GC'd well before the underlying stream, and I don't see how that would happen.
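
To make the finalize() idea concrete, a hypothetical safety-net wrapper might look like the sketch below (illustrative only, not Spark code; and per the point above, it only fires if the wrapper is GC'd, which there is no reason to expect before the stream itself becomes unreachable):

{code}
// Hypothetical finalizer-based safety net for an unclosed iterator.
import java.io.Closeable;
import java.util.Iterator;

class FinalizingIterator<T> implements Iterator<T> {
  private final Iterator<T> delegate;
  private final Closeable resource; // e.g. the underlying s3a input stream

  FinalizingIterator(Iterator<T> delegate, Closeable resource) {
    this.delegate = delegate;
    this.resource = resource;
  }

  @Override public boolean hasNext() { return delegate.hasNext(); }
  @Override public T next() { return delegate.next(); }

  @Override protected void finalize() throws Throwable {
    try {
      resource.close(); // last-chance cleanup, runs only when GC collects the wrapper
    } finally {
      super.finalize();
    }
  }
}
{code}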




[jira] [Commented] (SPARK-17666) take() or isEmpty() on dataset leaks s3a connections

2016-09-26 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15523921#comment-15523921
 ] 

Josh Rosen commented on SPARK-17666:


I think that one problem with that approach is that any transformation of a 
CompletionIterator will yield a new iterator which doesn't forward the close() 
call (i.e. if you {{map}} over the iterator then it will break your proposed 
check in {{Task}}).
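
A small Java illustration of the forwarding problem (hypothetical types, not Spark's actual classes): any wrapper produced by a transformation is a plain iterator, so a type-based check at the end of the task no longer sees the completion hook.

{code}
// Hypothetical illustration: mapping over a completion-aware iterator returns a
// plain Iterator, so an instanceof check downstream misses the hook entirely.
import java.util.Iterator;
import java.util.function.Function;

interface CompletionAware { void completion(); }

class MapForwardingDemo {
  static <A, B> Iterator<B> map(Iterator<A> it, Function<A, B> f) {
    return new Iterator<B>() { // anonymous class: does NOT implement CompletionAware
      @Override public boolean hasNext() { return it.hasNext(); }
      @Override public B next() { return f.apply(it.next()); }
    };
  }

  static void onTaskFinished(Iterator<?> result) {
    if (result instanceof CompletionAware) {
      ((CompletionAware) result).completion();
    } // after map(), this branch is never taken, so cleanup is silently skipped
  }
}
{code}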




[jira] [Commented] (SPARK-17666) take() or isEmpty() on dataset leaks s3a connections

2016-09-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15523799#comment-15523799
 ] 

Sean Owen commented on SPARK-17666:
---

Hm, I wonder whether a couple of problems of this form could be solved by handling the CompletionIterator case in SparkContext.runJob(), which is the thing that would eventually get a CompletionIterator and process it, if anything does, in this case. It could call completion() when it knows it's done. I might give that a shot to see if it works at all; I'm not sure that's the right fix.
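
A rough sketch of the idea (a hypothetical helper, not Spark's actual runJob internals): invoke the completion callback after the result function has taken what it needs, so cleanup happens even when the iterator is only partially consumed. Josh's objection above (chronologically later) is that by this point the iterator may be a transformed wrapper that no longer exposes the hook.

{code}
// Hypothetical sketch of "call completion() when the job knows it's done".
import java.util.Iterator;
import java.util.function.Function;

class RunJobSketch {
  interface CompletionAware { void completion(); }

  static <T, U> U runTask(Iterator<T> input, Function<Iterator<T>, U> resultFunc) {
    try {
      return resultFunc.apply(input); // e.g. a take(1) that does not drain the iterator
    } finally {
      if (input instanceof CompletionAware) {
        ((CompletionAware) input).completion(); // release resources regardless
      }
    }
  }
}
{code}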




[jira] [Commented] (SPARK-17666) take() or isEmpty() on dataset leaks s3a connections

2016-09-26 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15523742#comment-15523742
 ] 

Josh Rosen commented on SPARK-17666:


My hunch is that there's cleanup performed in a {{CompletionIterator}} once that iterator is completely consumed, but that there is no final "safety net" cleanup logic to ensure cleanup if the iterator is _not_ fully consumed (which is what happens in take).
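
To make the hunch concrete, here is a minimal Java analogue of that behavior (hypothetical; Spark's CompletionIterator is Scala code). Cleanup fires only when the iterator observes its own exhaustion, which a take(1) on a non-empty partition never does:

{code}
// Sketch: the completion callback runs only when hasNext() first returns false.
import java.util.Iterator;

class CompletionIteratorSketch<T> implements Iterator<T> {
  private final Iterator<T> delegate;
  private final Runnable completion; // e.g. close the underlying s3a input stream
  private boolean completed = false;

  CompletionIteratorSketch(Iterator<T> delegate, Runnable completion) {
    this.delegate = delegate;
    this.completion = completion;
  }

  @Override public boolean hasNext() {
    boolean more = delegate.hasNext();
    if (!more && !completed) {
      completed = true;
      completion.run(); // fires ONLY on full consumption
    }
    return more;
  }

  @Override public T next() { return delegate.next(); }
}
{code}

Under this model count() drains each partition, so the callback fires, while take(1) stops early and the connection stays checked out, which matches the CLOSE_WAIT growth in the report.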




[jira] [Commented] (SPARK-17666) take() or isEmpty() on dataset leaks s3a connections

2016-09-26 Thread Igor Berman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15523303#comment-15523303
 ] 

Igor Berman commented on SPARK-17666:
-

Sorry, still no idea. What's the difference between take and count? Why doesn't count produce this problem? From my point of view as an end user, Spark has a problem, something like "not returning the connection to the s3a connection pool", but I might be wrong. I'll try to investigate further.




[jira] [Commented] (SPARK-17666) take() or isEmpty() on dataset leaks s3a connections

2016-09-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15523290#comment-15523290
 ] 

Sean Owen commented on SPARK-17666:
---

No, I mean: what about take() do you believe leaks a connection? Is it a Spark problem or an S3 problem?




[jira] [Commented] (SPARK-17666) take() or isEmpty() on dataset leaks s3a connections

2016-09-26 Thread Igor Berman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15523286#comment-15523286
 ] 

Igor Berman commented on SPARK-17666:
-

[~sowen], yes, I believe so. rdd.isEmpty() uses take internally, as far as I understand.
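
For context, the equivalence can be sketched at the API level (an assumption based on this thread, not a quote of Spark's source; df is the Dataset from the reproduction):

{code}
// If isEmpty() is implemented via take(1), as suggested here, both calls below
// exercise the same partial-consumption path and both leak in the same way.
boolean viaIsEmpty = df.rdd().isEmpty();
boolean viaTake = df.javaRDD().take(1).isEmpty(); // take(1) returns a java.util.List
{code}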





[jira] [Commented] (SPARK-17666) take() or isEmpty() on dataset leaks s3a connections

2016-09-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15523280#comment-15523280
 ] 

Sean Owen commented on SPARK-17666:
---

Where are you saying the leak is? In take()?




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org