[ 
https://issues.apache.org/jira/browse/SPARK-12617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-12617.
--------------------------------
       Resolution: Fixed
    Fix Version/s: 1.6.1
                   1.5.3
                   2.0.0

Issue resolved by pull request 10579
[https://github.com/apache/spark/pull/10579]

> socket descriptor leak killing streaming app
> --------------------------------------------
>
>                 Key: SPARK-12617
>                 URL: https://issues.apache.org/jira/browse/SPARK-12617
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Streaming
>    Affects Versions: 1.5.2
>         Environment: pyspark (python 2.6)
>            Reporter: Antony Mayi
>            Assignee: Shixiong Zhu
>            Priority: Critical
>             Fix For: 2.0.0, 1.5.3, 1.6.1
>
>         Attachments: bug.py
>
>
> There is a socket descriptor leak in a pyspark streaming app when it is 
> configured with a batch interval of more than 30 seconds. This is due to the 
> default timeout in py4j JavaGateway, which (half-)closes the 
> CallbackConnection after 30 seconds of inactivity and creates a new one the 
> next time it is needed. The old connections don't get closed on the python 
> CallbackServer side and keep piling up until they eventually block new 
> connections.
> h2. Steps to reproduce:
> * Submit the attached [^bug.py] to Spark
> * Watch {{/tmp/bug.log}} to see the total number of py4j callback 
> connections keep increasing while none of them is ever closed
> {code}
> [BUG] py4j callback server port: 51282
> [BUG] py4j CB 0/0 closed
> ...
> [BUG] py4j CB 0/123 closed
> {code}
> * You can confirm this by running lsof on the pyspark driver process:
> {code}
> $ sudo lsof -p 39770 | grep CLOSE_WAIT | grep :51282
> python2.6 39770  das   94u  IPv4 138824906      0t0       TCP localhost.localdomain:51282->localhost.localdomain:60419 (CLOSE_WAIT)
> python2.6 39770  das   95u  IPv4 138867747      0t0       TCP localhost.localdomain:51282->localhost.localdomain:60745 (CLOSE_WAIT)
> python2.6 39770  das   96u  IPv4 138831829      0t0       TCP localhost.localdomain:51282->localhost.localdomain:32849 (CLOSE_WAIT)
> python2.6 39770  das   97u  IPv4 138890524      0t0       TCP localhost.localdomain:51282->localhost.localdomain:33184 (CLOSE_WAIT)
> python2.6 39770  das   98u  IPv4 138860190      0t0       TCP localhost.localdomain:51282->localhost.localdomain:33512 (CLOSE_WAIT)
> python2.6 39770  das   99u  IPv4 138860439      0t0       TCP localhost.localdomain:51282->localhost.localdomain:33854 (CLOSE_WAIT)
> ...
> {code}
> * If you leave it running for long enough the CallbackServer will eventually 
> become unable to accept new connections from the gateway and the app will 
> crash:
> {code}
> 16/01/02 05:12:07 ERROR scheduler.JobScheduler: Error generating jobs for time 1451711400000 ms
> py4j.Py4JException: Error while obtaining a new communication channel
> ...
> Caused by: java.net.ConnectException: Connection timed out
>         at java.net.PlainSocketImpl.socketConnect(Native Method)
>         at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>         at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>         at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>         at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>         at java.net.Socket.connect(Socket.java:589)
>         at java.net.Socket.connect(Socket.java:538)
>         at java.net.Socket.<init>(Socket.java:434)
>         at java.net.Socket.<init>(Socket.java:244)
>         at py4j.CallbackConnection.start(CallbackConnection.java:104)
> {code}
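
The mechanism described in the report (the peer half-closes an idle
connection, and the accepting side never closes its end) can be reproduced
with plain sockets. The following is only a minimal sketch of that general
pattern, not py4j or Spark code; the names serve_once and demo are
hypothetical, and the explicit half-close merely stands in for the 30 second
idle timeout mentioned above.

```python
import socket


def serve_once(listener, close_on_eof):
    """Accept one connection and read until the peer half-closes it."""
    conn, _ = listener.accept()
    while conn.recv(1024):
        pass  # drain until recv() returns b"", i.e. the peer sent FIN
    if close_on_eof:
        conn.close()  # releases the descriptor; the socket leaves CLOSE_WAIT
    return conn


def demo(close_on_eof, rounds=5):
    """Return how many accepted sockets still hold a descriptor afterwards."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(rounds)
    port = listener.getsockname()[1]
    kept = []
    for _ in range(rounds):
        client = socket.create_connection(("127.0.0.1", port))
        client.shutdown(socket.SHUT_WR)  # half-close (stand-in for the timeout)
        kept.append((client, serve_once(listener, close_on_eof)))
    # fileno() is -1 once a socket object has been closed
    still_open = sum(1 for _, conn in kept if conn.fileno() != -1)
    for client, conn in kept:  # clean up so the sketch itself does not leak
        client.close()
        conn.close()
    listener.close()
    return still_open


if __name__ == "__main__":
    print("never closing on EOF:", demo(close_on_eof=False))
    print("closing on EOF:", demo(close_on_eof=True))
```

With close_on_eof=False every accepted socket stays open on the accepting
side after the peer has gone away, which is the state lsof reports as
CLOSE_WAIT. Closing the socket as soon as recv() returns an empty result
keeps the count at zero.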



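The lsof check from the steps to reproduce can also be scripted to track the
leak over time. This is a hedged sketch that only parses lsof text passed in
as a string; count_close_wait is a hypothetical helper, SAMPLE is taken from
the output quoted above, and it assumes each lsof record sits on one line.

```python
def count_close_wait(lsof_output, port):
    """Count lsof records for CLOSE_WAIT sockets whose local port is `port`."""
    needle = ":%d->" % port  # local endpoint is printed as host:port->peer
    return sum(
        1
        for line in lsof_output.splitlines()
        if "(CLOSE_WAIT)" in line and needle in line
    )


# Two records copied from the lsof output in the report.
SAMPLE = (
    "python2.6 39770  das   94u  IPv4 138824906      0t0       TCP "
    "localhost.localdomain:51282->localhost.localdomain:60419 (CLOSE_WAIT)\n"
    "python2.6 39770  das   95u  IPv4 138867747      0t0       TCP "
    "localhost.localdomain:51282->localhost.localdomain:60745 (CLOSE_WAIT)\n"
)

if __name__ == "__main__":
    print(count_close_wait(SAMPLE, 51282))
```

Feeding it the output of "lsof -p <driver pid>" at regular intervals should
show the count for the callback server port growing for as long as the leak
is present.
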
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

