[jira] [Updated] (SPARK-10189) python rdd socket connection problem

2015-10-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10189:
--
Target Version/s:   (was: 1.4.1)

> python rdd socket connection problem
> 
>
> Key: SPARK-10189
> URL: https://issues.apache.org/jira/browse/SPARK-10189
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.1
>Reporter: ABHISHEK CHOUDHARY
>  Labels: pyspark, socket
>
> I am trying to use wholeTextFiles with pyspark, and now I am getting the
> same error -
> {code}
> textFiles = sc.wholeTextFiles('/file/content')
> textFiles.take(1)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py", line 1277, in take
>     res = self.context.runJob(self, takeUpToNumLeft, p, True)
>   File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/context.py", line 898, in runJob
>     return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
>   File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py", line 138, in _load_from_socket
>     raise Exception("could not open socket")
> Exception: could not open socket
> >>> 15/08/24 20:09:27 ERROR PythonRDD: Error while sending iterator
> java.net.SocketTimeoutException: Accept timed out
>     at java.net.PlainSocketImpl.socketAccept(Native Method)
>     at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404)
>     at java.net.ServerSocket.implAccept(ServerSocket.java:545)
>     at java.net.ServerSocket.accept(ServerSocket.java:513)
>     at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:623)
> {code}
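> Note that both sides of the handoff give up after roughly 3 seconds: the Python client allows 3 seconds per resolved address to connect, and judging by the "Accept timed out" above, the JVM side waits only a few seconds in accept() before closing its server socket. As a rough illustration of the race (not Spark code - the 3-second values and names below are only my reconstruction from the trace), the failure mode looks like this in pure Python:
> {code}
> import socket
> import threading
> import time
> 
> def serve_once(server):
>     # JVM-side analogue: wait up to 3 seconds for one client, then give up
>     # and close the listening socket, so late clients are refused.
>     server.settimeout(3)
>     try:
>         conn, _ = server.accept()
>         conn.close()
>     except socket.timeout:
>         print("server: accept timed out")   # ~ java.net.SocketTimeoutException
>     finally:
>         server.close()
> 
> server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> server.bind(("localhost", 0))    # ephemeral port, as the JVM side appears to use
> server.listen(1)
> port = server.getsockname()[1]
> threading.Thread(target=serve_once, args=(server,)).start()
> 
> time.sleep(4)                    # client arrives too late (slow resolver, GC, ...)
> try:
>     socket.create_connection(("localhost", port), timeout=3)
> except OSError:
>     print("client: could not open socket")  # ~ the pyspark exception above
> {code}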
> The current code in rdd.py -
> {code:title=rdd.py|borderStyle=solid}
> def _load_from_socket(port, serializer):
>     sock = None
>     # Support for both IPv4 and IPv6.
>     # On most of IPv6-ready systems, IPv6 will take precedence.
>     for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC,
>                                   socket.SOCK_STREAM):
>         af, socktype, proto, canonname, sa = res
>         try:
>             sock = socket.socket(af, socktype, proto)
>             sock.settimeout(3)
>             sock.connect(sa)
>         except socket.error:
>             sock = None
>             continue
>         break
>     if not sock:
>         raise Exception("could not open socket")
>     try:
>         rf = sock.makefile("rb", 65536)
>         for item in serializer.load_stream(rf):
>             yield item
>     finally:
>         sock.close()
> {code}
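> Since the loop above gives each address returned by getaddrinfo only a single 3-second attempt, one idea is a client that retries with a longer timeout. This is only a sketch of that idea (the load_timeout and retries parameters are mine, not Spark's, and retrying cannot help once the JVM side has already closed its socket):
> {code}
> import socket
> import time
> 
> def _load_from_socket_retrying(port, serializer, load_timeout=15, retries=3):
>     # Hypothetical variant of _load_from_socket with a longer, configurable
>     # timeout and a few retries; illustrative only, not Spark's actual fix.
>     last_err = None
>     for attempt in range(retries):
>         for af, socktype, proto, _, sa in socket.getaddrinfo(
>                 "localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM):
>             sock = socket.socket(af, socktype, proto)
>             sock.settimeout(load_timeout)
>             try:
>                 sock.connect(sa)
>             except socket.error as e:
>                 last_err = e
>                 sock.close()
>                 continue
>             try:
>                 rf = sock.makefile("rb", 65536)
>                 for item in serializer.load_stream(rf):
>                     yield item
>                 return
>             finally:
>                 sock.close()
>         time.sleep(0.1 * (attempt + 1))   # short backoff before the next round
>     raise Exception("could not open socket: %s" % (last_err,))
> {code}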
> On further investigating the issue, I realized that in context.py, runJob is
> not actually starting the server side, so there is nothing to connect to -
> {code:title=context.py|borderStyle=solid}
> port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
> {code}
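> For context, my understanding of the intended handshake is: PythonRDD.runJob on the JVM side opens a server socket on an ephemeral localhost port, streams the job's results from a background thread, and returns the port for _load_from_socket to connect to. Below is a rough Python analogue of that sequence (names and details are mine, not Spark's implementation); if the server half never starts, the client half can only ever hit "could not open socket":
> {code}
> import socket
> import threading
> 
> def serve_results(items):
>     # JVM-side analogue: bind an ephemeral port, stream items from a
>     # background thread, and hand the port back to the caller.
>     server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>     server.bind(("localhost", 0))
>     server.listen(1)
>     server.settimeout(3)          # the accept window the trace above hits
> 
>     def send():
>         try:
>             conn, _ = server.accept()
>             with conn, conn.makefile("wb") as wf:
>                 for item in items:
>                     wf.write(item)
>         except socket.timeout:
>             pass                  # no client showed up in time
>         finally:
>             server.close()
> 
>     threading.Thread(target=send).start()
>     return server.getsockname()[1]
> 
> # Client side, mirroring _load_from_socket:
> port = serve_results([b"hello ", b"world\n"])
> with socket.create_connection(("localhost", port), timeout=3) as sock:
>     print(sock.makefile("rb").read())   # b'hello world\n'
> {code}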



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10189) python rdd socket connection problem

2015-08-30 Thread ABHISHEK CHOUDHARY (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ABHISHEK CHOUDHARY updated SPARK-10189:
---
Description: 
I am trying to use wholeTextFiles with pyspark, and now I am getting the
same error -

```
textFiles = sc.wholeTextFiles('/file/content')
textFiles.take(1)
```

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py", line 1277, in take
    res = self.context.runJob(self, takeUpToNumLeft, p, True)
  File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/context.py", line 898, in runJob
    return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
  File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py", line 138, in _load_from_socket
    raise Exception("could not open socket")
Exception: could not open socket
>>> 15/08/24 20:09:27 ERROR PythonRDD: Error while sending iterator
java.net.SocketTimeoutException: Accept timed out
    at java.net.PlainSocketImpl.socketAccept(Native Method)
    at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404)
    at java.net.ServerSocket.implAccept(ServerSocket.java:545)
    at java.net.ServerSocket.accept(ServerSocket.java:513)
    at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:623)
```

The current code in rdd.py -

```
def _load_from_socket(port, serializer):
    sock = None
    # Support for both IPv4 and IPv6.
    # On most of IPv6-ready systems, IPv6 will take precedence.
    for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC,
                                  socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        try:
            sock = socket.socket(af, socktype, proto)
            sock.settimeout(3)
            sock.connect(sa)
        except socket.error:
            sock = None
            continue
        break
    if not sock:
        raise Exception("could not open socket")
    try:
        rf = sock.makefile("rb", 65536)
        for item in serializer.load_stream(rf):
            yield item
    finally:
        sock.close()
```


On further investigating the issue, I realized that in context.py, runJob is
not actually starting the server side, so there is nothing to connect to -
```
port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
```
