Re: Pyspark Error

2014-11-18 Thread Shannon Quinn
My best guess would be a networking issue--it looks like the Python 
socket library isn't able to connect to whatever hostname you're 
providing Spark in the configuration.
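
A minimal, hypothetical way to test that theory with nothing but the Python 
standard library (this is not Spark code, and the hostname used below is an 
assumption -- substitute whatever name your configuration gives Spark):

from __future__ import print_function
import socket

# Substitute whichever name Spark is being given (SPARK_LOCAL_IP, the
# master hostname, or just "localhost"); gethostname() is only a default.
hostname = socket.gethostname()
print("hostname:", hostname)
print("resolves to:", socket.gethostbyname(hostname))

# Try binding an ephemeral local port on that name, roughly the kind of
# local socket setup PySpark performs while starting up.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind((hostname, 0))
print("bind OK on", s.getsockname())
s.close()

If either the lookup or the bind fails, the fix is in /etc/hosts (or the 
hostname you give Spark), not in Spark itself.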


On 11/18/14 9:10 AM, amin mohebbi wrote:

Hi there,

I have already downloaded the pre-built spark-1.1.0 package. I want to run 
pyspark by typing ./bin/pyspark, but I get the error shown further below.

The Scala shell is up and working fine:

hduser@master:~/Downloads/spark-1.1.0$ ./bin/spark-shell
Java HotSpot(TM) Client VM warning: ignoring option MaxPermSize=128m; 
support was removed in 8.0
Using Spark's default log4j profile: 
org/apache/spark/log4j-defaults.properties

.
.
14/11/18 04:33:13 INFO AkkaUtils: Connecting to HeartbeatReceiver: 
akka.tcp://sparkDriver@master:34937/user/HeartbeatReceiver

14/11/18 04:33:13 INFO SparkILoop: Created spark context..
Spark context available as sc.

scala>
hduser@master:~/Downloads/spark-1.1.0$


But the Python shell does not work:

hduser@master:~/Downloads/spark-1.1.0$
hduser@master:~/Downloads/spark-1.1.0$
hduser@master:~/Downloads/spark-1.1.0$ ./bin/pyspark
Python 2.7.3 (default, Feb 27 2014, 20:00:17)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Java HotSpot(TM) Client VM warning: ignoring option MaxPermSize=128m; 
support was removed in 8.0
Using Spark's default log4j profile: 
org/apache/spark/log4j-defaults.properties

14/11/18 04:36:06 INFO SecurityManager: Changing view acls to: hduser,
14/11/18 04:36:06 INFO SecurityManager: Changing modify acls to: hduser,
14/11/18 04:36:06 INFO SecurityManager: SecurityManager: 
authentication disabled; ui acls disabled; users with view 
permissions: Set(hduser, ); users with modify permissions: Set(hduser, )

14/11/18 04:36:06 INFO Slf4jLogger: Slf4jLogger started
14/11/18 04:36:06 INFO Remoting: Starting remoting
14/11/18 04:36:06 INFO Remoting: Remoting started; listening on 
addresses :[akka.tcp://sparkDriver@master:52317]
14/11/18 04:36:06 INFO Remoting: Remoting now listens on addresses: 
[akka.tcp://sparkDriver@master:52317]
14/11/18 04:36:06 INFO Utils: Successfully started service 
'sparkDriver' on port 52317.

14/11/18 04:36:06 INFO SparkEnv: Registering MapOutputTracker
14/11/18 04:36:06 INFO SparkEnv: Registering BlockManagerMaster
14/11/18 04:36:06 INFO DiskBlockManager: Created local directory at 
/tmp/spark-local-20141118043606-c346
14/11/18 04:36:07 INFO Utils: Successfully started service 'Connection 
manager for block manager' on port 47507.
14/11/18 04:36:07 INFO ConnectionManager: Bound socket to port 47507 
with id = ConnectionManagerId(master,47507)
14/11/18 04:36:07 INFO MemoryStore: MemoryStore started with capacity 
267.3 MB

14/11/18 04:36:07 INFO BlockManagerMaster: Trying to register BlockManager
14/11/18 04:36:07 INFO BlockManagerMasterActor: Registering block 
manager master:47507 with 267.3 MB RAM

14/11/18 04:36:07 INFO BlockManagerMaster: Registered BlockManager
14/11/18 04:36:07 INFO HttpFileServer: HTTP File server directory is 
/tmp/spark-8b29544a-c74b-4a3e-88e0-13801c8dcc65

14/11/18 04:36:07 INFO HttpServer: Starting HTTP Server
14/11/18 04:36:07 INFO Utils: Successfully started service 'HTTP file 
server' on port 40029.
14/11/18 04:36:12 INFO Utils: Successfully started service 'SparkUI' 
on port 4040.
14/11/18 04:36:12 INFO SparkUI: Started SparkUI at http://master:4040 
http://master:4040/
14/11/18 04:36:12 INFO AkkaUtils: Connecting to HeartbeatReceiver: 
akka.tcp://sparkDriver@master:52317/user/HeartbeatReceiver
14/11/18 04:36:12 INFO SparkUI: Stopped Spark web UI at 
http://master:4040 http://master:4040/

14/11/18 04:36:12 INFO DAGScheduler: Stopping DAGScheduler
14/11/18 04:36:13 INFO MapOutputTrackerMasterActor: 
MapOutputTrackerActor stopped!

14/11/18 04:36:13 INFO ConnectionManager: Selector thread was interrupted!
14/11/18 04:36:13 INFO ConnectionManager: ConnectionManager stopped
14/11/18 04:36:13 INFO MemoryStore: MemoryStore cleared
14/11/18 04:36:13 INFO BlockManager: BlockManager stopped
14/11/18 04:36:13 INFO BlockManagerMaster: BlockManagerMaster stopped
14/11/18 04:36:13 INFO RemoteActorRefProvider$RemotingTerminator: 
Shutting down remote daemon.

14/11/18 04:36:13 INFO SparkContext: Successfully stopped SparkContext
14/11/18 04:36:13 INFO RemoteActorRefProvider$RemotingTerminator: 
Remote daemon shut down; proceeding with flushing remote transports.

14/11/18 04:36:13 INFO Remoting: Remoting shut down
14/11/18 04:36:13 INFO RemoteActorRefProvider$RemotingTerminator: 
Remoting shut down.

Traceback (most recent call last):
  File "/home/hduser/Downloads/spark-1.1.0/python/pyspark/shell.py", line 44, in <module>
    sc = SparkContext(appName="PySparkShell", pyFiles=add_files)
  File "/home/hduser/Downloads/spark-1.1.0/python/pyspark/context.py", line 107, in __init__
    conf)
  File "/home/hduser/Downloads/spark-1.1.0/python/pyspark/context.py", line 159, in _do_init
    self._accumulatorServer = accumulators._start_update_server()
  File 

Iterative transformations over RDD crashes in phantom reduce

2014-11-18 Thread Shannon Quinn

Hi all,

This is somewhat related to my previous question ( 
http://apache-spark-user-list.1001560.n3.nabble.com/Iterative-changes-to-RDD-and-broadcast-variables-tt19042.html 
, for additional context) but for all practical purposes this is its own 
issue.


As in my previous question, I'm making iterative changes to an RDD, 
where each iteration depends on the results of the previous one. I've 
stripped down what was previously a loop to just be two sequential edits 
to try and nail down where the problem is. It looks like this:


index = 0
INDEX = sc.broadcast(index)
M = M.flatMap(func1).reduceByKey(func2)
M.foreach(debug_output)
index = 1
INDEX = sc.broadcast(index)
M = M.flatMap(func1)
M.foreach(debug_output)

M is basically a row-indexed matrix, where each index points to a 
dictionary (sparse matrix more or less, with some domain-specific 
modifications). This program crashes on the second-to-last (7th) line; 
the creepy part is that it says the crash happens in func2 with the 
broadcast variable INDEX == 1 (it attempts to access an entry that 
doesn't exist in a dictionary of one of the rows).
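
(For concreteness, a hypothetical sketch of the layout being described; func2 
below is a stand-in, not the code from the program:)

# Each RDD element is (row_index, sparse_row_dict).
M = sc.parallelize([
    (0, {0: 1.0, 3: 2.0}),
    (1, {1: 0.5, 2: 1.5}),
])

def func2(row_a, row_b):
    # reduceByKey-style merge of two sparse rows that share the same row
    # index; the real func2 also reads INDEX.value, which is where the
    # crash with INDEX == 1 is being reported.
    merged = dict(row_a)
    for col, val in row_b.items():
        merged[col] = merged.get(col, 0.0) + val
    return merged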


How is that even possible? Am I missing something fundamental about how 
Spark works under the hood?


Thanks for your help!

Shannon

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Iterative transformations over RDD crashes in phantom reduce

2014-11-18 Thread Shannon Quinn
To clarify about what, precisely, is impossible: the crash happens with 
INDEX == 1 in func2, but func2 is only called in the reduceByKey 
transformation when INDEX == 0. And according to the output of the 
foreach() in line 4, that reduceByKey(func2) works just fine. How is it 
then invoked again with INDEX == 1 when there clearly isn't another 
reduce call at line 7?


On 11/18/14 1:58 PM, Shannon Quinn wrote:

Hi all,

This is somewhat related to my previous question ( 
http://apache-spark-user-list.1001560.n3.nabble.com/Iterative-changes-to-RDD-and-broadcast-variables-tt19042.html 
, for additional context) but for all practical purposes this is its 
own issue.


As in my previous question, I'm making iterative changes to an RDD, 
where each iteration depends on the results of the previous one. I've 
stripped down what was previously a loop to just be two sequential 
edits to try and nail down where the problem is. It looks like this:


index = 0
INDEX = sc.broadcast(index)
M = M.flatMap(func1).reduceByKey(func2)
M.foreach(debug_output)
index = 1
INDEX = sc.broadcast(index)
M = M.flatMap(func1)
M.foreach(debug_output)

M is basically a row-indexed matrix, where each index points to a 
dictionary (sparse matrix more or less, with some domain-specific 
modifications). This program crashes on the second-to-last (7th) line; 
the creepy part is that it says the crash happens in func2 with the 
broadcast variable INDEX == 1 (it attempts to access an entry that 
doesn't exist in a dictionary of one of the rows).


How is that even possible? Am I missing something fundamental about 
how Spark works under the hood?


Thanks for your help!

Shannon



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Iterative transformations over RDD crashes in phantom reduce

2014-11-18 Thread Shannon Quinn
Sorry everyone--turns out an oft-forgotten single line of code was 
required to make this work:


index = 0
INDEX = sc.broadcast(index)
M = M.flatMap(func1).reduceByKey(func2)
M.foreach(debug_output)
M.cache()
index = 1
INDEX = sc.broadcast(index)
M = M.flatMap(func1)
M.foreach(debug_output)

Works as expected now, and I understand why it was failing before: because the 
RDD is evaluated lazily, the second action replayed the whole lineage 
(including the reduceByKey(func2)), and by that point the broadcast variable 
held index == 1. Caching M prevents that recomputation.
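
(A stripped-down, hypothetical reproduction of that behaviour; func1 and func2 
here are stand-ins that only exist to show where the broadcast value is read:)

INDEX = sc.broadcast(0)

def func1(kv):
    return [kv]                         # identity flatMap

def func2(a, b):
    # Runs on the executors; reads whatever broadcast object the closure
    # was shipped with when the action was triggered.
    print("func2 sees INDEX = %d" % INDEX.value)
    return a

M = sc.parallelize([(0, "x"), (0, "y")])
M = M.flatMap(func1).reduceByKey(func2)
M.foreach(lambda kv: None)              # action 1: func2 sees INDEX == 0

INDEX = sc.broadcast(1)
M = M.flatMap(func1)
M.foreach(lambda kv: None)              # action 2: without a cache() before
                                        # this point, the reduceByKey lineage
                                        # is replayed and func2 sees INDEX == 1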


On 11/18/14 2:02 PM, Shannon Quinn wrote:
To clarify about what, precisely, is impossible: the crash happens 
with INDEX == 1 in func2, but func2 is only called in the reduceByKey 
transformation when INDEX == 0. And according to the output of the 
foreach() in line 4, that reduceByKey(func2) works just fine. How is 
it then invoked again with INDEX == 1 when there clearly isn't another 
reduce call at line 7?


On 11/18/14 1:58 PM, Shannon Quinn wrote:

Hi all,

This is somewhat related to my previous question ( 
http://apache-spark-user-list.1001560.n3.nabble.com/Iterative-changes-to-RDD-and-broadcast-variables-tt19042.html 
, for additional context) but for all practical purposes this is its 
own issue.


As in my previous question, I'm making iterative changes to an RDD, 
where each iteration depends on the results of the previous one. I've 
stripped down what was previously a loop to just be two sequential 
edits to try and nail down where the problem is. It looks like this:


index = 0
INDEX = sc.broadcast(index)
M = M.flatMap(func1).reduceByKey(func2)
M.foreach(debug_output)
index = 1
INDEX = sc.broadcast(index)
M = M.flatMap(func1)
M.foreach(debug_output)

M is basically a row-indexed matrix, where each index points to a 
dictionary (sparse matrix more or less, with some domain-specific 
modifications). This program crashes on the second-to-last (7th) 
line; the creepy part is that it says the crash happens in func2 
with the broadcast variable INDEX == 1 (it attempts to access an 
entry that doesn't exist in a dictionary of one of the rows).


How is that even possible? Am I missing something fundamental about 
how Spark works under the hood?


Thanks for your help!

Shannon






Iterative changes to RDD and broadcast variables

2014-11-16 Thread Shannon Quinn

Hi all,

I'm iterating over an RDD (representing a distributed matrix...have to 
roll my own in Python) and making changes to different submatrices at 
each iteration. The loop structure looks something like:


for i in range(x):
  VAR = sc.broadcast(i)
  rdd.map(func1).reduceByKey(func2)
M = rdd.collect()

where func1 and func2 use the current value of VAR for that iteration.

Because there aren't any actions in the main loop, nothing actually 
happens until the collect method is called. I'm running into problems 
I can't diagnose (*extremely* long execution time for no particular 
reason, among others); is this code even valid? If not, how should I make 
in-place iterative edits to different portions of a matrix, where each 
subsequent edit depends on the edits from the previous iteration?
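
(For reference, a hypothetical rewrite of the loop with trivial stand-ins for 
func1/func2, using the rebind-and-cache pattern that resolved the phantom 
reduce thread above; the data and functions are made up:)

x = 3
rdd = sc.parallelize([(0, 1.0), (1, 2.0)])

for i in range(x):
    VAR = sc.broadcast(i)
    func1 = lambda kv: (kv[0], kv[1] + VAR.value)   # stand-in
    func2 = lambda a, b: a + b                      # stand-in
    rdd = rdd.map(func1).reduceByKey(func2)
    rdd.cache()
    rdd.count()   # force this iteration now, while VAR still holds i
M = rdd.collect()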


Thanks in advance!

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Dividing tasks among Spark workers

2014-07-18 Thread Shannon Quinn

The default # of partitions is the # of cores, correct?

On 7/18/14, 10:53 AM, Yanbo Liang wrote:

Check how many partitions your program uses.
If there is only one, increasing the number of partitions will make the 
execution parallel.
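
(A hedged PySpark illustration of that advice; the input path is a placeholder 
and the partition counts are examples only:)

# Read the input with more partitions up front...
rdd = sc.textFile("hdfs:///some/input", minPartitions=32)

# ...pass a partition count to wide operations...
counts = rdd.flatMap(lambda line: line.split()) \
            .map(lambda w: (w, 1)) \
            .reduceByKey(lambda a, b: a + b, numPartitions=32)

# ...or reshuffle an existing RDD into more partitions.
counts = counts.repartition(32)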



2014-07-18 20:57 GMT+08:00 Madhura das.madhur...@gmail.com 
mailto:das.madhur...@gmail.com:


I am running my program on a spark cluster but when I look into my
UI while
the job is running I see that only one worker does most of the
tasks. My
cluster has one master and 4 workers where the master is also a
worker.

I want my task to complete as quickly as possible and I believe
that if the
number of tasks were to be divided equally among the workers, the
job will
be completed faster.

Is there any way I can customize the number of tasks on each worker?

http://apache-spark-user-list.1001560.n3.nabble.com/file/n10160/Question.png



--
View this message in context:

http://apache-spark-user-list.1001560.n3.nabble.com/Dividing-tasks-among-Spark-workers-tp10160.html
Sent from the Apache Spark User List mailing list archive at
Nabble.com.






Re: Python: saving/reloading RDD

2014-07-18 Thread Shannon Quinn
+1, had to learn this the hard way when some of my objects were written 
as pointers, rather than translated correctly to strings :)


On 7/18/14, 11:52 AM, Xiangrui Meng wrote:

You can save RDDs to text files using RDD.saveAsTextFile and load them back 
using sc.textFile. But make sure the record-to-string conversion is correctly 
implemented if the type is not primitive, and that you have a parser to load 
them back. -Xiangrui
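
(A small, hypothetical example of the round trip Xiangrui describes; the 
tab-separated format and the output path are arbitrary choices:)

# Save an RDD of (key, value) pairs as text...
pairs = sc.parallelize([("a", 1), ("b", 2)])
pairs.map(lambda kv: "%s\t%d" % kv).saveAsTextFile("/tmp/pairs_out")

# ...and parse it back with a matching reader.
def parse(line):
    key, value = line.split("\t")
    return (key, int(value))

restored = sc.textFile("/tmp/pairs_out").map(parse)
print(restored.collect())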


On Jul 18, 2014, at 8:39 AM, Roch Denis rde...@exostatic.com wrote:

Hello,

Just to make sure I correctly read the docs and the forums: it's my 
understanding that currently, in Python with Spark 1.0.1, there is no way to 
save my RDD to disk such that I can just reload it. The Hadoop RDD formats are 
not yet available in Python.

Is that correct? I just want to make sure that's the case before I write a
workaround.

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Python-saving-reloading-RDD-tp10172.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Job aborted due to stage failure: TID x failed for unknown reasons

2014-07-18 Thread Shannon Quinn

Hi all,

I'm dealing with some strange error messages that I *think* come down to a 
memory issue, but I'm having a hard time pinning it down and could use some 
guidance from the experts.


I have a 2-machine Spark (1.0.1) cluster. Both machines have 8 cores; 
one has 16GB memory, the other 32GB (which is the master). My 
application involves computing pairwise pixel affinities in images, 
though the images I've tested so far only get as big as 1920x1200, and 
as small as 16x16.


I did have to change a few memory and parallelism settings, otherwise I was 
getting explicit OutOfMemoryExceptions. In spark-defaults.conf:


spark.executor.memory        14g
spark.default.parallelism    32
spark.akka.frameSize         1000

In spark-env.sh:

SPARK_DRIVER_MEMORY=10G
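
(For reference, a hedged sketch of setting the same values from the driver 
program instead of spark-defaults.conf; the app name is a placeholder. Driver 
memory is deliberately left out, since it must be set before the driver JVM 
starts, i.e. via spark-env.sh or spark-submit, not from Python:)

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("affinities")
        .set("spark.executor.memory", "14g")
        .set("spark.default.parallelism", "32")
        .set("spark.akka.frameSize", "1000"))
sc = SparkContext(conf=conf)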

With those settings, however, I get a bunch of WARN statements about 
Lost TIDs (no task is successfully completed) in addition to lost 
Executors, which are repeated 4 times until I finally get the following 
error message and crash:


---

14/07/18 12:06:20 INFO TaskSchedulerImpl: Cancelling stage 0
14/07/18 12:06:20 INFO DAGScheduler: Failed to run collect at 
/home/user/Programming/PySpark-Affinities/affinity.py:243

Traceback (most recent call last):
  File "/home/user/Programming/PySpark-Affinities/affinity.py", line 243, in <module>
    lambda x: np.abs(IMAGE.value[x[0]] - IMAGE.value[x[1]])
  File "/net/antonin/home/user/Spark/spark-1.0.1-bin-hadoop2/python/pyspark/rdd.py", line 583, in collect
    bytesInJava = self._jrdd.collect().iterator()
  File "/net/antonin/home/user/Spark/spark-1.0.1-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 537, in __call__
  File "/net/antonin/home/user/Spark/spark-1.0.1-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value

py4j.protocol.Py4JJavaError: An error occurred while calling o27.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0.0:13 failed 4 times, most recent failure: *TID 32 on host 
master.host.univ.edu failed for unknown reason*

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)

at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)

at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)

at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


14/07/18 12:06:20 INFO DAGScheduler: Executor lost: 4 (epoch 4)
14/07/18 12:06:20 INFO BlockManagerMasterActor: Trying to remove 
executor 4 from BlockManagerMaster.
14/07/18 12:06:20 INFO BlockManagerMaster: Removed 4 successfully in 
removeExecutor

user@master:~/Programming/PySpark-Affinities$

---

If I run the really small image instead (16x16), it *appears* to run to 
completion (gives me the output I expect without any exceptions being 
thrown). However, in the stderr logs for the app that was run, it lists 
the state as KILLED, with the final message being "ERROR 
CoarseGrainedExecutorBackend: Driver Disassociated". If I run any larger 
images, I get the exception I pasted above.


Furthermore, if I just do a spark-submit with master=local[*], aside 
from still needing to set the aforementioned memory options, it will 
work for an image of *any* size (I've tested both machines 
independently; they both do this when running as local[*]), whereas 
working on a cluster will result in the aforementioned crash at stage 0 
with anything but the smallest images.


Any ideas what is going on?

Thank you very much in advance!

Regards,

Re: Spark standalone network configuration problems

2014-06-27 Thread Shannon Quinn
I put the settings as you specified in spark-env.sh for the master. When 
I run start-all.sh, the web UI shows both the worker on the master 
(machine1) and the slave worker (machine2) as ALIVE and ready, with the 
master URL at spark://192.168.1.101. However, when I run spark-submit, 
it immediately crashes with


py4j.protocol.Py4JJavaError
14/06/27 09:01:32 ERROR Remoting: Remoting error: [Startup failed]

akka.remote.RemoteTransportException: Startup failed
[...]
org.jboss.netty.channel.ChannelException: Failed to bind to 
/192.168.1.101:5060

[...]
java.net.BindException: Address already in use.
[...]

This seems entirely contrary to intuition; why would Spark be unable to 
bind to the exact IP:port set for the master?


On 6/27/14, 1:54 AM, Akhil Das wrote:

Hi Shannon,

How about a setting like the following? (just removed the quotes)

export SPARK_MASTER_IP=192.168.1.101
export SPARK_MASTER_PORT=5060
#export SPARK_LOCAL_IP=127.0.0.1

Not sure what's happening in your case; it could be that your system is 
not able to bind to the 192.168.1.101 address. What is the spark:// master 
URL that you are seeing in the web UI? (It should be 
spark://192.168.1.101:7077 in your case.)




Thanks
Best Regards


On Fri, Jun 27, 2014 at 5:47 AM, Shannon Quinn squ...@gatech.edu 
mailto:squ...@gatech.edu wrote:


In the interest of completeness, this is how I invoke spark:

[on master]

 sbin/start-all.sh
 spark-submit --py-files extra.py main.py

iPhone'd

On Jun 26, 2014, at 17:29, Shannon Quinn squ...@gatech.edu
mailto:squ...@gatech.edu wrote:


My *best guess* (please correct me if I'm wrong) is that the
master (machine1) is sending the command to the worker (machine2)
with the localhost argument as-is; that is, machine2 isn't doing
any weird address conversion on its end.

Consequently, I've been focusing on the settings of the
master/machine1. But I haven't found anything to indicate where
the localhost argument could be coming from. /etc/hosts lists
only 127.0.0.1 as localhost; spark-defaults.conf list
spark.master as the full IP address (not 127.0.0.1); spark-env.sh
on the master also lists the full IP under SPARK_MASTER_IP. The
*only* place on the master where it's associated with localhost
is SPARK_LOCAL_IP.

In looking at the logs of the worker spawned on master, it's also
receiving a spark://localhost:5060 argument, but since it
resides on the master that works fine. Is it possible that the
master is, for some reason, passing
spark://{SPARK_LOCAL_IP}:5060 to the workers?

That was my motivation behind commenting out SPARK_LOCAL_IP;
however, that's when the master crashes immediately due to the
address already being in use.

Any ideas? Thanks!

Shannon

On 6/26/14, 10:14 AM, Akhil Das wrote:

Can you paste your spark-env.sh file?

Thanks
Best Regards


On Thu, Jun 26, 2014 at 7:01 PM, Shannon Quinn
squ...@gatech.edu mailto:squ...@gatech.edu wrote:

Both /etc/hosts have each other's IP addresses in them.
Telneting from machine2 to machine1 on port 5060 works just
fine.

Here's the output of lsof:

user@machine1:~/spark/spark-1.0.0-bin-hadoop2$
mailto:user@machine1:%7E/spark/spark-1.0.0-bin-hadoop2$
lsof -i:5060
COMMAND   PID   USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
java23985 user   30u  IPv6 11092354  0t0  TCP
machine1:sip (LISTEN)
java23985 user   40u  IPv6 11099560  0t0  TCP
machine1:sip-machine1:48315 (ESTABLISHED)
java23985 user   52u  IPv6 11100405  0t0  TCP
machine1:sip-machine2:54476 (ESTABLISHED)
java24157 user   40u  IPv6 11092413  0t0  TCP
machine1:48315-machine1:sip (ESTABLISHED)

Ubuntu seems to recognize 5060 as the standard port for
sip; it's not actually running anything there besides
Spark, it just does a s/5060/sip/g.

Is there something to the fact that every time I comment out
SPARK_LOCAL_IP in spark-env, it crashes immediately upon
spark-submit due to the address already being in use? Or
am I barking up the wrong tree on that one?

Thanks again for all your help; I hope we can knock this one
out.

Shannon


On 6/26/14, 9:13 AM, Akhil Das wrote:

Do you have ip machine1 in your workers
/etc/hosts also? If so try telneting from your machine2 to
machine1 on port 5060. Also make sure nothing else is
running on port 5060 other than Spark (*/lsof -i:5060/*)

Thanks
Best Regards


On Thu, Jun 26, 2014 at 6:35 PM, Shannon Quinn
squ...@gatech.edu mailto:squ...@gatech.edu wrote:

Still running into the same problem. /etc/hosts on the
master says

127.0.0.1    localhost
ip    machine1

ip

Re: Spark standalone network configuration problems

2014-06-27 Thread Shannon Quinn
No joy, unfortunately. Same issue; see my previous email--still crashes 
with address already in use.


On 6/27/14, 1:54 AM, sujeetv wrote:

Try to explicitly set set the spark.driver.host property to the master's
IP.
Sujeet



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-standalone-network-configuration-problems-tp8304p8396.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark standalone network configuration problems

2014-06-27 Thread Shannon Quinn
Sorry, the master Spark URL in the web UI is *spark://192.168.1.101:5060*, 
exactly as configured.


On 6/27/14, 9:07 AM, Shannon Quinn wrote:
I put the settings as you specified in spark-env.sh for the master. 
When I run start-all.sh, the web UI shows both the worker on the 
master (machine1) and the slave worker (machine2) as ALIVE and ready, 
with the master URL at spark://192.168.1.101. However, when I run 
spark-submit, it immediately crashes with


py4j.protocol.Py4JJavaError
14/06/27 09:01:32 ERROR Remoting: Remoting error: [Startup failed]

akka.remote.RemoteTransportException: Startup failed
[...]
org.jboss.netty.channel.ChannelException: Failed to bind to 
/192.168.1.101:5060

[...]
java.net.BindException: Address already in use.
[...]

This seems entirely contrary to intuition; why would Spark be unable 
to bind to the exact IP:port set for the master?


On 6/27/14, 1:54 AM, Akhil Das wrote:

Hi Shannon,

How about a setting like the following? (just removed the quotes)

export SPARK_MASTER_IP=192.168.1.101
export SPARK_MASTER_PORT=5060
#export SPARK_LOCAL_IP=127.0.0.1

Not sure what's happening in your case; it could be that your system 
is not able to bind to the 192.168.1.101 address. What is the spark:// 
master URL that you are seeing in the web UI? (It should be 
spark://192.168.1.101:7077 in your case.)




Thanks
Best Regards


On Fri, Jun 27, 2014 at 5:47 AM, Shannon Quinn squ...@gatech.edu 
mailto:squ...@gatech.edu wrote:


In the interest of completeness, this is how I invoke spark:

[on master]

 sbin/start-all.sh
 spark-submit --py-files extra.py main.py

iPhone'd

On Jun 26, 2014, at 17:29, Shannon Quinn squ...@gatech.edu
mailto:squ...@gatech.edu wrote:


My *best guess* (please correct me if I'm wrong) is that the
master (machine1) is sending the command to the worker
(machine2) with the localhost argument as-is; that is, machine2
isn't doing any weird address conversion on its end.

Consequently, I've been focusing on the settings of the
master/machine1. But I haven't found anything to indicate where
the localhost argument could be coming from. /etc/hosts lists
only 127.0.0.1 as localhost; spark-defaults.conf list
spark.master as the full IP address (not 127.0.0.1);
spark-env.sh on the master also lists the full IP under
SPARK_MASTER_IP. The *only* place on the master where it's
associated with localhost is SPARK_LOCAL_IP.

In looking at the logs of the worker spawned on master, it's
also receiving a spark://localhost:5060 argument, but since it
resides on the master that works fine. Is it possible that the
master is, for some reason, passing
spark://{SPARK_LOCAL_IP}:5060 to the workers?

That was my motivation behind commenting out SPARK_LOCAL_IP;
however, that's when the master crashes immediately due to the
address already being in use.

Any ideas? Thanks!

Shannon

On 6/26/14, 10:14 AM, Akhil Das wrote:

Can you paste your spark-env.sh file?

Thanks
Best Regards


On Thu, Jun 26, 2014 at 7:01 PM, Shannon Quinn
squ...@gatech.edu mailto:squ...@gatech.edu wrote:

Both /etc/hosts have each other's IP addresses in them.
Telneting from machine2 to machine1 on port 5060 works just
fine.

Here's the output of lsof:

user@machine1:~/spark/spark-1.0.0-bin-hadoop2$
mailto:user@machine1:%7E/spark/spark-1.0.0-bin-hadoop2$
lsof -i:5060
COMMAND   PID   USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
java23985 user   30u  IPv6 11092354  0t0  TCP
machine1:sip (LISTEN)
java23985 user   40u  IPv6 11099560  0t0  TCP
machine1:sip-machine1:48315 (ESTABLISHED)
java23985 user   52u  IPv6 11100405  0t0  TCP
machine1:sip-machine2:54476 (ESTABLISHED)
java24157 user   40u  IPv6 11092413  0t0  TCP
machine1:48315-machine1:sip (ESTABLISHED)

Ubuntu seems to recognize 5060 as the standard port for
sip; it's not actually running anything there besides
Spark, it just does a s/5060/sip/g.

Is there something to the fact that every time I comment
out SPARK_LOCAL_IP in spark-env, it crashes immediately
upon spark-submit due to the address already being in
use? Or am I barking up the wrong tree on that one?

Thanks again for all your help; I hope we can knock this
one out.

Shannon


On 6/26/14, 9:13 AM, Akhil Das wrote:

Do you have ip machine1 in your workers
/etc/hosts also? If so try telneting from your machine2 to
machine1 on port 5060. Also make sure nothing else is
running on port 5060 other than Spark (*/lsof -i:5060/*)

Thanks
Best Regards


On Thu, Jun 26, 2014 at 6:35 PM, Shannon Quinn
squ...@gatech.edu mailto:squ...@gatech.edu wrote:

Still running

Re: numpy + pyspark

2014-06-27 Thread Shannon Quinn
Would deploying virtualenv on each directory on the cluster be viable? 
The dependencies would get tricky but I think this is the sort of 
situation it's built for.
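
(For what it's worth, the --py-files route discussed below maps to 
SparkContext.addPyFile for pure-Python dependencies; a hedged sketch, with the 
module path and its transform function entirely made up. numpy's compiled 
C/Fortran extensions are exactly what makes this route unreliable for it:)

from pyspark import SparkContext

sc = SparkContext(appName="shipDeps")
sc.addPyFile("/path/to/mymodule.py")      # hypothetical pure-Python module

def use_dep(x):
    import mymodule                       # importable on the executors
    return mymodule.transform(x)          # hypothetical function

print(sc.parallelize(range(4)).map(use_dep).collect())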


On 6/27/14, 11:06 AM, Avishek Saha wrote:
I too felt the same Nick but I don't have root privileges on the 
cluster, unfortunately. Are there any alternatives?



On 27 June 2014 08:04, Nick Pentreath nick.pentre...@gmail.com 
mailto:nick.pentre...@gmail.com wrote:


I've not tried this - but numpy is a tricky and complex package
with many dependencies on Fortran/C libraries etc. I'd say by the
time you figure out correctly deploying numpy in this manner, you
may as well have just built it into your cluster bootstrap
process, or PSSH install it on each node...


On Fri, Jun 27, 2014 at 4:58 PM, Avishek Saha
avishek.s...@gmail.com mailto:avishek.s...@gmail.com wrote:

To clarify I tried it and it almost worked -- but I am getting
some problems from the Random module in numpy. If anyone has
successfully passed a numpy module (via the --py-files option)
to spark-submit then please let me know.

Thanks !!
Avishek


On 26 June 2014 17:45, Avishek Saha avishek.s...@gmail.com
mailto:avishek.s...@gmail.com wrote:

Hi all,

Instead of installing numpy in each worker node, is it
possible to
ship numpy (via --py-files option maybe) while invoking the
spark-submit?

Thanks,
Avishek








Re: numpy + pyspark

2014-06-27 Thread Shannon Quinn
I suppose along those lines, there's also Anaconda: 
https://store.continuum.io/cshop/anaconda/


On 6/27/14, 11:13 AM, Nick Pentreath wrote:
Hadoopy uses http://www.pyinstaller.org/ to package things up into an 
executable that should be runnable without root privileges. It says it 
supports numpy.



On Fri, Jun 27, 2014 at 5:08 PM, Shannon Quinn squ...@gatech.edu 
mailto:squ...@gatech.edu wrote:


Would deploying virtualenv on each directory on the cluster be
viable? The dependencies would get tricky but I think this is the
sort of situation it's built for.


On 6/27/14, 11:06 AM, Avishek Saha wrote:

I too felt the same Nick but I don't have root privileges on the
cluster, unfortunately. Are there any alternatives?


On 27 June 2014 08:04, Nick Pentreath nick.pentre...@gmail.com
mailto:nick.pentre...@gmail.com wrote:

I've not tried this - but numpy is a tricky and complex
package with many dependencies on Fortran/C libraries etc.
I'd say by the time you figure out correctly deploying numpy
in this manner, you may as well have just built it into your
cluster bootstrap process, or PSSH install it on each node...


On Fri, Jun 27, 2014 at 4:58 PM, Avishek Saha
avishek.s...@gmail.com mailto:avishek.s...@gmail.com wrote:

To clarify I tried it and it almost worked -- but I am
getting some problems from the Random module in numpy. If
anyone has successfully passed a numpy module (via the
--py-files option) to spark-submit then please let me know.

Thanks !!
Avishek


On 26 June 2014 17:45, Avishek Saha
avishek.s...@gmail.com mailto:avishek.s...@gmail.com
wrote:

Hi all,

Instead of installing numpy in each worker node, is
it possible to
ship numpy (via --py-files option maybe) while
invoking the
spark-submit?

Thanks,
Avishek











Re: Spark standalone network configuration problems

2014-06-27 Thread Shannon Quinn
For some reason, commenting out spark.driver.host and spark.driver.port 
fixed something...and broke something else (or at least revealed another 
problem). For reference, the only lines I have in my spark-defaults.conf 
now:


spark.app.name          myProg
spark.master            spark://192.168.1.101:5060
spark.executor.memory   8g
spark.files.overwrite   true

It starts up, but has problems with machine2. For some reason, machine2 
is having trouble communicating with *itself*. Here are the worker logs 
of one of the failures (there are 10 before it quits):


Spark assembly has been built with Hive, including Datanucleus jars on 
classpath
14/06/27 14:55:13 INFO ExecutorRunner: Launch command: java -cp 
::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar 
-XX:MaxPermSize=128m -Xms8192M -Xmx8192M 
org.apache.spark.executor.CoarseGrainedExecutorBackend 
akka.tcp://spark@machine1:46378/user/CoarseGrainedScheduler 7 
machine2 8 akka.tcp://sparkWorker@machine2:48019/user/Worker 
app-20140627144512-0001
14/06/27 14:56:54 INFO Worker: Executor app-20140627144512-0001/7 
finished with state FAILED message Command exited with code 1 exitStatus 1
14/06/27 14:56:54 INFO LocalActorRef: Message 
[akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] 
from Actor[akka://sparkWorker/deadLetters] to 
Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%40130.49.226.148%3A53561-38#-1924573003] 
was not delivered. [10] dead letters encountered. This logging can be 
turned off or adjusted with configuration settings 
'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/06/27 14:56:54 ERROR EndpointWriter: AssociationError 
[akka.tcp://sparkWorker@machine2:48019] -> 
[akka.tcp://sparkExecutor@machine2:60949]: Error [Association failed 
with [akka.tcp://sparkExecutor@machine2:60949]] [
akka.remote.EndpointAssociationException: Association failed with 
[akka.tcp://sparkExecutor@machine2:60949]
Caused by: 
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
Connection refused: machine2/130.49.226.148:60949

]
14/06/27 14:56:54 INFO Worker: Asked to launch executor 
app-20140627144512-0001/8 for Funtown, USA
14/06/27 14:56:54 ERROR EndpointWriter: AssociationError 
[akka.tcp://sparkWorker@machine2:48019] -> 
[akka.tcp://sparkExecutor@machine2:60949]: Error [Association failed 
with [akka.tcp://sparkExecutor@machine2:60949]] [
akka.remote.EndpointAssociationException: Association failed with 
[akka.tcp://sparkExecutor@machine2:60949]
Caused by: 
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
Connection refused: machine2/130.49.226.148:60949

]
14/06/27 14:56:54 ERROR EndpointWriter: AssociationError 
[akka.tcp://sparkWorker@machine2:48019] -> 
[akka.tcp://sparkExecutor@machine2:60949]: Error [Association failed 
with [akka.tcp://sparkExecutor@machine2:60949]] [
akka.remote.EndpointAssociationException: Association failed with 
[akka.tcp://sparkExecutor@machine2:60949]
Caused by: 
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
Connection refused: machine2/130.49.226.148:60949

]

Port 48019 on machine2 is indeed open, connected, and listening. Any ideas?

Thanks!

Shannon

On 6/27/14, 1:54 AM, sujeetv wrote:

Try to explicitly set set the spark.driver.host property to the master's
IP.
Sujeet



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-standalone-network-configuration-problems-tp8304p8396.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark standalone network configuration problems

2014-06-27 Thread Shannon Quinn
Apologies; can you advise as to how I would check that? I can certainly 
SSH from master to machine2.
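
(One hedged way to check from the master with just the Python standard 
library; note that 60949 is an ephemeral executor port from the log, so the 
test is only meaningful while an executor is being launched. Checking a 
long-lived port such as the worker's 48019 works the same way:)

import socket

addr = ("130.49.226.148", 60949)    # address and port from the worker log
try:
    socket.create_connection(addr, timeout=5).close()
    print("reachable: %s:%d" % addr)
except socket.error as e:
    print("NOT reachable: %s:%d (%s)" % (addr[0], addr[1], e))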


On 6/27/14, 3:22 PM, Sujeet Varakhedi wrote:
Looks like your driver is not able to connect to the remote executor 
on machine2/130.49.226.148:60949. Can 
you check if the master machine can route to 130.49.226.148?


Sujeet


On Fri, Jun 27, 2014 at 12:04 PM, Shannon Quinn squ...@gatech.edu 
mailto:squ...@gatech.edu wrote:


For some reason, commenting out spark.driver.host and
spark.driver.port fixed something...and broke something else (or
at least revealed another problem). For reference, the only lines
I have in my spark-defaults.conf now:

spark.app.name          myProg
spark.master            spark://192.168.1.101:5060
spark.executor.memory   8g
spark.files.overwrite   true

It starts up, but has problems with machine2. For some reason,
machine2 is having trouble communicating with *itself*. Here are
the worker logs of one of the failures (there are 10 before it
quits):


Spark assembly has been built with Hive, including Datanucleus
jars on classpath
14/06/27 14:55:13 INFO ExecutorRunner: Launch command: java
-cp

::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar
-XX:MaxPermSize=128m -Xms8192M -Xmx8192M
org.apache.spark.executor.CoarseGrainedExecutorBackend
akka.tcp://spark@machine1:46378/user/CoarseGrainedScheduler 7
machine2 8 akka.tcp://sparkWorker@machine2:48019/user/Worker
app-20140627144512-0001
14/06/27 14:56:54 INFO Worker: Executor app-20140627144512-0001/7
finished with state FAILED message Command exited with code 1
exitStatus 1
14/06/27 14:56:54 INFO LocalActorRef: Message
[akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying]
from Actor[akka://sparkWorker/deadLetters] to

Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%40130.49.226.148%3A53561-38#-1924573003]
was not delivered. [10] dead letters encountered. This logging can
be turned off or adjusted with configuration settings
'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/06/27 14:56:54 ERROR EndpointWriter: AssociationError
[akka.tcp://sparkWorker@machine2:48019] -
[akka.tcp://sparkExecutor@machine2:60949]: Error [Association
failed with [akka.tcp://sparkExecutor@machine2:60949]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@machine2:60949]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: machine2/130.49.226.148:60949
http://130.49.226.148:60949
]
14/06/27 14:56:54 INFO Worker: Asked to launch executor
app-20140627144512-0001/8 for Funtown, USA
14/06/27 14:56:54 ERROR EndpointWriter: AssociationError
[akka.tcp://sparkWorker@machine2:48019] -
[akka.tcp://sparkExecutor@machine2:60949]: Error [Association
failed with [akka.tcp://sparkExecutor@machine2:60949]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@machine2:60949]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: machine2/130.49.226.148:60949
http://130.49.226.148:60949
]
14/06/27 14:56:54 ERROR EndpointWriter: AssociationError
[akka.tcp://sparkWorker@machine2:48019] -
[akka.tcp://sparkExecutor@machine2:60949]: Error [Association
failed with [akka.tcp://sparkExecutor@machine2:60949]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@machine2:60949]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: machine2/130.49.226.148:60949
http://130.49.226.148:60949
]

Port 48019 on machine2 is indeed open, connected, and listening.
Any ideas?

Thanks!

Shannon

On 6/27/14, 1:54 AM, sujeetv wrote:

Try to explicitly set set the spark.driver.host property to
the master's
IP.
Sujeet



--
View this message in context:

http://apache-spark-user-list.1001560.n3.nabble.com/Spark-standalone-network-configuration-problems-tp8304p8396.html
Sent from the Apache Spark User List mailing list archive at
Nabble.com.







Re: Spark standalone network configuration problems

2014-06-27 Thread Shannon Quinn
I switched which machine was the master and which was the dedicated 
worker, and now it works just fine. I discovered machine2 is on my 
department's DMZ; machine1 is not. I suspect the departmental firewall 
was causing problems, and moving the master to machine2 seems to have 
solved them.


Thank you all very much for your help. I'm sure I'll have other 
questions soon :)


Regards,
Shannon

On 6/27/14, 3:22 PM, Sujeet Varakhedi wrote:
Looks like your driver is not able to connect to the remote executor 
on machine2/130.49.226.148:60949. Can 
you check if the master machine can route to 130.49.226.148?


Sujeet


On Fri, Jun 27, 2014 at 12:04 PM, Shannon Quinn squ...@gatech.edu 
mailto:squ...@gatech.edu wrote:


For some reason, commenting out spark.driver.host and
spark.driver.port fixed something...and broke something else (or
at least revealed another problem). For reference, the only lines
I have in my spark-defaults.conf now:

spark.app.name          myProg
spark.master            spark://192.168.1.101:5060
spark.executor.memory   8g
spark.files.overwrite   true

It starts up, but has problems with machine2. For some reason,
machine2 is having trouble communicating with *itself*. Here are
the worker logs of one of the failures (there are 10 before it
quits):


Spark assembly has been built with Hive, including Datanucleus
jars on classpath
14/06/27 14:55:13 INFO ExecutorRunner: Launch command: java
-cp

::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar
-XX:MaxPermSize=128m -Xms8192M -Xmx8192M
org.apache.spark.executor.CoarseGrainedExecutorBackend
akka.tcp://spark@machine1:46378/user/CoarseGrainedScheduler 7
machine2 8 akka.tcp://sparkWorker@machine2:48019/user/Worker
app-20140627144512-0001
14/06/27 14:56:54 INFO Worker: Executor app-20140627144512-0001/7
finished with state FAILED message Command exited with code 1
exitStatus 1
14/06/27 14:56:54 INFO LocalActorRef: Message
[akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying]
from Actor[akka://sparkWorker/deadLetters] to

Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%40130.49.226.148%3A53561-38#-1924573003]
was not delivered. [10] dead letters encountered. This logging can
be turned off or adjusted with configuration settings
'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/06/27 14:56:54 ERROR EndpointWriter: AssociationError
[akka.tcp://sparkWorker@machine2:48019] -
[akka.tcp://sparkExecutor@machine2:60949]: Error [Association
failed with [akka.tcp://sparkExecutor@machine2:60949]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@machine2:60949]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: machine2/130.49.226.148:60949
http://130.49.226.148:60949
]
14/06/27 14:56:54 INFO Worker: Asked to launch executor
app-20140627144512-0001/8 for Funtown, USA
14/06/27 14:56:54 ERROR EndpointWriter: AssociationError
[akka.tcp://sparkWorker@machine2:48019] -
[akka.tcp://sparkExecutor@machine2:60949]: Error [Association
failed with [akka.tcp://sparkExecutor@machine2:60949]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@machine2:60949]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: machine2/130.49.226.148:60949
http://130.49.226.148:60949
]
14/06/27 14:56:54 ERROR EndpointWriter: AssociationError
[akka.tcp://sparkWorker@machine2:48019] -
[akka.tcp://sparkExecutor@machine2:60949]: Error [Association
failed with [akka.tcp://sparkExecutor@machine2:60949]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@machine2:60949]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: machine2/130.49.226.148:60949
http://130.49.226.148:60949
]

Port 48019 on machine2 is indeed open, connected, and listening.
Any ideas?

Thanks!

Shannon

On 6/27/14, 1:54 AM, sujeetv wrote:

Try to explicitly set set the spark.driver.host property to
the master's
IP.
Sujeet



--
View this message in context:

http://apache-spark-user-list.1001560.n3.nabble.com/Spark-standalone-network-configuration

Re: Spark standalone network configuration problems

2014-06-26 Thread Shannon Quinn
My *best guess* (please correct me if I'm wrong) is that the master 
(machine1) is sending the command to the worker (machine2) with the 
localhost argument as-is; that is, machine2 isn't doing any weird 
address conversion on its end.


Consequently, I've been focusing on the settings of the master/machine1. 
But I haven't found anything to indicate where the localhost argument 
could be coming from. /etc/hosts lists only 127.0.0.1 as localhost; 
spark-defaults.conf list spark.master as the full IP address (not 
127.0.0.1); spark-env.sh on the master also lists the full IP under 
SPARK_MASTER_IP. The *only* place on the master where it's associated 
with localhost is SPARK_LOCAL_IP.


In looking at the logs of the worker spawned on master, it's also 
receiving a spark://localhost:5060 argument, but since it resides on 
the master that works fine. Is it possible that the master is, for some 
reason, passing spark://{SPARK_LOCAL_IP}:5060 to the workers?


That was my motivation behind commenting out SPARK_LOCAL_IP; however, 
that's when the master crashes immediately due to the address already 
being in use.


Any ideas? Thanks!

Shannon

On 6/26/14, 10:14 AM, Akhil Das wrote:

Can you paste your spark-env.sh file?

Thanks
Best Regards


On Thu, Jun 26, 2014 at 7:01 PM, Shannon Quinn squ...@gatech.edu 
mailto:squ...@gatech.edu wrote:


Both /etc/hosts have each other's IP addresses in them. Telneting
from machine2 to machine1 on port 5060 works just fine.

Here's the output of lsof:

user@machine1:~/spark/spark-1.0.0-bin-hadoop2$ lsof -i:5060
COMMAND   PID   USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
java23985 user   30u  IPv6 11092354  0t0  TCP machine1:sip
(LISTEN)
java23985 user   40u  IPv6 11099560  0t0  TCP
machine1:sip-machine1:48315 (ESTABLISHED)
java23985 user   52u  IPv6 11100405  0t0  TCP
machine1:sip-machine2:54476 (ESTABLISHED)
java24157 user   40u  IPv6 11092413  0t0  TCP
machine1:48315-machine1:sip (ESTABLISHED)

Ubuntu seems to recognize 5060 as the standard port for sip;
it's not actually running anything there besides Spark, it just
does a s/5060/sip/g.

Is there something to the fact that every time I comment out
SPARK_LOCAL_IP in spark-env, it crashes immediately upon
spark-submit due to the address already being in use? Or am I
barking up the wrong tree on that one?

Thanks again for all your help; I hope we can knock this one out.

Shannon


On 6/26/14, 9:13 AM, Akhil Das wrote:

Do you have ip machine1 in your workers /etc/hosts
also? If so try telneting from your machine2 to machine1 on port
5060. Also make sure nothing else is running on port 5060 other
than Spark (*/lsof -i:5060/*)

Thanks
Best Regards


On Thu, Jun 26, 2014 at 6:35 PM, Shannon Quinn squ...@gatech.edu
mailto:squ...@gatech.edu wrote:

Still running into the same problem. /etc/hosts on the master
says

127.0.0.1    localhost
ip    machine1

ip is the same address set in spark-env.sh for
SPARK_MASTER_IP. Any other ideas?


On 6/26/14, 3:11 AM, Akhil Das wrote:

Hi Shannon,

It should be a configuration issue, check in your /etc/hosts
and make sure localhost is not associated with the
SPARK_MASTER_IP you provided.

Thanks
Best Regards


On Thu, Jun 26, 2014 at 6:37 AM, Shannon Quinn
squ...@gatech.edu mailto:squ...@gatech.edu wrote:

Hi all,

I have a 2-machine Spark network I've set up: a master
and worker on machine1, and worker on machine2. When I
run 'sbin/start-all.sh', everything starts up as it
should. I see both workers listed on the UI page. The
logs of both workers indicate successful registration
with the Spark master.

The problems begin when I attempt to submit a job: I get
an address already in use exception that crashes the
program. It says Failed to bind to  and lists the
exact port and address of the master.

At this point, the only items I have set in my
spark-env.sh are SPARK_MASTER_IP and SPARK_MASTER_PORT
(non-standard, set to 5060).

The next step I took, then, was to explicitly set
SPARK_LOCAL_IP on the master to 127.0.0.1. This allows
the master to successfully send out the jobs; however,
it ends up canceling the stage after running this
command several times:

14/06/25 21:00:47 INFO AppClient$ClientActor: Executor
added: app-20140625210032-/8 on
worker-20140625205623-machine2-53597 (machine2:53597)
with 8 cores
14/06/25 21:00:47 INFO SparkDeploySchedulerBackend:
Granted executor ID app-20140625210032-/8

Re: Spark standalone network configuration problems

2014-06-26 Thread Shannon Quinn
In the interest of completeness, this is how I invoke spark:

[on master]

 sbin/start-all.sh
 spark-submit --py-files extra.py main.py

iPhone'd

 On Jun 26, 2014, at 17:29, Shannon Quinn squ...@gatech.edu wrote:
 
 My *best guess* (please correct me if I'm wrong) is that the master 
 (machine1) is sending the command to the worker (machine2) with the localhost 
 argument as-is; that is, machine2 isn't doing any weird address conversion on 
 its end.
 
 Consequently, I've been focusing on the settings of the master/machine1. But 
 I haven't found anything to indicate where the localhost argument could be 
 coming from. /etc/hosts lists only 127.0.0.1 as localhost; 
 spark-defaults.conf list spark.master as the full IP address (not 127.0.0.1); 
 spark-env.sh on the master also lists the full IP under SPARK_MASTER_IP. The 
 *only* place on the master where it's associated with localhost is 
 SPARK_LOCAL_IP.
 
 In looking at the logs of the worker spawned on master, it's also receiving a 
 spark://localhost:5060 argument, but since it resides on the master that 
 works fine. Is it possible that the master is, for some reason, passing 
 spark://{SPARK_LOCAL_IP}:5060 to the workers?
 
 That was my motivation behind commenting out SPARK_LOCAL_IP; however, 
 that's when the master crashes immediately due to the address already being 
 in use.
 
 Any ideas? Thanks!
 
 Shannon
 
 On 6/26/14, 10:14 AM, Akhil Das wrote:
 Can you paste your spark-env.sh file?
 
 Thanks
 Best Regards
 
 
 On Thu, Jun 26, 2014 at 7:01 PM, Shannon Quinn squ...@gatech.edu wrote:
 Both /etc/hosts have each other's IP addresses in them. Telneting from 
 machine2 to machine1 on port 5060 works just fine.
 
 Here's the output of lsof:
 
 user@machine1:~/spark/spark-1.0.0-bin-hadoop2$ lsof -i:5060
 COMMAND   PID   USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
 java23985 user   30u  IPv6 11092354  0t0  TCP machine1:sip (LISTEN)
 java23985 user   40u  IPv6 11099560  0t0  TCP 
 machine1:sip-machine1:48315 (ESTABLISHED)
 java23985 user   52u  IPv6 11100405  0t0  TCP 
 machine1:sip-machine2:54476 (ESTABLISHED)
 java24157 user   40u  IPv6 11092413  0t0  TCP 
 machine1:48315-machine1:sip (ESTABLISHED)
 
 Ubuntu seems to recognize 5060 as the standard port for sip; it's not 
 actually running anything there besides Spark, it just does a s/5060/sip/g.
 
 Is there something to the fact that every time I comment out SPARK_LOCAL_IP 
 in spark-env, it crashes immediately upon spark-submit due to the address 
 already being in use? Or am I barking up the wrong tree on that one?
 
 Thanks again for all your help; I hope we can knock this one out.
 
 Shannon
 
 
 On 6/26/14, 9:13 AM, Akhil Das wrote:
 Do you have ip machine1 in your workers /etc/hosts also? If 
 so try telneting from your machine2 to machine1 on port 5060. Also make 
 sure nothing else is running on port 5060 other than Spark (lsof -i:5060)
 
 Thanks
 Best Regards
 
 
 On Thu, Jun 26, 2014 at 6:35 PM, Shannon Quinn squ...@gatech.edu wrote:
 Still running into the same problem. /etc/hosts on the master says
 
 127.0.0.1    localhost
 ip    machine1
 
 ip is the same address set in spark-env.sh for SPARK_MASTER_IP. Any 
 other ideas?
 
 
 On 6/26/14, 3:11 AM, Akhil Das wrote:
 Hi Shannon,
 
 It should be a configuration issue, check in your /etc/hosts and make 
 sure localhost is not associated with the SPARK_MASTER_IP you provided.
 
 Thanks
 Best Regards
 
 
 On Thu, Jun 26, 2014 at 6:37 AM, Shannon Quinn squ...@gatech.edu wrote:
 Hi all,
 
 I have a 2-machine Spark network I've set up: a master and worker on 
 machine1, and worker on machine2. When I run 'sbin/start-all.sh', 
 everything starts up as it should. I see both workers   
 listed on the UI page. The logs of both workers 
 indicate successful registration with the Spark master.
 
 The problems begin when I attempt to submit a job: I get an address 
 already in use exception that crashes the program. It says Failed to 
 bind to  and lists the exact port and address of the master.
 
 At this point, the only items I have set in my spark-env.sh are 
 SPARK_MASTER_IP and SPARK_MASTER_PORT (non-standard, set to 5060).
 
 The next step I took, then, was to explicitly set SPARK_LOCAL_IP on the 
 master to 127.0.0.1. This allows the master to successfully send out 
 the jobs; however, it ends up canceling the stage after running this 
 command several times:
 
 14/06/25 21:00:47 INFO AppClient$ClientActor: Executor added: 
 app-20140625210032-/8 on worker-20140625205623-machine2-53597 
 (machine2:53597) with 8 cores
 14/06/25 21:00:47 INFO SparkDeploySchedulerBackend: Granted executor ID 
 app-20140625210032-/8 on hostPort machine2:53597 with 8 cores, 8.0 
 GB RAM
 14/06/25 21:00:47 INFO AppClient$ClientActor: Executor updated: 
 app-20140625210032-/8 is now RUNNING
 14/06/25 21:00:49 INFO AppClient

Spark standalone network configuration problems

2014-06-25 Thread Shannon Quinn

Hi all,

I have a 2-machine Spark network I've set up: a master and worker on 
machine1, and worker on machine2. When I run 'sbin/start-all.sh', 
everything starts up as it should. I see both workers listed on the UI 
page. The logs of both workers indicate successful registration with the 
Spark master.


The problems begin when I attempt to submit a job: I get an "address 
already in use" exception that crashes the program. It says "Failed to 
bind to ..." and lists the exact port and address of the master.


At this point, the only items I have set in my spark-env.sh are 
SPARK_MASTER_IP and SPARK_MASTER_PORT (non-standard, set to 5060).


The next step I took, then, was to explicitly set SPARK_LOCAL_IP on the 
master to 127.0.0.1. This allows the master to successfully send out the 
jobs; however, it ends up canceling the stage after running this command 
several times:


14/06/25 21:00:47 INFO AppClient$ClientActor: Executor added: 
app-20140625210032-/8 on worker-20140625205623-machine2-53597 
(machine2:53597) with 8 cores
14/06/25 21:00:47 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20140625210032-/8 on hostPort machine2:53597 with 8 cores, 8.0 
GB RAM
14/06/25 21:00:47 INFO AppClient$ClientActor: Executor updated: 
app-20140625210032-/8 is now RUNNING
14/06/25 21:00:49 INFO AppClient$ClientActor: Executor updated: 
app-20140625210032-/8 is now FAILED (Command exited with code 1)


The /8 started at /1, eventually becomes /9, and then /10, at 
which point the program crashes. The worker on machine2 shows similar 
messages in its logs. Here are the last bunch:


14/06/25 21:00:31 INFO Worker: Executor app-20140625210032-/9 
finished with state FAILED message Command exited with code 1 exitStatus 1
14/06/25 21:00:31 INFO Worker: Asked to launch executor 
app-20140625210032-/10 for app_name
Spark assembly has been built with Hive, including Datanucleus jars on 
classpath
14/06/25 21:00:32 INFO ExecutorRunner: Launch command: java -cp 
::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar 
-XX:MaxPermSize=128m -Xms8192M -Xmx8192M 
org.apache.spark.executor.CoarseGrainedExecutorBackend 
*akka.tcp://spark@localhost:5060/user/CoarseGrainedScheduler* 10 
machine2 8 akka.tcp://sparkWorker@machine2:53597/user/Worker 
app-20140625210032-
14/06/25 21:00:33 INFO Worker: Executor app-20140625210032-/10 
finished with state FAILED message Command exited with code 1 exitStatus 1


I highlighted the part that seemed strange to me; that's the master port 
number (I set it to 5060), and yet it's referencing localhost? Is this 
the reason why machine2 apparently can't seem to give a confirmation to 
the master once the job is submitted? (The logs from the worker on the 
master node indicate that it's running just fine)


I appreciate any assistance you can offer!

Regards,
Shannon Quinn