PySpark definitely works for me in an IPython notebook. A good way to debug is
to set the master to "local" when you create your Python sc object and see if
that works. Then, from there, modify it to point to the real Spark master.
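For example, a minimal sketch of that debugging step (the app name and the
spark://... master URL are placeholders for your own setup):

from pyspark import SparkConf, SparkContext

# Start with a local master to verify the notebook can talk to Spark at all.
conf = SparkConf().setAppName("notebook-debug").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Once that works, swap in the real master URL, e.g.:
# conf = SparkConf().setAppName("notebook-debug").setMaster("spark://<master-host>:7077")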
Also, I added a hack where I did a sys.path.insert of the path to pyspark in my
Python notebook to get it working properly.
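Roughly, that hack looks like the following (the /root/spark paths and the py4j
zip name are whatever ships with your Spark install, so treat them as
placeholders):

import sys

# Make the pyspark package and its bundled py4j importable from the notebook.
sys.path.insert(0, "/root/spark/python")
sys.path.insert(0, "/root/spark/python/lib/py4j-0.8.2.1-src.zip")  # adjust to the py4j zip in your install

from pyspark import SparkContext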
You can try these instructions out if you want; I recently put them together
based on some other stuff online plus a few minor modifications.
http://jayunit100.blogspot.com/2014/07/ipython-on-spark.html
On Thu, Oct 9, 2014 at 2:50 PM, Andy Davidson a...@santacruzintegration.com
wrote:
I wonder if I am starting the iPython notebook incorrectly. The example in my
original email does not work; it looks like stdout is not configured
correctly. If I submit it as a python .py file, it works correctly.
Any idea what the problem is?
Thanks
Andy
From: Andrew Davidson a...@santacruzintegration.com
Date: Tuesday, October 7, 2014 at 4:23 PM
To: user@spark.apache.org user@spark.apache.org
Subject: bug with IPython notebook?
Hi
I think I found a bug in the iPython notebook integration. I am not sure
how to report it.
I am running spark-1.1.0-bin-hadoop2.4 on an AWS EC2 cluster. I start the
cluster using the launch script provided by Spark.
I start the iPython notebook on my cluster master as follows and use an ssh
tunnel to open the notebook in a browser running on my local computer:
[ec2-user@ip-172-31-20-107 ~]$ IPYTHON_OPTS="notebook --pylab inline --no-browser --port=7000" /root/spark/bin/pyspark
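The tunnel itself is something like the following (the key file and the master's
public DNS name are placeholders for your own cluster):
ssh -i my-key.pem -L 7000:localhost:7000 ec2-user@<master-public-dns>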
Below is the code my notebook executes.
Bug list:
1. Why do I need to create a SparkContext? If I run pyspark
interactively, the context is created automatically for me.
2. The print statement causes the output to be displayed in the
terminal where I started pyspark, not in the notebook's output.
Any comments or suggestions would be greatly appreciated
Thanks
Andy
import sys
from operator import add
from pyspark import SparkContext

# only stand-alone jobs should create a SparkContext
sc = SparkContext(appName="pyStreamingSparkRDDPipe")

data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)

def echo(data):
    # output winds up in the shell console on my cluster (i.e. the machine I launched pyspark from)
    print "python received: %s" % (data)

rdd.foreach(echo)

print "we are done"
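(For comparison, a minimal sketch of a variation that prints on the driver
instead of inside foreach, assuming the same rdd as above; since the print runs
in the driver process, its output would appear in the notebook cell rather than
on the workers:)

# hypothetical variation: bring the data back to the driver before printing
for x in rdd.collect():
    print "driver received: %s" % (x)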
--
jay vyas