I started an AWS cluster (1 master + 3 core nodes) and downloaded the prebuilt
Spark binary. I also downloaded the latest Anaconda Python and started an IPython
notebook server by running the command below:
ipython notebook --port 9999 --profile nbserver --no-browser
Then I tried to develop a Spark application that runs on top of YARN,
working interactively in the IPython notebook.
Here is the code that I have written:
import sys
import os

# Make the PySpark and Py4J sources importable before importing pyspark
sys.path.append('/home/hadoop/myuser/spark-1.3.1-bin-hadoop2.4/python')
sys.path.append('/home/hadoop/myuser/spark-1.3.1-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip')
from pyspark import SparkContext, SparkConf

# Point Spark at the YARN configuration and the Spark installation
os.environ["YARN_CONF_DIR"] = "/home/hadoop/conf"
os.environ["SPARK_HOME"] = "/home/hadoop/bwang/spark-1.3.1-bin-hadoop2.4"
conf = (SparkConf()
        .setMaster("yarn-client")
        .setAppName("Spark ML")
        .set("spark.executor.memory", "2g"))
sc = SparkContext(conf=conf)
data = sc.textFile("hdfs://ec2-xx.xx.xx.xxxx.compute-1.amazonaws.com:8020/data/*")
data.count()
The code works all the way up to the count, which then fails with
"com.hadoop.compression.lzo.LzoCodec not found".
The full log is here: <http://www.wepaste.com/sparkcompression/>
I did some searching, and the problem seems to be that Spark cannot access
the LzoCodec library.
I have tried using os.environ to set SPARK_CLASSPATH and
SPARK_LIBRARY_PATH so that they include hadoop-lzo.jar, which is located at
"./home/hadoop/.versions/2.4.0-amzn-4/share/hadoop/common/lib/hadoop-lzo.jar"
in AWS Hadoop. However, it still does not work.
Can anyone show me how to solve this problem?
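
For reference, the environment-variable attempt mentioned above can be sketched roughly as follows. This is a minimal sketch, not the exact notebook cell: the helper name add_to_env_path is made up for illustration, the jar path is copied as quoted above, and it assumes the variables are set before the SparkContext is created.

```python
import os

# Jar location exactly as quoted in the post (not verified here).
LZO_JAR = "./home/hadoop/.versions/2.4.0-amzn-4/share/hadoop/common/lib/hadoop-lzo.jar"


def add_to_env_path(name, entry, env=None):
    """Append `entry` to a path-style variable (hypothetical helper).

    `env` defaults to os.environ; pass a plain dict to try it out
    without touching the real environment.
    """
    env = os.environ if env is None else env
    # Split the current value on the platform path separator, dropping empties.
    parts = [p for p in env.get(name, "").split(os.pathsep) if p]
    if entry not in parts:
        parts.append(entry)
    env[name] = os.pathsep.join(parts)
    return env[name]


# Set both variables named in the post, before SparkContext(conf=conf) is called.
for var in ("SPARK_CLASSPATH", "SPARK_LIBRARY_PATH"):
    add_to_env_path(var, LZO_JAR)
```

Even with both variables set this way, data.count() still reports the same LzoCodec error.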