I started an AWS cluster (1 master + 3 core nodes) and downloaded the prebuilt
Spark binary. I downloaded the latest Anaconda Python and started an IPython
notebook server by running the command below:

    ipython notebook --port 9999 --profile nbserver --no-browser

Then I tried to develop a Spark application running on top of YARN
interactively in the IPython notebook. Here is the code I have written:

    import sys
    import os

    # Make the PySpark and Py4J libraries importable and point Spark at the YARN config.
    sys.path.append('/home/hadoop/myuser/spark-1.3.1-bin-hadoop2.4/python')
    sys.path.append('/home/hadoop/myuser/spark-1.3.1-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip')
    os.environ["YARN_CONF_DIR"] = "/home/hadoop/conf"
    os.environ["SPARK_HOME"] = "/home/hadoop/bwang/spark-1.3.1-bin-hadoop2.4"

    from pyspark import SparkContext, SparkConf

    # Run in yarn-client mode with 2 GB per executor.
    conf = (SparkConf()
            .setMaster("yarn-client")
            .setAppName("Spark ML")
            .set("spark.executor.memory", "2g")
           )
    sc = SparkContext(conf=conf)

    # Read the data from HDFS and count the records.
    data = sc.textFile("hdfs://ec2-xx.xx.xx.xxxx.compute-1.amazonaws.com:8020/data/*")
    data.count()

The code works all the way up to the count, which fails with
"com.hadoop.compression.lzo.LzoCodec not found". The full log is here:
http://www.wepaste.com/sparkcompression/

I did some searching, and the problem is essentially that Spark cannot find
the LzoCodec library.

I have tried using os.environ to set SPARK_CLASSPATH and SPARK_LIBRARY_PATH
to include hadoop-lzo.jar, which is located at
"/home/hadoop/.versions/2.4.0-amzn-4/share/hadoop/common/lib/hadoop-lzo.jar"
in the AWS Hadoop install. However, it is still not working.

Can anyone show me how to solve this problem?
