To get shark working on LZO files (I have it up and running with CDH4.4.0), you first need the hadoop-lzo jar on the classpath for shark (and spark). Hadoop-lzo seems to require its native code component, unlike Hadoop itself, which can fall back to non-native code when it can't find the native libraries. So you'll need to add hadoop-lzo's native component to the library path too.
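In shark-env.sh that boils down to two lines. A minimal sketch of what they look like once resolved, assuming the Hadoop install lives under /opt/hadoop-current (that path is an assumption, substitute your own layout):

# sketch only: /opt/hadoop-current is an assumed install location
export SPARK_LIBRARY_PATH="/opt/hadoop-current/lib/native/"       # directory holding the hadoop-lzo native libs (libgplcompression)
export SPARK_CLASSPATH="/opt/hadoop-current/lib/hadoop-lzo.jar"   # the hadoop-lzo jar itself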
Here's the corresponding excerpt from my puppet module. Edit accordingly and put these two lines into your shark-env.sh:

export SPARK_LIBRARY_PATH="<%= scope['common::masterBaseDir'] %>/hadoop-current/lib/native/"
export SPARK_CLASSPATH="<%= scope['common::masterBaseDir'] %>/hadoop-current/lib/hadoop-lzo.jar"

And here's what I have in hadoop-current and hadoop-current/lib/native:

[user@machine hadoop-current]$ ls
bin   hadoop-ant-2.0.0-mr1-cdh4.4.0.jar   hadoop-examples-2.0.0-mr1-cdh4.4.0.jar  hadoop-tools-2.0.0-mr1-cdh4.4.0.jar  lib      logs  webapps
conf  hadoop-core-2.0.0-mr1-cdh4.4.0.jar  hadoop-test-2.0.0-mr1-cdh4.4.0.jar      include                              libexec  sbin
[user@machine hadoop-current]$ ls lib/native/
libgplcompression.a  libgplcompression.la  libgplcompression.so  libgplcompression.so.0  libgplcompression.so.0.0.0  Linux-amd64-64
[user@machine hadoop-current]$

Does that help?

Andrew

On Wed, Jan 8, 2014 at 7:02 AM, [email protected] <[email protected]> wrote:

> Hi,
> I run a query from shark and it reads compressed data from hdfs, but
> spark couldn't find the native-lzo lib.
>
> 14/01/08 22:58:21 ERROR executor.Executor: Exception in task ID 286
> java.lang.RuntimeException: native-lzo library not available
>         at com.hadoop.compression.lzo.LzoCodec.getDecompressorType(LzoCodec.java:175)
>         at org.apache.hadoop.hive.ql.io.CodecPool.getDecompressor(CodecPool.java:122)
>         at org.apache.hadoop.hive.ql.io.RCFile$Reader.init(RCFile.java:1299)
>         at org.apache.hadoop.hive.ql.io.RCFile$Reader.<init>(RCFile.java:1139)
>         at org.apache.hadoop.hive.ql.io.RCFile$Reader.<init>(RCFile.java:1118)
>         at org.apache.hadoop.hive.ql.io.RCFileRecordReader.<init>(RCFileRecordReader.java:52)
>         at org.apache.hadoop.hive.ql.io.RCFileInputFormat.getRecordReader(RCFileInputFormat.java:57)
>         at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:93)
>         at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:83)
>         at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:51)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:237)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:226)
>         at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:29)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:237)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:226)
>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:36)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:237)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:226)
>         at org.apache.spark.rdd.UnionPartition.iterator(UnionRDD.scala:29)
>         at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:69)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:237)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:226)
>         at org.apache.spark.rdd.MapPartitionsWithIndexRDD.compute(MapPartitionsWithIndexRDD.scala:40)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:237)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:226)
>         at org.apache.spark.rdd.MapPartitionsWithIndexRDD.compute(MapPartitionsWithIndexRDD.scala:40)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:237)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:226)
>         at org.apache.spark.scheduler.ResultTask.run(ResultTask.scala:99)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:158)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
>         at java.lang.Thread.run(Thread.java:662)
>
> Can anyone give me a hint?
>
> thank you!
>
> ------------------------------
> [email protected]
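P.S. If the exception in the quoted message persists after setting those two variables, it's worth confirming on the worker nodes that the native library files are really in the directory SPARK_LIBRARY_PATH points at. A quick check, with the path being an assumption matching the layout above:

# run on each worker; adjust the path to whatever SPARK_LIBRARY_PATH is set to
ls -l /opt/hadoop-current/lib/native/libgplcompression.so*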
