Hi, thanks a lot for your help; with it, my Hive on Spark setup now works well.
It took me a long time to install and deploy, so here is some advice: I think
we need to improve the installation documentation so that users can compile
and install it in the least amount of time.
1) Say which Spark version users should pick from the Spark GitHub repo if they
build Spark themselves instead of downloading a pre-built one, and give them
the right build command (without -Pyarn and -Phive).
2) If they hit an error during the build, such as:
[ERROR] /hive/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/status/SparkJobStatus.java:[22,24] cannot find symbol
[ERROR] symbol: class JobExecutionStatus
tell them what they can do.
Users need to get it working first before they can judge whether it is good or bad.
If you need, I can add something to the getting-started document.
thanks
yuemeng
On 2014/12/3 11:03, Xuefu Zhang wrote:
When you build Spark, remove -Phive as well as -Pyarn. When you run
hive queries, you may need to run "set spark.home=/path/to/spark/dir";
Thanks,
Xuefu
On Tue, Dec 2, 2014 at 6:29 PM, yuemeng1 <yueme...@huawei.com> wrote:
Hi Xuefu, thanks a lot for your help. Here are more details to reproduce this issue:
1) I checked out the spark branch of Hive from GitHub
(https://github.com/apache/hive/tree/spark) on Nov 29, because the current
revision fails with: Caused by: java.lang.RuntimeException: Unable to instantiate
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.
The build command was: mvn clean package -DskipTests -Phadoop-2 -Pdist
After the build I took the package apache-hive-0.15.0-SNAPSHOT-bin.tar.gz
from /home/ym/hive-on-spark/hive1129/hive/packaging/target.
2) I checked out Spark from
https://github.com/apache/spark/tree/v1.2.0-snapshot0. Because Spark
branch-1.2 carries the parent version 1.2.1-SNAPSHOT, I chose v1.2.0-snapshot0
instead. I compared this Spark's pom.xml with spark-parent-1.2.0-SNAPSHOT.pom
(from
http://ec2-50-18-79-139.us-west-1.compute.amazonaws.com/data/spark_2.10-1.2-SNAPSHOT/org/apache/spark/spark-parent/1.2.0-SNAPSHOT/),
and the only difference is the spark-parent version. The build command was:
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -DskipTests clean package
3) The commands I executed in the Hive shell:
./hive --auxpath /opt/hispark/spark/assembly/target/scala-2.10/spark-assembly-1.2.0-hadoop2.4.0.jar
(this jar was already copied to Hive's lib directory)
create table student(sno int,sname string,sage int,ssex string)
row format delimited FIELDS TERMINATED BY ',';
create table score(sno int,cno int,sage int) row format delimited
FIELDS TERMINATED BY ',';
load data local inpath
'/home/hive-on-spark/temp/spark-1.2.0/examples/src/main/resources/student.txt'
into table student;
load data local inpath
'/home/hive-on-spark/temp/spark-1.2.0/examples/src/main/resources/score.txt'
into table score;
set hive.execution.engine=spark;
set spark.master=spark://10.175.xxx.xxx:7077;
set spark.eventLog.enabled=true;
set spark.executor.memory=9086m;
set spark.serializer=org.apache.spark.serializer.KryoSerializer;
select distinct st.sno,sname from student st join score sc
on(st.sno=sc.sno) where sc.cno IN(11,12,13) and st.sage > 28;
(this query works with the MR engine)
4)
student.txt file:
1,rsh,27,female
2,kupo,28,male
3,astin,29,female
4,beike,30,male
5,aili,31,famle
score.txt file
1,10,80
2,11,85
3,12,90
4,13,95
5,14,100
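For reference, the join query above should return exactly two rows on this sample data. The following plain-Python sketch (no Hive or Spark involved) replays the query logic on the data files as listed, which is handy for checking that the Spark engine produces the same result as MR:

```python
# Sample data copied from student.txt and score.txt above.
students = [  # (sno, sname, sage, ssex)
    (1, "rsh", 27, "female"),
    (2, "kupo", 28, "male"),
    (3, "astin", 29, "female"),
    (4, "beike", 30, "male"),
    (5, "aili", 31, "famle"),
]
scores = [  # (sno, cno, sage)
    (1, 10, 80),
    (2, 11, 85),
    (3, 12, 90),
    (4, 13, 95),
    (5, 14, 100),
]

# select distinct st.sno, sname from student st join score sc
#   on (st.sno = sc.sno) where sc.cno IN (11,12,13) and st.sage > 28
result = sorted({
    (st[0], st[1])
    for st in students
    for sc in scores
    if st[0] == sc[0] and sc[1] in (11, 12, 13) and st[2] > 28
})
print(result)  # [(3, 'astin'), (4, 'beike')]
```

Only sno 3 (astin, age 29) and sno 4 (beike, age 30) satisfy both filters; sno 2 has a matching cno but its age is not greater than 28.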
On 2014/12/2 23:28, Xuefu Zhang wrote:
Could you provide details on how to reproduce the issue, such as
the exact Spark branch, the command used to build Spark, how you
built Hive, and what queries/commands you ran?
We are running Hive on Spark all the time, and our pre-commit tests
run without any issue.
Thanks,
Xuefu
On Tue, Dec 2, 2014 at 4:13 AM, yuemeng1 <yueme...@huawei.com> wrote:
Hi Xuefu,
I checked out a Spark tag from the Spark GitHub repo (v1.2.0-snapshot0)
and compared its pom.xml with spark-parent-1.2.0-SNAPSHOT.pom (from
http://ec2-50-18-79-139.us-west-1.compute.amazonaws.com/data/spark_2.10-1.2-SNAPSHOT/org/apache/spark/spark-parent/1.2.0-SNAPSHOT/).
The only difference is the following:
in spark-parent-1.2.0-SNAPSHOT.pom
<artifactId>spark-parent</artifactId>
<version>1.2.0-SNAPSHOT</version>
and in v1.2.0-snapshot0
<artifactId>spark-parent</artifactId>
<version>1.2.0</version>
I think there is no essential difference, so I built v1.2.0-snapshot0
and deployed it as my Spark cluster.
When I run a query joining two tables, it still gives the error
I showed you earlier:
Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times,
most recent failure: Lost task 0.3 in stage 1.0 (TID 7, datasight18):
java.lang.NullPointerException
at org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:255)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:437)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:430)
at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:587)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:233)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:210)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:99)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
Driver stacktrace:
I don't think my Spark cluster has any problem, so why do I
always get this error?
On 2014/12/2 13:39, Xuefu Zhang wrote:
You need to build your Spark assembly from the Spark 1.2 branch.
This should give you both a Spark build and the
spark-assembly jar, which you need to copy to Hive's lib
directory. A snapshot is fine; Spark 1.2 hasn't been
released yet.
--Xuefu
On Mon, Dec 1, 2014 at 7:41 PM, yuemeng1 <yueme...@huawei.com> wrote:
Hi Xuefu,
thanks a lot for the information. But as far as I know,
the latest Spark version on GitHub is the 1.3 snapshot;
there is no spark-1.2, only branch-1.2 with a 1.2
snapshot version. Can you tell me which Spark version
I should build? For now,
spark-assembly-1.2.0-SNAPSHOT-hadoop2.4.0.jar produces
the error shown below.
On 2014/12/2 11:03, Xuefu Zhang wrote:
It seems the wrong class, HiveInputFormat, is loaded.
The stack trace is way off the current Hive code. You
need to build Spark 1.2 and copy the spark-assembly jar to
Hive's lib directory, and that's it.
--Xuefu
On Mon, Dec 1, 2014 at 6:22 PM, yuemeng1 <yueme...@huawei.com> wrote:
Hi, I built a Hive on Spark package; my Spark
assembly jar is
spark-assembly-1.2.0-SNAPSHOT-hadoop2.4.0.jar.
Before running a query in the Hive shell, I set
everything Hive needs for Spark, and then executed
a join query:
select distinct st.sno,sname from student st join
score sc on(st.sno=sc.sno) where sc.cno
IN(11,12,13) and st.sage > 28;
But it failed with the following error in the Spark web UI:
Job aborted due to stage failure: Task 0 in stage 1.0 failed 4
times, most recent failure: Lost task 0.3 in stage 1.0 (TID 7, datasight18):
java.lang.NullPointerException
at org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:255)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:437)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:430)
at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:587)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:233)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:210)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:99)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
Driver stacktrace:
Can you help me with this problem? I think my build
succeeded!