Hi
I have a five-node CDH 5.3 cluster running on CentOS 6.5, plus a separate
install of Spark 1.3.1. (The CDH 5.3 install ships with Spark 1.2, but I
wanted a newer version.)
I managed to write some Scala code that uses a HiveContext to connect to
Hive and create/populate tables etc. I compiled the application with sbt
and ran it with spark-submit in local mode.
My question concerns UDFs, specifically the row_sequence function in the
hive-contrib jar file, i.e.
hiveContext.sql("""
ADD JAR
/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/jars/hive-contrib-0.13.1-cdh5.3.3.jar
""")
hiveContext.sql("""
CREATE TEMPORARY FUNCTION row_sequence as
'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
""")
val resRDD = hiveContext.sql("""
SELECT row_sequence(),t1.edu FROM
( SELECT DISTINCT education AS edu FROM adult3 ) t1
ORDER BY t1.edu
""")
This seems to generate its sequence during the map (?) phase of execution,
because no matter how I rearrange the main SQL I cannot get an ascending
index for the dimension data, i.e. I always get

1 val1
1 val2
1 val3

instead of

1 val1
2 val2
3 val3
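
My guess at why (an assumption on my part; this is a paraphrase from memory
of the hive-contrib source, not the exact code): UDFRowSequence just keeps a
mutable counter inside the UDF instance, and Spark gives each task its own
instance, so every partition restarts counting at 1. Something like:

import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.LongWritable

// Sketch of what UDFRowSequence does internally: the counter lives in
// the UDF instance, so with one instance per Spark task the sequence
// restarts at 1 in every partition.
class RowSequenceSketch extends UDF {
  private val result = new LongWritable(0)

  def evaluate(): LongWritable = {
    result.set(result.get() + 1) // increments once per row seen by THIS instance
    result
  }
}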
I'm well aware that I can work around this in Scala, and I have, but I
wondered whether others have come across this and solved it?
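
For reference, this is the sort of Scala workaround I mean, pulling the
indexing out of SQL and into an RDD (a sketch against my example above;
the table and column names are mine):

// Get the distinct, sorted dimension values, then index them with
// zipWithIndex, which assigns a stable ascending index across partitions.
val eduRDD = hiveContext.sql("""
  SELECT DISTINCT education FROM adult3 ORDER BY education
""").map(_.getString(0))

val indexed = eduRDD.zipWithIndex.map { case (edu, i) => (i + 1, edu) }

indexed.collect().foreach { case (idx, edu) => println(s"$idx $edu") }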
cheers
Mike F