Hi,
> If you have any tutorial for extracting data from complex nested json
>arrays (as the example given in my previous email), please send it.
90% of working with the real world is cleansing bad data. People
under-sell Hive's flexibility in situations like this.
This is what I do:
hive> compile `
import org.apache.hadoop.hive.ql.exec.UDF \;
import groovy.json.JsonSlurper \;
import org.apache.hadoop.io.Text \;
public class JsonExtract extends UDF {
  public int evaluate(Text a) {
    def jsonSlurper = new JsonSlurper() \;
    def obj = jsonSlurper.parseText(a.toString()) \;
    return obj.val1 \;
  }
} ` AS GROOVY NAMED json_extract.groovy;
hive> CREATE TEMPORARY FUNCTION json_extract as 'JsonExtract';
hive> select json_extract('{"val1": 2}') from date_dim limit 1;
select json_extract('{"val1": 2}') from date_dim limit 1
OK
2
Time taken: 0.13 seconds, Fetched: 1 row(s)
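For nested arrays, the same JsonSlurper object can be walked with ordinary Groovy indexing and closures. Below is a minimal standalone sketch (run it with plain groovy, outside Hive) to show the idea. The document shape and the field names orders, items and sku are made up for illustration, since I don't have your earlier example in front of me; adapt them to your data.

```groovy
import groovy.json.JsonSlurper

// Hypothetical input: these field names are invented for illustration only.
def json = '{"orders": [{"id": 1, "items": [{"sku": "a"}, {"sku": "b"}]}, {"id": 2, "items": [{"sku": "c"}]}]}'

// parseText returns nested Maps and Lists that mirror the JSON structure.
def obj = new JsonSlurper().parseText(json)

// Index into nested arrays directly: second item of the first order.
def firstSku = obj.orders[0].items[1].sku

// Or flatten a value out of every nested element with collectMany.
def allSkus = obj.orders.collectMany { order -> order.items.collect { it.sku } }

println firstSku
println allSkus
```

The body of evaluate() in the UDF could probably do the same kind of navigation and return whichever piece you need, with the return type changed to match.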
Caveats: this generates bytecode at runtime, so keep an eye on the output of
hive> list jars;
Because there's no real namespacing, naming your classes/functions the
same while developing can drive you crazy (a little).
Cheers,
Gopal