Hello, and thank you for the information and tips.
I will try a UDF, taking inspiration from get_json_object().

Thanks,
Kjell Tore

On 22 Apr 2015 at 22:00, "Gopal Vijayaraghavan" <[email protected]> wrote:

> > In production we run HDP 2.2.4. Any thought when crazy stuff like bloom
> > filters might move to GA?
>
> I'd say that it will be in the next release, considering it is already
> checked into hive-trunk.
>
> Bloom filters aren't too crazy today. They are written within the ORC file
> right next to the row-index data, so that there's no staleness issues with
> this today & after that they're fairly well-understood structures.
>
> I'm working through "bad use" safety scenarios like someone searching for
> "11" (as a string) in a data-set which contains doubles.
>
> Hive FilterOperator casts this dynamically, but the ORC PPD has to do
> those type promotions exactly as hive would do in FilterOperator throughout
> the bloom filter checks.
>
> Calling something production-ready needs that sort of work, rather than
> the feature's happy path of best performance.
>
> > The data is single-line text events. Nothing fancy, no multiline or any
> > binary. Each event is 200 - 800 bytes long.
> > The format of these events are in 5 types (from which application
> > produce them) and none are JSON. I wrote a small lib with 5 Java classes
> > which interface parse(String raw) and return a JSONObject - utilized in
> > my Storm bolts.
>
> You could define that as a regular 1 column TEXTFILE and use a non-present
> character as a delimiter (like ^A), which means you should be able to do
> something like
>
> select x.a, x.b, x.c from (select parse_my_format(line) as x from
> raw_text_table);
>
> a UDF is massively easier to write than a SerDe.
>
> I effectively do something similar with get_json_object() to extract 1
> column out (FWIW, Tez SimpleHistoryLogging writes out a Hive table).
>
> > So I need to write my own format reader, a custom SerDe - specifically
> > the Deserializer part? Then 5 schema-on-read external tables using my
> > custom SerDe.
> ...
> > That doesn't sound too bad! I expect bugs :)
>
> Well, the UDF returning a Struct is an alternative to writing a SerDe.
>
> > This all is just to catch up and clean our historical, garbage bin of
> > data which piled up while we got Kafka - Storm - Elasticsearch running :-)
>
> One problem at a time, I guess.
>
> If any of this needs help, that's the sort of thing this list exists for.
>
> Cheers,
> Gopal
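
For reference, here is a minimal sketch of the parsing core that a parse_my_format() UDF along the lines suggested above might wrap. Everything specific in it is an assumption: the five real event formats are not described in this thread, so the layout "<a> <b> <free text as c>", the class name ParseMyFormatSketch and the field names a, b, c (borrowed from the example query) are purely illustrative. The Hive wiring is deliberately left out so that the sketch runs with nothing but the JDK. In a real UDF this logic would sit inside a Hive UDF class, and returning a struct would most likely mean using the GenericUDF API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Sketch only: parses ONE hypothetical event layout, "<a> <b> <rest as c>".
 * The real formats are unknown; names and layout are assumptions.
 */
final class ParseMyFormatSketch {

    /** Returns the named fields in order, or null when the line does not match. */
    static Map<String, String> parse(String raw) {
        if (raw == null) {
            return null;
        }
        // Split into at most three parts so the free-text tail keeps its spaces.
        String[] parts = raw.trim().split("\\s+", 3);
        if (parts.length < 3) {
            // Malformed line: return null so a caller can map it to NULL
            // instead of failing the whole query.
            return null;
        }
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("a", parts[0]);
        fields.put("b", parts[1]);
        fields.put("c", parts[2]);
        return fields;
    }

    public static void main(String[] args) {
        // Made-up sample line, not taken from the real data.
        System.out.println(parse("2015-04-22T22:00:00Z WARN disk usage at 91%"));
    }
}
```

Returning null for lines that do not match is a design choice for the historical clean-up case: one garbage line then shows up as a NULL row rather than killing the job.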
