Hello.

Thank you for the information and tips.

I will try a UDF with inspiration from get_json_object().

Thanks,
Kjell Tore
On 22 Apr 2015 at 22:00, "Gopal Vijayaraghavan" <[email protected]> wrote:

>
> > In production we run HDP 2.2.4. Any thought when crazy stuff like bloom
> >filters might move to GA?
>
> I'd say that it will be in the next release, considering it is already
> checked into hive-trunk.
>
> Bloom filters aren't too crazy today. They are written within the ORC file
> right next to the row-index data, so there are no staleness issues today,
> and beyond that they're fairly well-understood structures.
>
> I'm working through "bad use" safety scenarios like someone searching for
> "11" (as a string) in a data-set which contains doubles.
>
> Hive FilterOperator casts this dynamically, but the ORC PPD has to do
> those type promotions exactly as Hive would do in FilterOperator throughout
> the bloom filter checks.
>
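To make the "11"-versus-double case concrete, here is a minimal sketch. It assumes a toy Bloom filter, not ORC's actual implementation; ToyBloomFilter, PromotionDemo and the hash functions are all hypothetical names made up for illustration. The point is only that a string literal has to be promoted to the column's type before it is hashed, otherwise the probe will usually miss a value that is really there.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Toy Bloom filter for illustration only; NOT the ORC implementation.
final class ToyBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    ToyBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // 64-bit finalizer (splitmix64 style) to spread the input bits.
    private static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    // Hash a double by its IEEE-754 bit pattern.
    static long hashDouble(double v) {
        return mix(Double.doubleToLongBits(v));
    }

    // Hash a string by its UTF-8 bytes (FNV-1a, then mixed).
    static long hashString(String s) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return mix(h);
    }

    void add(long hash) {
        int h1 = (int) hash;
        int h2 = (int) (hash >>> 32);
        for (int i = 0; i < hashCount; i++) {
            bits.set(Math.floorMod(h1 + i * h2, size));
        }
    }

    boolean mightContain(long hash) {
        int h1 = (int) hash;
        int h2 = (int) (hash >>> 32);
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(Math.floorMod(h1 + i * h2, size))) {
                return false;
            }
        }
        return true;
    }
}

final class PromotionDemo {
    // Naive probe: hashes the literal as a string, so it looks for a
    // different hash than the one stored for the double 11.0.
    static boolean probeNaive(ToyBloomFilter f, String literal) {
        return f.mightContain(ToyBloomFilter.hashString(literal));
    }

    // Promoted probe: converts the literal to the column type first.
    // Assumption: a plain parseDouble stands in for Hive's promotion rules.
    static boolean probePromoted(ToyBloomFilter f, String literal) {
        return f.mightContain(ToyBloomFilter.hashDouble(Double.parseDouble(literal)));
    }

    public static void main(String[] args) {
        ToyBloomFilter f = new ToyBloomFilter(1024, 3);
        f.add(ToyBloomFilter.hashDouble(11.0));
        f.add(ToyBloomFilter.hashDouble(42.5));
        System.out.println("naive probe:    " + probeNaive(f, "11"));
        System.out.println("promoted probe: " + probePromoted(f, "11"));
    }
}
```

The promoted probe is guaranteed to find 11.0, because a Bloom filter has no false negatives. The naive probe would wrongly let a reader skip rows that match, which is the correctness risk described above.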
> Calling something production-ready needs that sort of work, rather than
> the feature's happy path of best performance.
>
>
> > The data is single-line text events. Nothing fancy, no multiline or any
> >binary. Each event is 200 - 800 bytes long.
> > The events come in 5 formats (depending on which application
> >produces them) and none are JSON. I wrote a small lib with 5 Java classes
> > which implement parse(String raw) and return a JSONObject - utilized in
> >my Storm bolts.
>
> You could define that as a regular 1 column TEXTFILE and use a non-present
> character as a delimiter (like ^A), which means you should be able to do
> something like
>
> select t.x.a, t.x.b, t.x.c from (select parse_my_format(line) as x from
> raw_text_table) t;
>
> A UDF is massively easier to write than a SerDe.
>
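A minimal sketch of the parsing core that a UDF like parse_my_format could wrap. The real event formats are not described in this thread, so the "key=value" layout, the EventParser class and the field names are all hypothetical. The Hive UDF wrapper itself is left out because it needs the Hive libraries on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for single-line events such as "a=1 b=foo c=3".
// The real formats are unknown; this only shows the shape of the idea:
// one raw line in, a set of named fields out.
final class EventParser {
    static Map<String, String> parse(String raw) {
        // LinkedHashMap keeps fields in the order they appear in the line.
        Map<String, String> fields = new LinkedHashMap<>();
        for (String token : raw.trim().split("\\s+")) {
            int eq = token.indexOf('=');
            // Skip tokens without a key, e.g. stray text or "=value".
            if (eq > 0) {
                fields.put(token.substring(0, eq), token.substring(eq + 1));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parse("a=1 b=foo c=3"));
    }
}
```

A UDF built on this would return the fields as a struct, so the outer query can pick out the a, b and c fields as in the select above.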
> I effectively do something similar with get_json_object() to extract 1
> column out (FWIW, Tez SimpleHistoryLogging writes out a Hive table).
>
>
> > So I need to write my own format reader, a custom SerDe - specifically
> >the Deserializer part? Then 5 schema-on-read external tables using my
> >custom SerDe.
> ...
> > That doesn't sound too bad! I expect bugs :)
>
> Well, the UDF returning a Struct is an alternative to writing a SerDe.
>
> > This all is just to catch up and clean our historical, garbage bin of
> >data which piled up while we got Kafka - Storm - Elasticsearch running :-)
>
> One problem at a time, I guess.
>
> If any of this needs help, that's the sort of thing this list exists for.
>
> Cheers,
> Gopal
>
>
>
>
