> I have about 100 TB of data, approximately 180 billion events, in my
>HDFS cluster. It is my raw data stored as GZIP files. At the time of
>setup this was due to "saving the data" until we figured out what to do
>with it.
>
> After attending @t3rmin4t0r's ORC 2015 session @hadoopsummit in Brussels
>last week I was amazed by the results presented.

I run at the very cutting edge of the builds all the time :)

The bloom filters are there in hive-1.2.0 which is currently sitting in
svn/git today.


> I have decided I will move my raw-data into HIVE using ORC and zlib. How
>would you guys recommend I would do that?

The best mechanism is always to write it via a Hive SQL ETL query.

The real question is how the events are exactly organized. Is it a flat
structure with something like a single line of JSON for each data item?

That is much more easy to process than other data formats - the gzipped
data can be natively read by Hive without any trouble.

The Hive-JSON-Serde is very useful for that, because it allows you to read
random data out of the system - each ³view² would be an external table
enforcing a schema onto a fixed data set (including maps/arrays).

You would create maybe 3-4 of these schema-on-read tables, then insert
into your ORC structures from those tables.

If you had binary data, then it would be much easier to write a convertor
to JSON & then follow the same process as well instead of attempting a
direct ORC writer, if you want >1 views out of the same table using
external tables.

> 2) write a storm-topology to read the parsed_topic and stream them to
>Hive/ORC.

You need to effectively do that to keep a live system running.

We¹ve had some hiccups with the ORC feeder bolt earlier with the <2s ETL
speeds (see 
https://github.com/apache/storm/tree/master/external/storm-hive).

That needs some metastore tweaking to work perfectly (tables to be marked
transactional etc), but nothing beyond config params.

> 3) use spark instead of map-reduce. Only, I dont see any benefits in
>doing so with this scenario.

The ORC writers in Spark (even if you merge the PR SPARK-2883) are really
slow because they are built against hive-13.x (which was my ³before²
comparison in all my slides).

I really wish they¹d merge those changes into a release, so that I could
make ORC+Spark fast.


Cheers,
Gopal 


Reply via email to