Hello [email protected]

I have about 100 TB of data, approximately 180 billion events, in my HDFS
cluster. It is my raw data, stored as GZIP files. When we set this up, the
idea was simply to keep the data until we figured out what to do with it.

After attending @t3rmin4t0r's ORC 2015 session @hadoopsummit in Brussels
last week I was amazed by the results presented. I did test Hive with ORC
some time around May - August last year but had some issues with e.g.
partitioning, bucketing and streaming data into ORC while also updating the
row indexes. @t3rmin4t0r also presented the use of bloom filters.

I have decided to move my raw data into Hive using ORC with zlib
compression. How would you recommend I do that? We have a setup for our
stream processing which takes the same data and puts it into Kafka. A Storm
topology then parses each event into a JSON format and writes it to another
Kafka topic. We then consume this parsed_topic to put the data into e.g.
Elasticsearch.

Because of the volume, I only keep 2-3 weeks of data in Kafka, so it is not
an option to just reset the offsets and use Storm on the data inside Kafka
to stream it to Hive/ORC. With regard to speed, I think map-reduce would
probably do this faster than pushing everything through Storm. Later,
however, I will add a Storm topology to read the newly created events from
the parsed_topic and stream them into Hive/ORC.

My options:
1) Write a map-reduce job which reads the GZIP files in HDFS, uses my Java
libs to parse each line into an event, and writes the events to Hive/ORC.
2) Write a Storm topology which reads the parsed_topic and streams the
events to Hive/ORC. This also means I would need something which reads the
GZIP files from HDFS and puts them into Kafka, so that all of my on-disk
events get processed.
3) Use Spark instead of map-reduce. However, I don't see any benefit in
doing so in this scenario.
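
To make the shared part of options 1 and 2 concrete, here is a rough sketch
of the read-and-parse step using only the JDK. The class name, the parser
function and the sink are placeholders I made up for illustration; the parser
stands in for my own Java libs, and the sink would be whatever writes to
Hive/ORC or Kafka. It assumes one event per line, UTF-8 encoded, and it reads
from a plain InputStream rather than through the HDFS API.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of Hadoop, Hive or Storm.
final class GzipEventReader {

    /**
     * Reads a gzip-compressed stream line by line, parses each non-empty line
     * with the supplied parser and passes the result to the sink.
     * A parser result of null is treated as "skip this line".
     * Returns the number of events handed to the sink.
     */
    static <T> long readEvents(InputStream gzipped,
                               Function<String, T> parser,
                               Consumer<T> sink) throws IOException {
        long emitted = 0;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(gzipped),
                                      StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // ignore blank lines
                }
                T event = parser.apply(line);
                if (event != null) {
                    sink.accept(event);
                    emitted++;
                }
            }
        }
        return emitted;
    }
}
```

In option 1 this loop would live inside the mapper; in option 2 it would be
the small loader that feeds Kafka. Either way a single GZIP file cannot be
split, so parallelism would come from processing many files at once.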

Thoughts? Insight?

Thanks,
Kjell Tore
