Hello,

I have about 100 TB of data, approximately 180 billion events, in my HDFS cluster. It is my raw data, stored as GZIP files. At setup time the goal was simply "saving the data" until we figured out what to do with it.

After attending @t3rmin4t0r's ORC 2015 session @hadoopsummit in Brussels last week, I was amazed by the results presented. I tested Hive with ORC sometime between May and August last year but ran into issues with, e.g., partitioning, bucketing and streaming data into ORC while also updating the row indexes. In addition, @t3rmin4t0r presented the use of Bloom filters.

I have decided to move my raw data into Hive using ORC and zlib. How would you recommend I do that?

We have a setup for our stream processing which takes the same data and puts it into Kafka. One Storm topology then parses each event into a JSON format, which we write back to another Kafka topic. We then consume this parsed_topic to put the data into, e.g., Elasticsearch.

Because of the size of my data, I only have 2-3 weeks of data in Kafka, so it is not an option to just reset the offsets and use Storm on the data inside Kafka to stream it to Hive/ORC. In terms of speed, I think MapReduce would probably do this faster than pushing everything through Storm. Later, however, I will add a Storm topology to read newly created events from the parsed_topic and stream them into Hive/ORC.

My options:

1) Write a MapReduce job which reads the GZIP files in HDFS, imports my Java libs to parse each event line, and writes the events to Hive/ORC.

2) Write a Storm topology to read the parsed_topic and stream the events to Hive/ORC. This also means I would need something that reads the GZIP files from HDFS and puts them into Kafka so that all of my on-disk events get processed.

3) Use Spark instead of MapReduce. However, I don't see any benefit in doing so in this scenario.

Thoughts? Insight?

Thanks,
Kjell Tore
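
P.S. For illustration, the per-file step that options 1 and 3 have in common (decompress a GZIP file, then parse each line into a JSON record) might look roughly like the sketch below. It is in Python only to keep it short. `parse_event` and the tab-separated input layout are hypothetical stand-ins for the existing Java parsing libs, and the actual write to Hive/ORC is left out entirely.

```python
import gzip
import json
from typing import Iterator


def parse_event(line: str) -> dict:
    """Hypothetical parser: assumes 'timestamp<TAB>source<TAB>message' lines.

    The real event format and parsing logic live in the existing Java libs.
    """
    timestamp, source, message = line.rstrip("\n").split("\t", 2)
    return {"timestamp": timestamp, "source": source, "message": message}


def parsed_events(path: str) -> Iterator[str]:
    """Yield one JSON string per non-blank raw event line in a GZIP file."""
    # Text mode ("rt") decompresses and decodes the stream line by line,
    # so the whole file never has to fit in memory.
    with gzip.open(path, mode="rt", encoding="utf-8") as raw:
        for line in raw:
            if line.strip():  # skip blank lines
                yield json.dumps(parse_event(line))
```

One thing worth keeping in mind with this shape: a GZIP file cannot be split, so each file would be handled by a single task regardless of which of the options above drives it.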
