It is worth mentioning that this is 100 TB raw; it is approximately 19 TB after gzip -9 (best/slowest compression).
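To put those sizes in perspective, here is a rough back-of-the-envelope calculation using only the figures from this thread (100 TB raw, 19 TB gzipped, about 180 billion events). It assumes decimal terabytes, and the variable names are just for illustration.

```python
# Figures taken from this thread; decimal terabytes are an assumption.
TB = 10**12

raw_bytes = 100 * TB        # uncompressed size
gzip_bytes = 19 * TB        # size after gzip -9
events = 180 * 10**9        # approximate number of events

compression_ratio = raw_bytes / gzip_bytes   # how much gzip -9 shrinks the data
raw_per_event = raw_bytes / events           # average uncompressed bytes per event
gzip_per_event = gzip_bytes / events         # average compressed bytes per event

print(f"compression ratio: {compression_ratio:.2f}x")
print(f"raw bytes per event: {raw_per_event:.0f}")
print(f"gzip bytes per event: {gzip_per_event:.0f}")
```

So the events average a bit over half a kilobyte each uncompressed, which is useful to keep in mind when comparing against whatever size ORC with zlib ends up producing.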
On Wed, Apr 22, 2015 at 2:50 PM, Kjell Tore Fossbakk <[email protected]> wrote:
> Hello [email protected]
>
> I have about 100 TB of data, approximately 180 billion events, in my HDFS
> cluster. It is my raw data stored as GZIP files. At the time of setup this
> was due to "saving the data" until we figured out what to do with it.
>
> After attending @t3rmin4t0r's ORC 2015 session @hadoopsummit in Brussels
> last week I was amazed by the results presented. I did test Hive with ORC
> some time around May - August last year but had some issues with e.g.
> partitioning, bucketing and streaming data into ORC while also updating the
> row indexes. In addition, @t3rmin4t0r also presented the use of bloom
> filters.
>
> I have decided I will move my raw data into Hive using ORC and zlib. How
> would you guys recommend I do that? We have a setup for our stream
> processing which takes the same data and puts it into Kafka. Then one Storm
> topology parses each event into a JSON format which we move back to another
> Kafka topic. We then consume this parsed_topic to put the data into e.g.
> Elasticsearch etc.
>
> Due to the size of my data I only have 2-3 weeks of data in
> Kafka. So it is not an option to just reset the offsets and use Storm on
> the data inside Kafka to stream them to Hive/ORC. I think with regards to
> speed map-reduce would probably do this faster than pushing it through
> Storm. However, later I will add a Storm topology to read the newly
> created events from the parsed_topic and stream them into Hive/ORC.
>
> My options:
> 1) write a map-reduce job which reads the GZIP files in HDFS and imports my
> Java libs to parse each line of event and put them to Hive/ORC.
> 2) write a Storm topology to read the parsed_topic and stream them to
> Hive/ORC. Which also means I would need to have something which reads the
> GZIP files from HDFS and puts them to Kafka to enable all of my on-disk
> events to be processed.
> 3) use Spark instead of map-reduce. Only, I don't see any benefits in doing
> so with this scenario.
>
> Thoughts? Insight?
>
> Thanks,
> Kjell Tore
>
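
For what it is worth, the per-line work in option 1 might look roughly like the sketch below. It is only an illustration: the real parsing lives in the Java libs mentioned above, so `parse_event` and its tab-separated "timestamp, message" layout are hypothetical stand-ins, and the sample data is made up. It only shows the shape of "stream a gzip file line by line and emit one JSON record per event"; writing to Hive/ORC is not covered.

```python
import gzip
import io
import json


def parse_event(line: str) -> dict:
    """Hypothetical parser: assumes 'timestamp<TAB>message' lines."""
    timestamp, _, message = line.rstrip("\n").partition("\t")
    return {"timestamp": timestamp, "message": message}


def parse_gzip_stream(fileobj):
    """Yield one JSON string per non-empty line of a gzip-compressed stream."""
    # gzip.open accepts an existing binary file object as well as a path,
    # and decompresses lazily, so the file is never fully held in memory.
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if line.strip():
                yield json.dumps(parse_event(line), sort_keys=True)


# Made-up sample standing in for one of the GZIP files in HDFS.
sample = gzip.compress(
    b"2015-04-22T14:50:00\tlogin ok\n\n2015-04-22T14:50:01\tlogout\n"
)
records = list(parse_gzip_stream(io.BytesIO(sample)))
for record in records:
    print(record)
```

One thing to keep in mind with this shape of job: a gzip file cannot be split, so each file ends up being handled by a single task no matter how large it is.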
