It is worth mentioning that this is 100 TB raw size, approximately 19 TB with
gzip -9 (best/slowest compression).
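
As a rough sanity check on those figures, the arithmetic can be sketched as below. It only uses the numbers quoted in this thread (100 TB raw, 19 TB gzipped, 180 billion events); treating 1 TB as 10**12 bytes is an assumption, and the function names are purely illustrative.

```python
# Back-of-the-envelope numbers from this thread.
# Assumption: decimal units, 1 TB = 10**12 bytes.
TB = 10**12

raw_bytes = 100 * TB       # uncompressed size
gzip_bytes = 19 * TB       # size on disk with gzip -9
events = 180 * 10**9       # approximate event count


def compression_ratio(raw, compressed):
    """How many times smaller the compressed data is than the raw data."""
    return raw / compressed


def avg_event_size(total_bytes, n_events):
    """Average number of bytes per event."""
    return total_bytes / n_events


ratio = compression_ratio(raw_bytes, gzip_bytes)
raw_per_event = avg_event_size(raw_bytes, events)
gz_per_event = avg_event_size(gzip_bytes, events)

print(f"compression ratio: {ratio:.2f}x")
print(f"average raw event: {raw_per_event:.0f} bytes")
print(f"average gzipped event: {gz_per_event:.0f} bytes")
```

So gzip gives a bit more than 5x, and an average event is a little over half a kilobyte before compression.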

On Wed, Apr 22, 2015 at 2:50 PM, Kjell Tore Fossbakk <[email protected]>
wrote:

> Hello [email protected]
>
> I have about 100 TB of data, approximately 180 billion events, in my HDFS
> cluster. It is my raw data stored as GZIP files. At the time of setup this
> was due to "saving the data" until we figured out what to do with it.
>
> After attending @t3rmin4t0r's ORC 2015 session @hadoopsummit in Brussels
> last week I was amazed by the results presented. I did test Hive with ORC
> some time around May - August last year but had some issues with e.g.
> partitioning, bucketing and streaming data into ORC while also updating the
> row indexes. In addition, @t3rmin4t0r also presented the use of bloom
> filters.
>
> I have decided I will move my raw data into Hive using ORC and zlib. How
> would you guys recommend I do that? We have a setup for our stream
> processing which takes the same data and puts it into Kafka. Then one Storm
> topology parses each event into a JSON format which we move back to another
> Kafka topic. We then consume this parsed_topic to put the data into e.g.
> Elasticsearch etc.
>
> Due to the nature of the size of my data I only have 2-3 weeks of data in
> Kafka. So it is not an option to just reset the offsets and use storm on
> the data inside Kafka to stream them to Hive/ORC. I think with regards to
> speed map-reduce would probably do this faster than pushing it through
> storm. However, later I will add a storm-topology to read the newly
> created events from the parsed_topic and stream them into Hive/ORC.
>
> My options:
> 1) write a map-reduce job which reads the GZIP files in HDFS, imports my
> Java libs to parse each event line, and puts the events into Hive/ORC.
> 2) write a storm-topology to read the parsed_topic and stream them to
> Hive/ORC. Which also means I would need to have something which reads the
> GZIP files from HDFS and puts them to Kafka so that all of my on-disk
> events can be processed.
> 3) use spark instead of map-reduce. Only, I don't see any benefits in doing
> so with this scenario.
>
> Thoughts? Insight?
>
> Thanks,
> Kjell Tore
>
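
For what it is worth, the per-file step in option 1 can be sketched roughly as below. This is only a minimal illustration in Python, not the actual job: the real parsing lives in the Java libs mentioned above, the tab-separated "timestamp, message" line format is a made-up assumption, and `parse_line`, `parse_gzip_file` and `to_json_lines` are hypothetical names. Writing the result into Hive/ORC is not shown. One thing to keep in mind is that gzip files are not splittable, so each file is read start to finish by a single task.

```python
import gzip
import json


def parse_line(line):
    """Parse one raw event line into a dict.

    Assumption: a hypothetical "<timestamp>\t<message>" layout; the real
    format and parser are not given in this thread.
    """
    timestamp, _, message = line.rstrip("\n").partition("\t")
    return {"timestamp": timestamp, "message": message}


def parse_gzip_file(path):
    """Stream one gzip file and yield a parsed event per non-empty line."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_line(line)


def to_json_lines(records):
    """Serialize parsed events as JSON strings, one per event."""
    return [json.dumps(record, sort_keys=True) for record in records]
```

Because the file is streamed line by line, memory use stays flat regardless of file size; the throughput limit is one reader per gzip file.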
