> I have about 100 TB of data, approximately 180 billion events, in my
> HDFS cluster. It is my raw data stored as GZIP files. At the time of
> setup this was due to "saving the data" until we figured out what to do
> with it.
>
> After attending @t3rmin4t0r's ORC 2015 session @hadoopsummit in Brussels
> last week I was amazed by the results presented.

I run at the very cutting edge of the builds all the time :) The bloom filters are in hive-1.2.0, which is currently sitting in svn/git today.

> I have decided I will move my raw-data into HIVE using ORC and zlib. How
> would you guys recommend I would do that?

The best mechanism is always to write it via a Hive SQL ETL query.

The real question is how exactly the events are organized. Is it a flat structure with something like a single line of JSON for each data item?

That is much easier to process than other data formats, since the gzipped data can be read natively by Hive without any trouble.

The Hive-JSON-Serde is very useful for that, because it allows you to read random data out of the system: each "view" would be an external table enforcing a schema onto a fixed data set (including maps/arrays).

You would create maybe 3-4 of these schema-on-read tables, then insert into your ORC structures from those tables.

If you had binary data, then it would be much easier to write a converter to JSON and then follow the same process, instead of attempting a direct ORC writer, if you want more than one view out of the same table using external tables.

> 2) write a storm-topology to read the parsed_topic and stream them to
> Hive/ORC.

You need to effectively do that to keep a live system running.

We've had some hiccups with the ORC feeder bolt earlier with the <2s ETL speeds (see https://github.com/apache/storm/tree/master/external/storm-hive).

That needs some metastore tweaking to work perfectly (tables to be marked transactional etc), but nothing beyond config params.

> 3) use spark instead of map-reduce. Only, I dont see any benefits in
> doing so with this scenario.

The ORC writers in Spark (even if you merge the PR SPARK-2883) are really slow because they are built against hive-13.x (which was my "before" comparison in all my slides).

I really wish they'd merge those changes into a release, so that I could make ORC+Spark fast.

Cheers,
Gopal
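
For the binary-data case mentioned above, here is a minimal sketch of what a converter to gzipped, one-JSON-object-per-line files might look like, so that the output can sit under an external JSON table. The record layout (an 8-byte timestamp, a 4-byte user id and a 16-byte event name) and the field names are purely hypothetical, since the thread does not describe the actual event format; treat this as an illustration rather than a recommended implementation.

```python
import gzip
import json
import struct

# Hypothetical fixed-width record: int64 timestamp, uint32 user id,
# 16-byte NUL-padded event name. Little-endian, no padding (28 bytes).
RECORD = struct.Struct("<qI16s")


def binary_to_json_lines(raw: bytes):
    """Yield one JSON string per complete fixed-width record in `raw`."""
    for offset in range(0, len(raw) - RECORD.size + 1, RECORD.size):
        ts, user_id, name = RECORD.unpack_from(raw, offset)
        yield json.dumps({
            "ts": ts,
            "user_id": user_id,
            "event": name.rstrip(b"\x00").decode("utf-8"),
        })


def write_gzip_json(raw: bytes, path: str) -> int:
    """Write records as gzipped newline-delimited JSON; return the count."""
    count = 0
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for line in binary_to_json_lines(raw):
            out.write(line + "\n")
            count += 1
    return count
```

Keeping each event on its own line is what lets a line-oriented JSON SerDe map one row per event; the schema itself would still live in the external table definitions, not in the converter.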
