Hey Gopal. Thanks for your answers. I have some follow-ups:

On Wed, Apr 22, 2015 at 3:46 PM, Gopal Vijayaraghavan <[email protected]> wrote:

> > I have about 100 TB of data, approximately 180 billion events, in my
> > HDFS cluster. It is my raw data stored as GZIP files. At the time of
> > setup this was due to "saving the data" until we figured out what to do
> > with it.
> >
> > After attending @t3rmin4t0r's ORC 2015 session @hadoopsummit in Brussels
> > last week I was amazed by the results presented.
>
> I run at the very cutting edge of the builds all the time :)
>
> The bloom filters are there in hive-1.2.0 which is currently sitting in
> svn/git today.

In production we run HDP 2.2.4. Any thoughts on when crazy stuff like bloom filters might move to GA?

> > I have decided I will move my raw-data into HIVE using ORC and zlib. How
> > would you guys recommend I would do that?
>
> The best mechanism is always to write it via a Hive SQL ETL query.
>
> The real question is how the events are exactly organized. Is it a flat
> structure with something like a single line of JSON for each data item?

The data is single-line text events. Nothing fancy, no multiline or binary content. Each event is 200 - 800 bytes long. The events come in 5 formats (depending on which application produced them) and none are JSON. I wrote a small lib with 5 Java classes which implement parse(String raw) and return a JSONObject; they are used in my Storm bolts.

> That is much more easy to process than other data formats - the gzipped
> data can be natively read by Hive without any trouble.
>
> The Hive-JSON-Serde is very useful for that, because it allows you to read
> random data out of the system - each "view" would be an external table
> enforcing a schema onto a fixed data set (including maps/arrays).
>
> You would create maybe 3-4 of these schema-on-read tables, then insert
> into your ORC structures from those tables.
>
> If you had binary data, then it would be much easier to write a convertor
> to JSON & then follow the same process as well instead of attempting a
> direct ORC writer, if you want >1 views out of the same table using
> external tables.

So I need to write my own format reader, a custom SerDe - specifically the Deserializer part? Then 5 schema-on-read external tables using my custom SerDe. And then 5 new tables, stored as ORC with e.g. zlib, each created as a select from one of the 5 external tables? That doesn't sound too bad! I expect bugs :)

> > 2) write a storm-topology to read the parsed_topic and stream them to
> > Hive/ORC.
>
> You need to effectively do that to keep a live system running.
>
> We've had some hiccups with the ORC feeder bolt earlier with the <2s ETL
> speeds (see
> https://github.com/apache/storm/tree/master/external/storm-hive).
>
> That needs some metastore tweaking to work perfectly (tables to be marked
> transactional etc), but nothing beyond config params.

Ok. Thanks for the tip. We put all the data into Elasticsearch, where it is searchable for N days. Thus, it is not a big priority to have the data in Hive as quickly as possible, but Storm does seem like a nice place to put my code to do this. When we're able to stream parsed JSON data from Kafka into Hive/ORC we'll stop putting raw gzip data onto HDFS. All of this is just to catch up and clean up our historical garbage bin of data, which piled up while we got Kafka - Storm - Elasticsearch running :-)

> > 3) use spark instead of map-reduce. Only, I don't see any benefits in
> > doing so with this scenario.
>
> The ORC writers in Spark (even if you merge the PR SPARK-2883) are really
> slow because they are built against hive-13.x (which was my "before"
> comparison in all my slides).

Yes. I did play with Hive < 13.x, without Tez. I liked your performance data better :-)

> I really wish they'd merge those changes into a release, so that I could
> make ORC+Spark fast.
>
> Cheers,
> Gopal

Thanks,
Kjell Tore
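
P.S. A rough sketch of the "convertor to JSON" route mentioned above, in case it turns out simpler than a custom SerDe: run each raw line through a parser and emit one JSON object per line, which the Hive-JSON-Serde could then read. Everything named here is an assumption for illustration only: the EventParser interface, the KeyValueEventParser class and its "key=value" input format are made up (the real 5 formats are not shown in this thread), and a plain Map<String, String> stands in for JSONObject so the sketch needs nothing beyond the JDK.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Turns parsed events into single-line JSON objects (string values only). */
final class RawToJsonLines {

    /** Escapes the characters that would break a single-line JSON string. */
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n");  break;
                case '\r': out.append("\\r");  break;
                case '\t': out.append("\\t");  break;
                default:
                    if (c < 0x20) out.append(String.format("\\u%04x", (int) c));
                    else out.append(c);
            }
        }
        return out.toString();
    }

    /** Renders one parsed event as a JSON object on a single line. */
    static String toJsonLine(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (!first) sb.append(',');
            sb.append('"').append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append('"');
            first = false;
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        EventParser parser = new KeyValueEventParser();
        // Hypothetical raw event; the real formats differ per application.
        System.out.println(toJsonLine(parser.parse("ts=1429710360 app=web status=200")));
    }
}

/** Hypothetical stand-in for the parse(String raw) contract of the existing lib. */
interface EventParser {
    Map<String, String> parse(String raw);
}

/** Hypothetical parser for an invented "key=value key=value" event format. */
final class KeyValueEventParser implements EventParser {
    @Override
    public Map<String, String> parse(String raw) {
        Map<String, String> fields = new LinkedHashMap<>(); // keeps field order stable
        for (String token : raw.trim().split("\\s+")) {
            int eq = token.indexOf('=');
            if (eq > 0) { // skip tokens without a key
                fields.put(token.substring(0, eq), token.substring(eq + 1));
            }
        }
        return fields;
    }
}
```

With one parser per application format, the same toJsonLine step would serve all 5, and the schema would live only in the external table definitions.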
