So, some disadvantages I've already thought of (mine are more logistical; I am still very curious about performance, etc.):

1. I will have to update my view every night. I don't believe Drill has a way to address the folder date at run time. That said, I could just as easily use "current" as my folder name instead of the current day. I am reluctant, though, because dir0 would then not be correct in the query... or would it be? I could use tsday, which is also the date, as dir0 for the Avro data, and let dir0 just be dir0 for the Parquet. That would ensure I would not have to update the view. However, it would be harder to delineate when one day started and another ended in the Avro data.

2. I would have to orchestrate both the CTAS and the view update outside of Drill. Not a huge pain, but I like self-contained setups :)

John

On Wed, Aug 23, 2017 at 7:48 AM, John Omernik <[email protected]> wrote:

> I have a streaming process that is writing to an Avro table, with schema etc.
> It's coming from BroIDS connection logs, so my table name is like this:
>
> broconnavro/YYYY-MM-DD
>
> Basically I take any data that has come in on that day and put it into a
> dated folder in Avro format.
>
> 1. Avro support is hard, and select * from the table is weird.
> 2. I'd like to get my data into Parquet long term.
>
> So I came up with an idea. I have a table called broconnparq that has
> the same format (broconnparq/YYYY-MM-DD).
>
> My idea is this: I have a view that is essentially
>
> CREATE OR REPLACE VIEW view_broconn AS
> select * from (
> select a.*, 'curdate' as dir0 from broconnavro/curdate a
> UNION ALL
> select * from broconnparq b
> ) c
>
> Then every evening, once my days roll over, I do a CTAS from the current
> day's Avro to the Parquet...
>
> So essentially I'd be querying the Parquet table + today's Avro...
>
> My main question is this: what am I losing here (optimizations etc.)? Is
> this going to kill me on performance at scale? I.e., if I had 4 TB a day of
> data, will I regret this?
>
> For some reference, I am looking at the lagging Avro support and lagging
> INSERT support (https://issues.apache.org/jira/browse/DRILL-3534) as
> reasons for this workaround...
>
> I'd be open to any ideas here!
>
> John
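
For what it's worth, the nightly orchestration outside of Drill could be sketched roughly as below. This is only a sketch under assumptions, not a tested setup: the workspace name `dfs.bro` and the function names are hypothetical, the script only builds the SQL strings (submitting them to Drill is left out), and the view is a simplified form of the one quoted above. It also assumes the rollover job runs just after midnight, so "yesterday" is the day being converted.

```python
from datetime import date, timedelta

# Hypothetical names: the workspace is an assumption, the table names are from the thread.
WORKSPACE = "dfs.bro"
AVRO_TABLE = "broconnavro"
PARQ_TABLE = "broconnparq"


def table_path(table, day=None):
    """Return a backtick-quoted Drill path, optionally for one dated folder."""
    path = table if day is None else f"{table}/{day.isoformat()}"
    return f"{WORKSPACE}.`{path}`"


def nightly_statements(today):
    """Build the statements to run once the day has rolled over to `today`."""
    yesterday = today - timedelta(days=1)
    # Make sure CTAS writes Parquet (store.format is a Drill session option).
    set_format = "ALTER SESSION SET `store.format` = 'parquet'"
    # Convert the finished day's Avro folder into a dated Parquet folder.
    ctas = (
        f"CREATE TABLE {table_path(PARQ_TABLE, yesterday)} AS\n"
        f"SELECT * FROM {table_path(AVRO_TABLE, yesterday)}"
    )
    # Re-point the view at the new current day's Avro folder.
    view = (
        f"CREATE OR REPLACE VIEW {WORKSPACE}.view_broconn AS\n"
        f"SELECT a.*, '{today.isoformat()}' AS dir0 "
        f"FROM {table_path(AVRO_TABLE, today)} a\n"
        "UNION ALL\n"
        f"SELECT * FROM {table_path(PARQ_TABLE)} b"
    )
    return [set_format, ctas, view]


if __name__ == "__main__":
    for stmt in nightly_statements(date.today()):
        print(stmt + ";\n")
```

Whether the column order of `a.*` plus the literal dir0 lines up with `select *` on the Parquet side is something I have not verified, so the UNION ALL may need explicit column lists in practice.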
