I have a streaming process that writing to an avro table, with Schema etc. It's coming from BroIDS Connection logs, so my table name is like this:
broconnavro/YYYY-MM-DD Basically I take any data that has come in on that day and put it into a dated folder in Avro format. 1. Avro support is hard, and select * from the table is weird. 2. I'd like to get my data into Parquet long term. So I came up with an idea. I have a table called broconnparq. That has the same format. (broconnparq/YYYY-MM-DD). My idea this, I have a view that is essentially CREATE OR REPLACE VIEW view_broconn select * from ( select a.*, 'curdate' as dir0 from broconavro/curdate a UNION ALL select * from broconparq b ) c Then every evening, once my days roll over I do a CTAS from the current day Avro to the Parquet... So essentially I'd be Querying PArquet table + today's Avro.... My main question is this... what am I losing here (optimizations etc) Is this going to kill me on performance at scale? I.e. if I had 4 TB a day of data, will I regret this? For some reference, I am looking the lagging Avro support and lagging INSERT support (https://issues.apache.org/jira/browse/DRILL-3534) as reasons for this work around... I'd be open to any ideas here! John
