Performance on Unions - Some ideas please!

John Omernik Wed, 23 Aug 2017 05:49:41 -0700

I have a streaming process that writing to an avro table, with Schema etc.
It's coming from BroIDS Connection logs, so my table name is like this:



broconnavro/YYYY-MM-DD


Basically I take any data that has come in on that day and put it into a
dated folder in Avro format.

1. Avro support is hard, and select * from the table is weird.
2. I'd like to get my data into Parquet long term.


So I came up with an idea. I have a table called broconnparq.  That has the
same format. (broconnparq/YYYY-MM-DD).

My idea this, I have a view that is essentially
CREATE OR REPLACE VIEW view_broconn
select * from (
  select a.*, 'curdate' as dir0 from broconavro/curdate a
  UNION ALL
  select * from broconparq b
) c

Then every evening, once my days roll over I do a CTAS from the current day
Avro to the Parquet...

So essentially I'd be Querying PArquet table + today's Avro....

My main question is this... what am I losing here (optimizations etc) Is
this going to kill me on performance at scale? I.e. if I had 4 TB a day of
data, will I regret this?


For some reference, I am looking the lagging Avro support and lagging
INSERT support (https://issues.apache.org/jira/browse/DRILL-3534) as
reasons for this work around...

I'd be open to any ideas here!

John

Performance on Unions - Some ideas please!

Reply via email to