So, some disadvantages I've already thought of (mine are more logistics, I
am still very curious on performance etc).

1. I will have to update my view every every night. I don't believe Drill
has a way to address the folder date at run time.  That said, I could also
easily, instead of a the current day, use "current" as my folder name.  I
am reluctant though, because dir0 would not be correct in the query or
would it be... I could use tsday which is also the date as dir0 for the
avro data, and dir0 just be dir0 for the parquet... that would ensure I
would not have to update... however, it would be harder to delineate when
one day started and another ended in the avro data...

2. I would have to orchestrate both the CTAS and the View Update outside of
Drill, not a huge pain, but I like self contained setups :)

John


On Wed, Aug 23, 2017 at 7:48 AM, John Omernik <[email protected]> wrote:

> I have a streaming process that writing to an avro table, with Schema etc.
> It's coming from BroIDS Connection logs, so my table name is like this:
>
>
> broconnavro/YYYY-MM-DD
>
>
> Basically I take any data that has come in on that day and put it into a
> dated folder in Avro format.
>
> 1. Avro support is hard, and select * from the table is weird.
> 2. I'd like to get my data into Parquet long term.
>
>
> So I came up with an idea. I have a table called broconnparq.  That has
> the same format. (broconnparq/YYYY-MM-DD).
>
> My idea this, I have a view that is essentially
> CREATE OR REPLACE VIEW view_broconn
> select * from (
>   select a.*, 'curdate' as dir0 from broconavro/curdate a
>   UNION ALL
>   select * from broconparq b
> ) c
>
> Then every evening, once my days roll over I do a CTAS from the current
> day Avro to the Parquet...
>
> So essentially I'd be Querying PArquet table + today's Avro....
>
> My main question is this... what am I losing here (optimizations etc) Is
> this going to kill me on performance at scale? I.e. if I had 4 TB a day of
> data, will I regret this?
>
>
> For some reference, I am looking the lagging Avro support and lagging
> INSERT support (https://issues.apache.org/jira/browse/DRILL-3534) as
> reasons for this work around...
>
> I'd be open to any ideas here!
>
> John
>
>

Reply via email to