So as Insert into isn't really an option, I would love some feed back on
options with Drill for ETL. I understand Drill is very good with raw text,
at the same time, for a highly used table, the advantages of taking text
data and "final" storing it as Parquet would be handy.  I was thinking
through some options, and I thought I'd lay them out here, partially to
organize my thoughts, and partially to get feedback or look for other
options.

We have data coming in on a regular basis, say every 30 minutes.  It's
quasi batched already, and we are looking to reduce the lag time of the
data as much as possible. To start out with things, this data makes sense
to have a "directory" partition based on the date.  It can be fairly large
data per data, and that will be the easiest to prune on.

I "may" have this data in a format that is already loaded Parquet files
from a Hadoop system (I think Impala) but I am not sure how well these
files are formed, and Drill may have some advantage to loading them itself.

So, with Drill, can I just copy the files into the "current" day partition
and have it be good? Should I put subdirectories (dir1) with the time of
the load or does that not give me anything?

At the end of the day, is there anything in Drill like Hive where I can say
"INSERT INTO table(partition=curday) select * from table where
partition=curday that could read all the loaded Parquet files as a set and
rewrite them, potentially optimized for queries (adding file level
partitions via PARTITIONED BY to the directory level partitions?)

Obviously without an INSERT INTO functionality , that wouldn't work, I
could create table as a tmp partition like CREATE TABLE
`tablename/dailyload` AS select * from `tablename` where dir0 = 'curday` If
that completes, rm 'curday' and mv dailyload curday.  That should work
right?

Will Drill work if I have all my days with only dir0 but the daily load has
dir1? That may be a challenge, but maybe I don't have to have
subdirectories for each hour.  I will admit I am coming at this from a Hive
perspective, but trying to explain my thinking to see if folks can point
out other options or ideas in the Drill mindset.


Thanks!

John

Reply via email to