You are probably on the right track: create a workspace for querying one or two levels up the directory structure, and then use CTAS on subdirectories for incremental data loads.
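A minimal sketch of what that could look like in Drill SQL (the `dfs.etl` workspace name and the `events` paths are illustrative assumptions, not anything from this thread):

```sql
-- Assumed layout: <workspace root>/events/<yyyy-mm-dd>/*.csv, exposed
-- through a workspace `dfs.etl` (names are made up for illustration).

-- Query a level up the directory tree; dir0 is the date directory:
SELECT dir0, COUNT(*)
FROM dfs.etl.`events`
WHERE dir0 = '2015-11-02'
GROUP BY dir0;

-- Incremental load: CTAS one day's raw text into a Parquet table:
ALTER SESSION SET `store.format` = 'parquet';
CREATE TABLE dfs.etl.`events_parquet/2015-11-02` AS
SELECT * FROM dfs.etl.`events` WHERE dir0 = '2015-11-02';
```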
—Andries

> On Nov 2, 2015, at 5:29 AM, John Omernik <[email protected]> wrote:
>
> So, as INSERT INTO isn't really an option, I would love some feedback on
> options with Drill for ETL. I understand Drill is very good with raw text;
> at the same time, for a heavily used table, the advantages of taking text
> data and "final" storing it as Parquet would be handy. I was thinking
> through some options, and I thought I'd lay them out here, partially to
> organize my thoughts and partially to get feedback or look for other
> options.
>
> We have data coming in on a regular basis, say every 30 minutes. It's
> quasi-batched already, and we are looking to reduce the lag time of the
> data as much as possible. To start with, this data makes sense to have a
> "directory" partition based on the date. It can be fairly large data per
> day, and that will be the easiest to prune on.
>
> I "may" have this data in the form of already-loaded Parquet files from a
> Hadoop system (I think Impala), but I am not sure how well those files are
> formed, and Drill may have some advantage in loading them itself.
>
> So, with Drill, can I just copy the files into the "current" day partition
> and have it be good? Should I put subdirectories (dir1) with the time of
> the load, or does that not give me anything?
>
> At the end of the day, is there anything in Drill like Hive, where I can
> say "INSERT INTO table (partition=curday) SELECT * FROM table WHERE
> partition=curday", which could read all the loaded Parquet files as a set
> and rewrite them, potentially optimized for queries (adding file-level
> partitions via PARTITIONED BY to the directory-level partitions)?
>
> Obviously, without INSERT INTO functionality that wouldn't work, but I
> could create a table as a tmp partition, like CREATE TABLE
> `tablename/dailyload` AS SELECT * FROM `tablename` WHERE dir0 = 'curday'.
> If that completes, rm 'curday' and mv dailyload curday. That should work,
> right?
> Will Drill work if I have all my days with only dir0, but the daily load
> has dir1? That may be a challenge, but maybe I don't have to have
> subdirectories for each hour. I will admit I am coming at this from a Hive
> perspective, but I'm trying to explain my thinking to see if folks can
> point out other options or ideas in the Drill mindset.
>
> Thanks!
>
> John
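The rm-and-mv swap John describes can be sketched as shell, using a local scratch directory as a stand-in for the real table location (all paths are made up for illustration, and the `touch` stands in for Drill's CTAS output):

```shell
#!/bin/sh
# Sketch of the CTAS-then-swap flow against an illustrative path.
TABLE=/tmp/demo_tablename
mkdir -p "$TABLE/curday" "$TABLE/dailyload"

# Pretend Drill's CTAS just wrote fresh Parquet into dailyload:
touch "$TABLE/dailyload/0_0_0.parquet"

# CTAS completed, so swap the rewritten partition into place:
rm -r "$TABLE/curday"
mv "$TABLE/dailyload" "$TABLE/curday"

ls "$TABLE/curday"
```

One caveat on the design: the swap is not atomic, so a query that starts between the rm and the mv will see a missing partition; running the swap off-hours reduces that window.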
