You are probably on the right track: create a workspace for querying one or two levels up the directory structure, and then use CTAS on subdirectories for incremental data loads.
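A minimal sketch of what that could look like in Drill SQL (the `dfs.etl` workspace name and the `events` paths are illustrative assumptions, not anything from this thread):

```sql
-- Assumed layout: <workspace root>/events/<yyyy-mm-dd>/*.csv, exposed
-- through a workspace `dfs.etl` (names are made up for illustration).

-- Query a level up the directory tree; dir0 is the date directory:
SELECT dir0, COUNT(*)
FROM dfs.etl.`events`
WHERE dir0 = '2015-11-02'
GROUP BY dir0;

-- Incremental load: CTAS one day's raw text into a Parquet table:
ALTER SESSION SET `store.format` = 'parquet';
CREATE TABLE dfs.etl.`events_parquet/2015-11-02` AS
SELECT * FROM dfs.etl.`events` WHERE dir0 = '2015-11-02';
```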
—Andries

> On Nov 2, 2015, at 5:29 AM, John Omernik <[email protected]> wrote:
>
> So, as INSERT INTO isn't really an option, I would love some feedback on
> options with Drill for ETL. I understand Drill is very good with raw text;
> at the same time, for a heavily used table, the advantages of taking text
> data and "final" storing it as Parquet would be handy. I was thinking
> through some options, and I thought I'd lay them out here, partially to
> organize my thoughts and partially to get feedback or look for other
> options.
>
> We have data coming in on a regular basis, say every 30 minutes. It's
> quasi-batched already, and we are looking to reduce the lag time of the
> data as much as possible. To start with, this data makes sense to have a
> "directory" partition based on the date. It can be fairly large data per
> day, and that will be the easiest to prune on.
>
> I "may" have this data in the form of already-loaded Parquet files from a
> Hadoop system (I think Impala), but I am not sure how well those files are
> formed, and Drill may have some advantage in loading them itself.
>
> So, with Drill, can I just copy the files into the "current" day partition
> and have it be good? Should I put subdirectories (dir1) with the time of
> the load, or does that not give me anything?
>
> At the end of the day, is there anything in Drill like Hive, where I can
> say "INSERT INTO table (partition=curday) SELECT * FROM table WHERE
> partition=curday", which could read all the loaded Parquet files as a set
> and rewrite them, potentially optimized for queries (adding file-level
> partitions via PARTITIONED BY to the directory-level partitions)?
>
> Obviously, without INSERT INTO functionality that wouldn't work, but I
> could create a table as a tmp partition, like CREATE TABLE
> `tablename/dailyload` AS SELECT * FROM `tablename` WHERE dir0 = 'curday'.
> If that completes, rm 'curday' and mv dailyload curday. That should work,
> right?
> Will Drill work if I have all my days with only dir0, but the daily load
> has dir1? That may be a challenge, but maybe I don't have to have
> subdirectories for each hour. I will admit I am coming at this from a Hive
> perspective, but I'm trying to explain my thinking to see if folks can
> point out other options or ideas in the Drill mindset.
>
> Thanks!
>
> John
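The rm-and-mv swap John describes can be sketched as shell, using a local scratch directory as a stand-in for the real table location (all paths are made up for illustration, and the `touch` stands in for Drill's CTAS output):

```shell
#!/bin/sh
# Sketch of the CTAS-then-swap flow against an illustrative path.
TABLE=/tmp/demo_tablename
mkdir -p "$TABLE/curday" "$TABLE/dailyload"

# Pretend Drill's CTAS just wrote fresh Parquet into dailyload:
touch "$TABLE/dailyload/0_0_0.parquet"

# CTAS completed, so swap the rewritten partition into place:
rm -r "$TABLE/curday"
mv "$TABLE/dailyload" "$TABLE/curday"

ls "$TABLE/curday"
```

One caveat on the design: the swap is not atomic, so a query that starts between the rm and the mv will see a missing partition; running the swap off-hours reduces that window.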
