I have a similar requirement for JSON data, and I do ETL via Drill to Parquet. I still use CTAS and it works well, but coming from the Hive world, I missed "INSERT INTO" and "INSERT OVERWRITE" as well. Here is what I did.
I have a folder that my JSON data comes into. For me, the JSON data is loaded into directories by date (2015-01-01, 2015-01-02, etc.). This wouldn't be a requirement, as you could work the date into the CTAS query I am about to discuss, but that's how it is for me. New files arrive every 5 minutes. Using Chronos on Mesos (this isn't required here either; you could just have your script call the next script upon JSON load success), I have a job that is dependent on the JSON load. When new JSON data lands in the current day, this job runs right after and does a CTAS on the current day of JSON. Instead of running the CTAS over the WHOLE Parquet table, I just do a CTAS into an individual day of Parquet.

Let's summarize my directories:

MYTABLE_JSON
  2015-01-01
  2015-01-02
MYTABLE_PARQUET
  2015-01-01
  2015-01-02

So every 5 minutes, I do a CTAS of:

CREATE TABLE dfs.prod.`MYTABLE_PARQUET/.%CURDAY%` AS
SELECT field1, field2, field3
FROM dfs.prod.`MYTABLE_JSON`
WHERE dir0 = '%CURDAY%';

(If your JSON weren't sorted into directories, you could do something like FROM dfs.prod.`MYTABLE_JSON` WHERE partition_date_field = '%CURDAY%'.)

So in this setup, a couple of notes...

1. Note the table name `MYTABLE_PARQUET/.%CURDAY%` (note the dot in front of %CURDAY%). I do this so it doesn't overwrite the current day's data right away. Let's replace %CURDAY% with 2016-03-14 for readability:

CREATE TABLE dfs.prod.`MYTABLE_PARQUET/.2016-03-14` AS
SELECT field1, field2, field3
FROM dfs.prod.`MYTABLE_JSON`
WHERE dir0 = '2016-03-14';

This creates the directory .2016-03-14 in my folder MYTABLE_PARQUET. Presumably, the folder 2016-03-14 already exists from the previous 5-minute load. I write to the hidden directory so that, while I am creating the updated current day of data, any queries that come in can keep using the previous load's data without issue: directories hidden with a dot prefix are ignored by Drill.
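The scheduled CTAS step above can be sketched as a small shell script. This is only an illustration, not my actual job: the workspace name, field names, and the sqlline connection string in the comment are assumptions you would adapt to your own Drill setup.

```shell
# Build the per-day CTAS with today's date substituted for %CURDAY%.
CURDAY=$(date +%F)   # e.g. 2016-03-14

# The dot-prefixed target directory is the hidden staging area;
# Drill ignores it until it is renamed into place.
QUERY="CREATE TABLE dfs.prod.\`MYTABLE_PARQUET/.${CURDAY}\` AS
SELECT field1, field2, field3
FROM dfs.prod.\`MYTABLE_JSON\`
WHERE dir0 = '${CURDAY}';"

echo "$QUERY"
# Submitting it might look something like this (hypothetical ZooKeeper host):
# sqlline -u jdbc:drill:zk=zkhost:2181 -e "$QUERY"
```

The only moving part is the date substitution; everything else is the same statement run every 5 minutes.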
2. Once my updated directory of Parquet is written by Drill, I can issue a mv command:

mv .2016-03-14 2016-03-14

replacing the previous load with the current load. This happens instantly.

I would love the ability to append data, or even just to use INSERT OVERWRITE (which is how Hive handles this for us), and I think there are some JIRAs open (if not, we should open some). However, using Parquet with CTAS like this has allowed me to work within Drill's current structure and gives me up-to-date Parquet data I can use for fast queries.

I hope this helps.

John

On Mon, Mar 14, 2016 at 3:14 PM, Christian Hellström <[email protected]> wrote: > So, the data is in HDFS, from where I need to transform it into a structure > that is appropriate for visualization, which I do with a set of views. > > Every day I want to pick up the newest files and pack them into Parquet > files, so that our BI tool runs a bit snappier, because running everything > off JSON files that are accessed through views is too slow. While new data > is inserted, a process over which I have no control, I UNION ALL the > previous days' data with the current day's data. > > I would normally write an INSERT INTO to run every night that only takes > the recent data. Since this is not supported I'm curious to see how else I > can solve this. As I said, a CTAS on all the existing data plus the latest > additions is not viable performance-wise, apart from the fact that it's an > ugly solution. Similarly, doing a CTAS to a dummy table followed by a copy > to the right directory is a hack that I don't consider acceptable for > production purposes. > > Any thoughts are greatly appreciated! >
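The stage-then-rename swap John describes can be sketched like this. A temporary sandbox directory stands in for the real table location so the steps can be followed end to end, and the rm -rf before the mv is my addition (a plain mv onto an existing directory would nest rather than replace it); on HDFS you would use hdfs dfs -mv, which is likewise a cheap metadata-only rename.

```shell
# Simulate the state after a CTAS: a hidden staging dir just written by
# Drill, plus the visible dir left over from the previous 5-minute load.
CURDAY=2016-03-14
BASE=$(mktemp -d)/MYTABLE_PARQUET
mkdir -p "$BASE/.$CURDAY" "$BASE/$CURDAY"
touch "$BASE/.$CURDAY/0_0_0.parquet"   # stand-in for Drill's output file

# The swap: drop the stale visible copy, then rename the hidden staging
# directory into place. Queries either see the old directory or the new
# one; never a half-written mix, since Drill ignores the dot-prefixed dir.
rm -rf "$BASE/$CURDAY"
mv "$BASE/.$CURDAY" "$BASE/$CURDAY"
```

After the rename, the fresh Parquet files are visible under 2016-03-14 and the hidden staging directory is gone.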
