I have a similar requirement for JSON data, and I do ETL via Drill to
Parquet. I still use CTAS and it works well, but coming from the Hive
world, I missed "INSERT INTO" and "INSERT OVERWRITE" as well. Here is what
I did.

I have a folder that my JSON data comes into. For me, the JSON data is
loaded into directories by date (2015-01-01, 2015-01-02, etc.). This isn't
a requirement, as you could work that into the CTAS query I am about to
discuss, but that's how it is for me. New files arrive every 5 minutes.
Using Chronos on Mesos (this isn't required either; you could just have
your script call the next script upon a successful JSON load), I have a
job that depends on the JSON load. Whenever new JSON data lands in the
current day, this job runs right after and does a CTAS on the current day
of JSON. Instead of running the CTAS over the WHOLE Parquet table, I just
do a CTAS for an individual day in Parquet. Let's summarize my
directories:

MYTABLE_JSON
--- 2015-01-01
--- 2015-01-02


MYTABLE_PARQUET
--- 2015-01-01
--- 2015-01-02

So every 5 minutes, I do a CTAS of

CREATE TABLE dfs.prod.`MYTABLE_PARQUET/.%CURDAY%` as select field1, field2,
field3 from dfs.prod.`MYTABLE_JSON` where dir0 = '%CURDAY%';

(If your JSON weren't sorted into directories, you could do something like
from dfs.prod.`MYTABLE_JSON` where partition_date_field = '%CURDAY%'.)
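As a sketch, the %CURDAY% substitution could be scripted like this. The
sqlline invocation in the trailing comment is just illustrative (your
Drill install path and ZooKeeper quorum will differ); the script itself
only builds the statement:

```shell
#!/bin/sh
# Build the per-day CTAS statement with today's date substituted in.
CURDAY=$(date +%Y-%m-%d)            # e.g. 2016-03-14

QUERY="CREATE TABLE dfs.prod.\`MYTABLE_PARQUET/.${CURDAY}\` AS
SELECT field1, field2, field3
FROM dfs.prod.\`MYTABLE_JSON\`
WHERE dir0 = '${CURDAY}'"

echo "$QUERY"
# In the real job you would hand this to Drill, e.g.:
# sqlline -u jdbc:drill:zk=zkhost:2181 -e "$QUERY;"
```

The scheduler (Chronos, cron, whatever you use) just has to run this once
per load interval.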

So in this setup, a couple of notes...

1. Note the table name `MYTABLE_PARQUET/.%CURDAY%` (note the dot in front
of %CURDAY%). I do this so it doesn't overwrite the current day's data
right away. Let's replace %CURDAY% with 2016-03-14 for readability:

CREATE TABLE dfs.prod.`MYTABLE_PARQUET/.2016-03-14` as select field1,
field2, field3 from dfs.prod.`MYTABLE_JSON` where dir0 = '2016-03-14';

This creates the directory .2016-03-14 in my folder MYTABLE_PARQUET.
Presumably, the folder 2016-03-14 already exists from the previous
5-minute load.

I do this so that while I am creating the updated current day of data, any
queries that are running can use the previous load's data without issue.
Directories that are hidden with a dot prefix are ignored by Drill.

Once my updated directory of Parquet is written by Drill, I can issue an
mv command to rename .2016-03-14 to 2016-03-14, replacing the previous
load with the current one. This happens instantly.
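Here is a minimal sketch of that swap (my own illustration, using a local
filesystem and made-up file names; on HDFS the equivalent would be
`hdfs dfs -rm -r` followed by `hdfs dfs -mv`):

```shell
#!/bin/sh
# Simulate the dot-prefix write plus atomic-looking rename.
BASE=$(mktemp -d)
DAY=2016-03-14

mkdir -p "$BASE/MYTABLE_PARQUET/$DAY"               # previous load, still queryable
mkdir -p "$BASE/MYTABLE_PARQUET/.$DAY"              # hidden dir the CTAS writes into
touch "$BASE/MYTABLE_PARQUET/.$DAY/0_0_0.parquet"   # stand-in for Drill's output

# A plain `mv` onto an existing directory would nest it inside, so drop
# the stale load first, then rename. Readers see either the old day or
# the new day, never a half-written one.
rm -rf "$BASE/MYTABLE_PARQUET/$DAY"
mv "$BASE/MYTABLE_PARQUET/.$DAY" "$BASE/MYTABLE_PARQUET/$DAY"
```

The window between the remove and the rename is the only gap, and since
both are metadata operations it is tiny compared to rewriting the files in
place.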

I would love the ability to append data, or even just to use INSERT
OVERWRITE (which is how Hive works for us), and I think there are some
JIRAs open (if not, we should open some). However, using Parquet with CTAS
like this has allowed me to work within Drill's current structure and
gives me up-to-date Parquet data I can use for fast queries.

I hope this helps.

John



On Mon, Mar 14, 2016 at 3:14 PM, Christian Hellström <[email protected]>
wrote:

> So, the data is in HDFS, from where I need to transform it into a structure
> that is appropriate for visualization, which I do with a set of views.
>
> Every day I want to pick up the newest files and pack them into Parquet
> files, so that our BI tool runs a bit snappier, because running everything
> off JSON files that are accessed through views is too slow. While new data
> is inserted, a process over which I have no control, I UNION ALL the
> previous days' data with the current day's data.
>
> I would normally write an INSERT INTO to run every night that only takes
> the recent data. Since this is not supported I'm curious to see how else I
> can solve this. As I said, a CTAS on all the existing data plus the latest
> additions is not viable performance-wise, apart from the fact that it's an
> ugly solution. Similarly, doing a CTAS to a dummy table followed by a copy
> to the right directory is a hack that I don't consider acceptable for
> production purposes.
>
> Any thoughts are greatly appreciated!
>
