Alexander -

When I have something like this, especially when the output will be
extremely large, I use CTAS into Parquet files. That said, I think you're
asking more about the ETL process for JSON.  So, setting the CTAS to
Parquet aside for now: if you have a bunch of JSON files that will be
loaded incrementally into Drill, I use the "hidden" directory feature of
Drill. Let's say, for this example, you have a table (directory) named
mytable. Inside it you partition your table into subdirectories by day in
YYYY-MM-DD format, so your directory structure might be

- mytable
---- 2016-12-01
---- 2016-12-02
---- 2016-12-03

For simplicity, let's assume the date is just the load date.  My ETL would
look like this:

1. Batch job starts today, 2016-12-07
2. Check for the .2016-12-07 directory; if it does not exist, create it
3. Copy all new JSON into .2016-12-07
4. Check for the 2016-12-07 directory; if it does not exist, create it
5. Move all JSON in .2016-12-07 to 2016-12-07
6. Remove the directory .2016-12-07
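
The steps above can be sketched as a small batch script.  This version
uses plain filesystem commands against a local demo directory; the paths
are made up, and on a real clustered file system you would swap in the
equivalent commands (e.g. hadoop fs -put / -mv) for the same steps:

```shell
set -e

# Demo setup under a temp dir; in real use these would be your CFS paths
ROOT=$(mktemp -d)
INCOMING="$ROOT/incoming"; mkdir -p "$INCOMING"
echo '{"id":1}' > "$INCOMING/a.json"

TODAY=$(date +%Y-%m-%d)
TABLE="$ROOT/mytable"
STAGE="$TABLE/.$TODAY"       # hidden load dir: Drill skips dot-prefixed dirs
FINAL="$TABLE/$TODAY"

mkdir -p "$STAGE"                  # step 2: create hidden staging dir
cp "$INCOMING"/*.json "$STAGE"/    # step 3: slow copy, invisible to queries
mkdir -p "$FINAL"                  # step 4: create the final partition dir
mv "$STAGE"/*.json "$FINAL"/       # step 5: a rename, effectively instant
rmdir "$STAGE"                     # step 6: drop the now-empty staging dir
```

The only part Drill ever sees mid-flight is the dot-prefixed directory,
which it ignores; the mv at the end is what makes the new partition appear
atomically.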

The reason for this process is simple: the copy may cause "partial" JSON
records to be read by Drill during a query on the main data, causing query
failures. (Say a file is being copied and is only partially written when
Drill tries to query it.)  By default, Drill ignores directories that
start with a dot, so by using a load directory with a "." prefix you can
copy all the data in your batch to the clustered file system and then use
a filesystem mv command, which should be effectively instant (thus
avoiding those query errors).

This is simplistic, but you should get the idea.
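
And for completeness, the CTAS-to-Parquet route I mentioned at the top is
just something like the following in Drill (the workspace and table names
here are illustrative, not yours):

```sql
-- Write query results out as Parquet instead of re-reading raw JSON
ALTER SESSION SET `store.format` = 'parquet';

CREATE TABLE dfs.tmp.`mytable_parquet` AS
SELECT * FROM dfs.`/data/mytable`;
```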

John



On Wed, Dec 7, 2016 at 7:08 AM, Alexander Reshetov <
[email protected]> wrote:

> Hello,
>
> I want to load batches of unstructured data in Drill. Mostly JSON data.
>
> Is there any batch API or other options to do so?
>
>
> Thanks.
>
