Alexander -

When I have something like this, especially when the output will be extremely large, I use CTAS into Parquet files. That said, I think you are looking more at the ETL process for JSON. So, ignoring the CTAS to Parquet for now: if you have a bunch of JSON files that will be loaded into Drill incrementally, I use the "hidden" directory feature of Drill.

Let's say, for this example, you have a table (directory) named mytable. Inside of that you partition your table into subdirectories by day, in YYYY-MM-DD format, so your directory structure may look like this:
- mytable
---- 2016-12-01
---- 2016-12-02
---- 2016-12-03

For simplicity, let's assume the date is just the load date. My ETL would be this:

1. Batch job starts today, 2016-12-07.
2. Check for the .2016-12-07 directory; if it does not exist, create it.
3. Copy all new JSON into .2016-12-07.
4. Check for the 2016-12-07 directory; if it does not exist, create it.
5. Move all JSON in .2016-12-07 to 2016-12-07.
6. Remove the directory .2016-12-07.

The reason for this process is simple: the copy process may cause "partial" JSON records to be read by Drill during a query on the main data, causing a query error. (Say a file is still being copied and has only partially arrived when Drill tries to query it.) By default, Drill ignores directories whose names start with a dot, so by using a load directory with a "." prefix you can copy all the data in your batch to the clustered file system, and then use a filesystem mv command, which should be effectively instant, thus avoiding the query errors. This is simplistic, but you should get the idea; a rough sketch follows.
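Purely as an illustration, a minimal shell sketch of steps 1-6 might look like the following. It assumes the clustered file system is mounted locally (e.g. MapR-FS over NFS), that /data/mytable is the directory Drill queries, and that new files land in /incoming/json - all of those paths are placeholders, and on plain HDFS you would use hadoop fs -mkdir / -put / -mv instead.

    #!/bin/bash
    # Rough sketch only -- paths and date handling are assumptions, adjust to taste.

    TABLE=/data/mytable            # directory Drill queries
    TODAY=$(date +%Y-%m-%d)        # e.g. 2016-12-07
    STAGE="$TABLE/.$TODAY"         # dot-prefixed load dir, ignored by Drill
    FINAL="$TABLE/$TODAY"          # partition Drill will actually see

    # 2. Create the hidden staging directory if it does not exist
    mkdir -p "$STAGE"

    # 3. Copy the new JSON into staging (slow, but invisible to queries)
    cp /incoming/json/*.json "$STAGE/"

    # 4. Create the real partition directory if it does not exist
    mkdir -p "$FINAL"

    # 5. Move the files into place; mv within one file system is near-instant
    mv "$STAGE"/*.json "$FINAL/"

    # 6. Clean up the staging directory
    rmdir "$STAGE"

Because the mv stays on the same file system it is essentially a metadata operation, so there is no window in which a query can see a half-written file.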
John

On Wed, Dec 7, 2016 at 7:08 AM, Alexander Reshetov <[email protected]> wrote:
> Hello,
>
> I want to load batches of unstructured data in Drill. Mostly JSON data.
>
> Is there any batch API or other options to do so?
>
>
> Thanks.
>