Hi John,

Thanks, I tried with a directory containing several parquet sub-directories.
It works and looks in Drill like one parquet data source.
Not exactly what I want, but it's a good workaround.

Thanks again.

On Wed, Dec 7, 2016 at 4:39 PM, John Omernik <[email protected]> wrote:

> Alexander -
>
> When I have something like this, especially when the output will be
> extremely large, I use CTAS into Parquet files. That said, I think you are
> looking more at the ETL process for JSON. So, ignoring the CTAS to Parquet
> for now, if you have a bunch of JSON files that will be loaded
> incrementally into Drill, I use the "hidden" directory feature of Drill.
> Let's say, for this example, you have a table (directory) named mytable.
> Inside of that you partition your table into subdirectories by days in
> YYYY-MM-DD format. So your directory structure may be
>
> - mytable
> ---- 2016-12-01
> ---- 2016-12-02
> ---- 2016-12-03
>
> For simplicity, let's assume the date is just the load date. My ETL would
> be this
>
> 1. Batch job starts today, 2016-12-07
> 2. Check for .2016-12-07 directory, if not exists, create it
> 3. Copy all new json into .2016-12-07
> 4. Check for 2016-12-07 directory, if not exists, create it
> 5. Move all json in .2016-12-07 to 2016-12-07
> 6. Remove directory .2016-12-07
>
> The reason for this process is simple: the copy process may cause "partial"
> json records to be read by Drill during a query on the main data, thus
> causing a query error. (Let's say a file is being copied and is only
> partially copied over when Drill tries to query it.) By default, Drill
> ignores directories that start with . so by using a load directory with a
> prefix of . you can copy all the data in your batch to the clustered file
> system, and then use a filesystem mv command, which should be instant
> (thus avoiding your query errors).
>
> This is simplistic, but you should get the idea.
>
> John
>
>
>
> On Wed, Dec 7, 2016 at 7:08 AM, Alexander Reshetov <
> [email protected]> wrote:
>
>> Hello,
>>
>> I want to load batches of unstructured data in Drill. Mostly JSON data.
>>
>> Is there any batch API or other options to do so?
>>
>>
>> Thanks.
>>
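
The six ETL steps John describes could be scripted roughly as below. This is only a minimal sketch, assuming the table directory sits on a locally mounted POSIX filesystem where a rename within one filesystem is effectively instant. The function name load_batch and its parameters are hypothetical, not part of Drill. On HDFS or another clustered filesystem, the copy and move would go through that filesystem's own tools instead.

```python
import shutil
from datetime import date
from pathlib import Path


def load_batch(table_dir, new_files, load_date):
    """Stage files in a dot-prefixed directory, then move them into the partition.

    Hypothetical helper: the names here are illustrative, not a Drill API.
    """
    table = Path(table_dir)
    staging = table / f".{load_date.isoformat()}"  # hidden directory, skipped by Drill
    partition = table / load_date.isoformat()      # visible partition directory

    # Steps 2-3: create the hidden directory if needed and copy the batch into it.
    staging.mkdir(parents=True, exist_ok=True)
    for src in new_files:
        shutil.copy2(src, staging / Path(src).name)

    # Step 4: create the visible partition directory if needed.
    partition.mkdir(exist_ok=True)

    # Step 5: move each fully copied file with a rename (same filesystem assumed).
    moved = []
    for staged in sorted(staging.iterdir()):
        target = partition / staged.name
        staged.rename(target)
        moved.append(target)

    # Step 6: remove the now-empty hidden directory.
    staging.rmdir()
    return moved
```

Calling load_batch("mytable", ["/incoming/a.json"], date(2016, 12, 7)) would leave the file in mytable/2016-12-07 and remove mytable/.2016-12-07. The path /incoming/a.json is a made-up example.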
