Re: querying nested parquet directory structure

Charles Givre Thu, 19 Aug 2021 21:04:22 -0700

Hi Rafael, 
If you're asking what I think you're asking, it sounds as if you'd like to 
query multiple files in a nested directory.  If that's the case, I have some 
good news...
Drill allows you to query entire directories as if they were one big file.  
Effectively Drill performs a UNION on those files, so the end result is that 
they appear to be one big table. 
Thus, with the structure you provided, you could do the following:


SELECT ...
FROM dfs.`<path>/study1/data1`

That would roll up all the files under that directory path.  Now, there are 
some tricks that you should be aware of.  The first is implicit columns 
(dir0, dir1, ... for each subdirectory level, plus filename, filepath, fqn and 
suffix).  These can help you figure out the directory structure and do some 
basic filtering.  There are also some functions that are specific to querying 
directories (MAXDIR, MINDIR, IMAXDIR and IMINDIR).  Take a look at the links 
below for references about the implicit fields as well as the directory 
functions.  
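
As a rough sketch of how the implicit columns could answer the original 
question, assuming the directory layout from your message and the same 
unspecified <path> placeholder as above (I have not run these against your 
data):

```sql
-- List the distinct subdirectory levels found beneath study1.
-- dir0 is the first level below the queried path (e.g. data1),
-- dir1 is the second level (e.g. 2020-01-01).
SELECT DISTINCT dir0, dir1
FROM dfs.`<path>/study1`

-- Separate query: restrict to one branch of the tree and count the rows
-- contributed by each file, using the implicit filename column.
SELECT dir0, dir1, filename, COUNT(*) AS row_count
FROM dfs.`<path>/study1`
WHERE dir0 = 'data1'
GROUP BY dir0, dir1, filename
```

Note that the dirN columns are derived from the files Drill actually reads, so 
a directory containing no data files would most likely not show up in the 
first result.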

https://drill.apache.org/docs/querying-a-file-system-introduction/
https://drill.apache.org/docs/querying-directories/

Best,
-- C

 

> On Aug 19, 2021, at 8:57 PM, Rafael Jaimes III <[email protected]> wrote:
> 
> Hi all,
> 
> I have a large dataset of parquet files that are nested within several
> subdirectories. For example:
> 
> study1
> |----data1
>    |----2020-01-01
>        |---0001.parquet
> |----data2
> 
> study2
> |----dataA
> |----dataB
> 
> Is it possible for Drill to report back the "directories" as "tables"? For
> example to perform a query and return something that tells me the directory
> structure?
> 
> I've read something about creating workspaces, but to do so for each of the
> directories seems onerous, and also requires going into the storage plugin
> configuration.
> 
> The alternative would be to implement some logic and traverse the file
> system, outside of Drill, and then use that information to drive the
> "tables" for the queries. Although, that seems unintuitive provided Drill's
> ability to traverse the file system, infer schema, create cache, and so on.
> 
> Thanks,
> Rafael
