Re: querying nested parquet directory structure

James Turton Fri, 20 Aug 2021 05:34:55 -0700

Some pointers, in case you're not already aware of them.


https://drill.apache.org/docs/querying-the-information-schema/

show files in dfs.foo;
show files in dfs.`/foo/bar`;

select * from information_schema.`files`;

In my experience, be careful of performance when using the last one. Especially if you've set the option storage.list_files_recursively = true;



On 2021/08/20 13:57, Rafael Jaimes III wrote:

Thanks Charles.

I'm wondering more along the lines if you don't know the name of the 
directories. In your example, you have to know that data1 and study1 are called 
that. How do you find this information? Sure you can examine the file system 
separate from Drill.

Is there information of the file system path names within Drill, such as in 
INFORMATION_SCHEMA or similar?

In short I'm wondering if it's possible to have a command like LIST TABLES; and 
have returned study1.data1 , study1.data2

On August 20, 2021 6:49:42 AM EDT, luoc <[email protected]> wrote:

Best practices. Schema-free in Drill.

在 2021年8月20日，12:04，Charles Givre <[email protected]> 写道：

Hi Rafael,
If you're asking what I think you're asking, it sounds as if you'd like to 
query multiple files in a nested directory.  If that's the case, I have some 
good news...
Drill allows you to query entire directories as if they were one big file.  
Effectively Drill performs a UNION on those files, so the end result is that 
they appear to be one big table.
Thus, with the structure you provided, you could do the following:

SELECT ...
FROM dfs.`<path>/study1/data1`

That would roll up all the files under that directory path.  Now, there are 
some tricks that you should be aware of.  The first are implicit columns.  
These can help you figure out the directory structure as well as some basic 
filtering.  There are also some specific functions that are unique to querying 
directories.  Take a look at the links below for references about the implicit 
fields as well as the directory functions.

https://drill.apache.org/docs/querying-a-file-system-introduction/ 
<https://drill.apache.org/docs/querying-a-file-system-introduction/>
https://drill.apache.org/docs/querying-directories/ 
<https://drill.apache.org/docs/querying-directories/>

Best,
-- C

On Aug 19, 2021, at 8:57 PM, Rafael Jaimes III <[email protected]> wrote:

Hi all,

I have a large dataset of parquet files that are nested within several
subdirectories. For example:

study1
|----data1
   |----2020-01-01
       |---0001.parquet
|----data2

study2
|----dataA
|----dataB

Is it possible for Drill to report back the "directories" as "tables"? For
example to perform a query and return something that tells me the directory
structure?

I've read something about creating workspaces, but to do so for each of the
directories seems onerous, and also requires going into the storage plugin
configuration.

The alternative would be to implement some logic and traverse the file
system, outside of Drill, and then use that information to drive the
"tables" for the queries. Although, that seems unintuitive provided Drill's
ability to traverse the file system, infer schema, create cache, and so on.

Thanks,
Rafael

Re: querying nested parquet directory structure

Reply via email to