Some pointers, in case you're not already aware of them.
https://drill.apache.org/docs/querying-the-information-schema/
show files in dfs.foo;
show files in dfs.`/foo/bar`;
select * from information_schema.`files`;
In my experience, the last one can be slow, especially if you've set the
option storage.list_files_recursively = true.
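A minimal sketch of how the last query above could be narrowed to return only directories. It assumes the FILES table exposes the columns SCHEMA_NAME, RELATIVE_PATH and IS_DIRECTORY (worth confirming with a SELECT * ... LIMIT 1 on your Drill version), and it reuses dfs.foo from the examples above as a hypothetical workspace name. Filtering on a single workspace is one way to limit the performance cost of the recursive listing.

```sql
-- Assumption: recursive listing is wanted so that nested directories show up.
ALTER SESSION SET `storage.list_files_recursively` = true;

-- Assumption: column names are as recalled from the Drill docs; verify first.
-- 'dfs.foo' is a hypothetical workspace, taken from the examples above.
SELECT SCHEMA_NAME, RELATIVE_PATH
FROM INFORMATION_SCHEMA.`FILES`
WHERE SCHEMA_NAME = 'dfs.foo'
  AND IS_DIRECTORY = true
ORDER BY RELATIVE_PATH;
```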
On 2021/08/20 13:57, Rafael Jaimes III wrote:
Thanks Charles.
My question is more about the case where you don't know the names of the
directories. In your example, you have to know that data1 and study1 are called
that. How do you find this information? Sure, you can examine the file system
separately from Drill.
Is information about the file system path names available within Drill, such as
in INFORMATION_SCHEMA or similar?
In short, I'm wondering if it's possible to have a command like LIST TABLES; and
have it return study1.data1, study1.data2
On August 20, 2021 6:49:42 AM EDT, luoc <[email protected]> wrote:
Best practices. Schema-free in Drill.
On August 20, 2021, at 12:04, Charles Givre <[email protected]> wrote:
Hi Rafael,
If you're asking what I think you're asking, it sounds as if you'd like to
query multiple files in a nested directory. If that's the case, I have some
good news...
Drill allows you to query entire directories as if they were one big file.
Effectively Drill performs a UNION on those files, so the end result is that
they appear to be one big table.
Thus, with the structure you provided, you could do the following:
SELECT ...
FROM dfs.`<path>/study1/data1`
That would roll up all the files under that directory path. Now, there are
some tricks that you should be aware of. The first is implicit columns.
These can help you figure out the directory structure and do some basic
filtering. There are also some functions that are specific to querying
directories. Take a look at the links below for references on the implicit
columns as well as the directory functions.
https://drill.apache.org/docs/querying-a-file-system-introduction/
https://drill.apache.org/docs/querying-directories/
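As a rough illustration of the implicit columns mentioned above, the sketch below uses dir0 and dir1, which Drill fills in with the first and second subdirectory levels beneath the queried path. The <path> placeholder is left elided as in the message, and the aliases study and dataset are made up for this example. Note the assumptions: Drill has to plan a scan over every file under <path>, so this may be slow on a large dataset, and directories that contain no files will not appear in the result.

```sql
-- <path> is elided as in the original message; substitute your own root.
-- dir0 = first-level subdirectory (e.g. study1), dir1 = second level (e.g. data1).
-- The aliases are hypothetical names chosen for this sketch.
SELECT DISTINCT dir0 AS study, dir1 AS dataset
FROM dfs.`<path>`
ORDER BY study, dataset;

-- The same columns can be used for basic filtering on one directory:
SELECT filename, dir2 AS day
FROM dfs.`<path>`
WHERE dir0 = 'study1' AND dir1 = 'data1';
```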
Best,
-- C
On Aug 19, 2021, at 8:57 PM, Rafael Jaimes III <[email protected]> wrote:
Hi all,
I have a large dataset of parquet files that are nested within several
subdirectories. For example:
study1
|----data1
|    |----2020-01-01
|         |----0001.parquet
|----data2
study2
|----dataA
|----dataB
Is it possible for Drill to report back the "directories" as "tables"? For
example, can I run a query that returns something describing the directory
structure?
I've read something about creating workspaces, but to do so for each of the
directories seems onerous, and also requires going into the storage plugin
configuration.
The alternative would be to implement some logic to traverse the file
system outside of Drill, and then use that information to drive the
"tables" for the queries. That seems unintuitive, though, given Drill's
ability to traverse the file system, infer schema, create cache, and so on.
Thanks,
Rafael