Hi All,

Drill supports two forms of CSV files: those with headers and those without.

When a file has headers (and Drill is configured to read them), you can get the 
schema using the metadata API in xDBC or the native Drill client. Do a LIMIT 0 
and you’ll get back just the schema of Varchar columns inferred from the CSV 
headers.

On the other hand, if you read the CSV file without headers, then you already 
know the schema: a single `columns` column that is a Varchar array — but I 
suppose that is not terribly useful...

What Drill does not do is return a table-like result with the metadata; like 
what DESCRIBE would do. Instead, you have to use the xDBC metadata API to 
retrieve the information.

But, given that Drill does support LIMIT 0, perhaps some enterprising soul 
might want to build the DESCRIBE support on top of it. You’d need some SQL 
syntax (might already be in Calcite), then an operator that converts the 
internal batch schema to a table of Java objects, like is done for system 
options, etc.

The next issue is that if your CSV file looks like this:

fred,47,1960-05-06,14.56

It does not help much to find out that these are all Varchar columns. Would be 
nice to have some light-weight way of defining the type (not a cast; the cast 
loads the data as Varchar, then does a conversion, which can be slow.)

But, Drill’s claim to fame is schema-on-read, so Drill has no way to specify a 
schema at present. This would be another good weekend project. The storage 
plugin would look for a schema file at plan time, load it, and pass the schema 
information to the reader which would do the required conversions on read. The 
result would be that CSV files behave more like Parquet files (which do specify 
column type as well as name.)

- Paul

> On Oct 24, 2017, at 6:21 AM, Charles Givre <[email protected]> wrote:
> 
> @Divya, 
> I’m aware of that, but there is no way as it stands to discover the schema of 
> a file, CSV or otherwise.  I’m not talking about attempting to infer data 
> types, just the simple number of columns, name of columns and what Drill 
> thinks they are.  I know that you can do this via a series of queries, but 
> for files, there is no equivalent of DESCRIBE.
> That’s what I was getting at and I think that would be really useful to have. 
>  Also it wouldn’t really alter the character of Drill.
> — C
> 
> 
>> On Oct 24, 2017, at 02:02, Divya Gehlot <[email protected]> wrote:
>> 
>> @Charles : when you read csv files can casting and get the data type as per
>> your requirements
>> 
>> Hope I answer what you asked for :)
>> 
>> HTH
>> Thanks,
>> Divya
>> 
>> On 20 October 2017 at 01:42, Charles Givre <[email protected]> wrote:
>> 
>>> Hi Alfredo,
>>> When I was trying to get Drill to work with various BI tools, all I really
>>> needed was a list of columns.  Data types would be a big bonus, but Drill
>>> interprets CSV data as plain text anyway.  It would be really useful for
>>> other file types where Drill does infer data types.
>>> 
>>> 
>>> — C
>>> 
>>> 
>>>> On Oct 19, 2017, at 6:13 AM, Alfredo Serafini <[email protected]> wrote:
>>>> 
>>>> Hi thanks for the replies!
>>>> 
>>>> @Chun yes using Views is an approach I considered, and I like it also
>>>> methodologically, in order to have some change to "prepare" the data
>>>> just a bit. I'm testing drill as a sort of data facade for tools which
>>>> handles mappings to other context, so this could be helpful for me.
>>>> 
>>>> Anyway I have some concerns regardings metadata/catalog support for
>>>> views too: it seems that every view is saved on disk as a JSON file,
>>>> then experimenting the same issues. Are you suggesting saving views to
>>>> some kind of relational database storage for staging purposes? Is that
>>>> possible?
>>>> 
>>>> Sorry for all the questions :-)
>>>> 
>>>> 
>>>> @Charles yes Metabase (or Tableau, Superset, and so on...) is another
>>>> use case in which it would be great to connect them to explore data
>>>> with the capabilities of drill, and even for an initial exploration of
>>>> data since sometimes reducing the initial analysis phase time could
>>>> help with development.
>>>> 
>>>> For CSV it would be possible IMHO to guess types in a very basic way,
>>>> at least using basic types and map columns to a text/String when a
>>>> type can't be inferreed. It could be a starting point, and probably
>>>> the more confortable case where to start for the (partial) support of
>>>> catalog informations (JSON would be more complex, just to say). If
>>>> there are standard interfaces that can be extended/implemented for
>>>> filling them with those informations I'd like to do some
>>>> experimentation on that, if it's not too complex to follow, and if
>>>> someone can point me to a good place where to start for doing some
>>>> experiments of a possible implementation, for the CSV case.
>>>> 
>>>> Thanks for the comments, I appreciate them
>>>> 
>>>> Alfredo
>>>> 
>>>> 
>>>> 
>>>> I’d like to second Alfredo’s request.  I’ve been trying to get Drill
>>>> to work with some
>>>>> open source visualization tools such as SqlPad and Metabase and the
>>> issue I keep running into
>>>>> is that Drill doesn’t have a convenient way to describe how it
>>> interprets flat files.  This
>>>>> is really frustrating for me since this is my main use of Drill!
>>>>> I wish the SELECT * FROM <data> LIMIT 0 worked in the RESTFul
>>> interface.  In any event,
>>>>> would be very useful to have some way to get Drill to describe how it
>>> will interpret a flat
>>>>> file.
>>>>> — C
>>>> 
>>>> 
>>>> 
>>>>> On Oct 18, 2017, at 15:20, Chun Chang <[email protected]> wrote:
>>>>> 
>>>>> There were discussions on the need of building a catalog for drill. But
>>> I don't think
>>>> that's the focus right now. And I am not sure the community will ever
>>>> decide to go in that
>>>> direction. For now, you best bet is to create views on top of your
>>>> JSON/CSV data.
>>>>> 
>>>>> ________________________________
>>>>> From: Alfredo Serafini <[email protected]>
>>>>> Sent: Wednesday, October 18, 2017 8:31:15 AM
>>>>> To: [email protected]
>>>>> Subject: describe query support? (catalog metadata, etc)
>>>>> 
>>>>> Hi I'm experimenting using Drill as a data virtualization component via
>>>>> JDBC and it generally works great for my needs.
>>>>> 
>>>>> However some of the components connected via JDBC needs basic
>>>>> metadata/catalog informations, and they seems to be missing for JSON /
>>> CSV
>>>>> sources.
>>>>> 
>>>>> For example the simple query
>>>>> 
>>>>> DESCRIBE cp.`employee.json`;
>>>>> 
>>>>> returns no results.
>>>>> 
>>>>> Another possible example case could be when reading from an sqlite
>>> source
>>>>> containing the same data on an `employees` table
>>>>> DESCRIBE `emploees`
>>>>> 
>>>>> and still get no information: while this command is not directly
>>> supported
>>>>> in SQLite, an equivalent one could be for instance:
>>>>> PRAGMA table_info(`employees`);
>>>>> 
>>>>> but trying to execute it in Drill is not possible, as it is beyond the
>>>>> supported standard SQL dialect.
>>>>> 
>>>>> Moreover using a query like:
>>>>> SELECT *
>>>>> FROM INFORMATION_SCHEMA.COLUMNS
>>>>> WHERE (TABLE_NAME='employees_view');
>>>>> 
>>>>> on a view from the same data, seems to return the informations, so I
>>>>> suppose there should be a way to pass those informations to an
>>>>> internal *DatabaseMetaData
>>>>> <https://docs.oracle.com/javase/8/docs/api/java/sql/
>>> DatabaseMetaData.html>*
>>>>> implementation.
>>>>> I wonder if there is such a component designed to manage all the catalog
>>>>> informations for different sources?
>>>>> 
>>>>> In this case it could adopt different strategies for retrieving
>>> metadata,
>>>>> depending on the case: for sqlite a different command / dialect could be
>>>>> used, for CSV types could be guessed using simple heuristics, and so on.
>>>>> Probably cases like JSON would be much more complex, anyway.
>>>>> Once the metadata have been retrieved for a source, I suppose the
>>> standard
>>>>> SQL dialect should work as expected.
>>>>> 
>>>>> 
>>>>> Are there any plans to add catalog metadata support for various sources?
>>>>> Does anybody have some workaround? for example using views or similar
>>>>> approaches?
>>>>> 
>>>>> 
>>>>> thanks in advance, sorry if the message is too long :-)
>>>>> Alfredo
>>> 
>>> 
> 

Reply via email to