Re: describe query support? (catalog metadata, etc)

Charles Givre Wed, 25 Oct 2017 07:31:16 -0700

Hi Paul, 
One of the issues that I ran into was that while Drill does support LIMIT 0 
queries, the RESTful interface does not.  That would actually be a really 
helpful starting point, and one that I imagine would not be too difficult to 
implement.  If you can give me some direction in terms of what files to look 
at, I can have a go at that. 
— C



> On Oct 25, 2017, at 1:13 AM, Paul Rogers <[email protected]> wrote:
> 
> Hi All,
> 
> Drill supports two forms of CSV files: those with headers and those without.
> 
> When a file has headers (and Drill is configured to read them), you can get 
> the schema using the metadata API in xDBC or the native Drill client. Do a 
> LIMIT 0 and you’ll get back just the schema of Varchar columns inferred from 
> the CSV headers.
> 
> On the other hand, if you read the CSV file without headers, then you already 
> know the schema: a single `columns` column that is a Varchar array — but I 
> suppose that is not terribly useful...
> 
> What Drill does not do is return a table-like result with the metadata; like 
> what DESCRIBE would do. Instead, you have to use the xDBC metadata API to 
> retrieve the information.
> 
> But, given that Drill does support LIMIT 0, perhaps some enterprising soul 
> might want to build the DESCRIBE support on top of it. You’d need some SQL 
> syntax (might already be in Calcite), then an operator that converts the 
> internal batch schema to a table of Java objects, like is done for system 
> options, etc.
> 
> The next issue is that if your CSV file looks like this:
> 
> fred,47,1960-05-06,14.56
> 
> It does not help much to find out that these are all Varchar columns. Would 
> be nice to have some light-weight way of defining the type (not a cast; the 
> cast loads the data as Varchar, then does a conversion, which can be slow.)
> 
> But, Drill’s claim to fame is schema-on-read, so Drill has no way to specify 
> a schema at present. This would be another good weekend project. The storage 
> plugin would look for a schema file at plan time, load it, and pass the 
> schema information to the reader which would do the required conversions on 
> read. The result would be that CSV files behave more like Parquet files 
> (which do specify column type as well as name.)
> 
> - Paul
> 
>> On Oct 24, 2017, at 6:21 AM, Charles Givre <[email protected]> wrote:
>> 
>> @Divya, 
>> I’m aware of that, but there is no way as it stands to discover the schema 
>> of a file, CSV or otherwise.  I’m not talking about attempting to infer data 
>> types, just the simple number of columns, name of columns and what Drill 
>> thinks they are.  I know that you can do this via a series of queries, but 
>> for files, there is no equivalent of DESCRIBE.
>> That’s what I was getting at and I think that would be really useful to 
>> have.  Also it wouldn’t really alter the character of Drill.
>> — C
>> 
>> 
>>> On Oct 24, 2017, at 02:02, Divya Gehlot <[email protected]> wrote:
>>> 
>>> @Charles : when you read csv files can casting and get the data type as per
>>> your requirements
>>> 
>>> Hope I answer what you asked for :)
>>> 
>>> HTH
>>> Thanks,
>>> Divya
>>> 
>>> On 20 October 2017 at 01:42, Charles Givre <[email protected]> wrote:
>>> 
>>>> Hi Alfredo,
>>>> When I was trying to get Drill to work with various BI tools, all I really
>>>> needed was a list of columns.  Data types would be a big bonus, but Drill
>>>> interprets CSV data as plain text anyway.  It would be really useful for
>>>> other file types where Drill does infer data types.
>>>> 
>>>> 
>>>> — C
>>>> 
>>>> 
>>>>> On Oct 19, 2017, at 6:13 AM, Alfredo Serafini <[email protected]> wrote:
>>>>> 
>>>>> Hi thanks for the replies!
>>>>> 
>>>>> @Chun yes using Views is an approach I considered, and I like it also
>>>>> methodologically, in order to have some change to "prepare" the data
>>>>> just a bit. I'm testing drill as a sort of data facade for tools which
>>>>> handles mappings to other context, so this could be helpful for me.
>>>>> 
>>>>> Anyway I have some concerns regardings metadata/catalog support for
>>>>> views too: it seems that every view is saved on disk as a JSON file,
>>>>> then experimenting the same issues. Are you suggesting saving views to
>>>>> some kind of relational database storage for staging purposes? Is that
>>>>> possible?
>>>>> 
>>>>> Sorry for all the questions :-)
>>>>> 
>>>>> 
>>>>> @Charles yes Metabase (or Tableau, Superset, and so on...) is another
>>>>> use case in which it would be great to connect them to explore data
>>>>> with the capabilities of drill, and even for an initial exploration of
>>>>> data since sometimes reducing the initial analysis phase time could
>>>>> help with development.
>>>>> 
>>>>> For CSV it would be possible IMHO to guess types in a very basic way,
>>>>> at least using basic types and map columns to a text/String when a
>>>>> type can't be inferreed. It could be a starting point, and probably
>>>>> the more confortable case where to start for the (partial) support of
>>>>> catalog informations (JSON would be more complex, just to say). If
>>>>> there are standard interfaces that can be extended/implemented for
>>>>> filling them with those informations I'd like to do some
>>>>> experimentation on that, if it's not too complex to follow, and if
>>>>> someone can point me to a good place where to start for doing some
>>>>> experiments of a possible implementation, for the CSV case.
>>>>> 
>>>>> Thanks for the comments, I appreciate them
>>>>> 
>>>>> Alfredo
>>>>> 
>>>>> 
>>>>> 
>>>>> I’d like to second Alfredo’s request.  I’ve been trying to get Drill
>>>>> to work with some
>>>>>> open source visualization tools such as SqlPad and Metabase and the
>>>> issue I keep running into
>>>>>> is that Drill doesn’t have a convenient way to describe how it
>>>> interprets flat files.  This
>>>>>> is really frustrating for me since this is my main use of Drill!
>>>>>> I wish the SELECT * FROM <data> LIMIT 0 worked in the RESTFul
>>>> interface.  In any event,
>>>>>> would be very useful to have some way to get Drill to describe how it
>>>> will interpret a flat
>>>>>> file.
>>>>>> — C
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Oct 18, 2017, at 15:20, Chun Chang <[email protected]> wrote:
>>>>>> 
>>>>>> There were discussions on the need of building a catalog for drill. But
>>>> I don't think
>>>>> that's the focus right now. And I am not sure the community will ever
>>>>> decide to go in that
>>>>> direction. For now, you best bet is to create views on top of your
>>>>> JSON/CSV data.
>>>>>> 
>>>>>> ________________________________
>>>>>> From: Alfredo Serafini <[email protected]>
>>>>>> Sent: Wednesday, October 18, 2017 8:31:15 AM
>>>>>> To: [email protected]
>>>>>> Subject: describe query support? (catalog metadata, etc)
>>>>>> 
>>>>>> Hi I'm experimenting using Drill as a data virtualization component via
>>>>>> JDBC and it generally works great for my needs.
>>>>>> 
>>>>>> However some of the components connected via JDBC needs basic
>>>>>> metadata/catalog informations, and they seems to be missing for JSON /
>>>> CSV
>>>>>> sources.
>>>>>> 
>>>>>> For example the simple query
>>>>>> 
>>>>>> DESCRIBE cp.`employee.json`;
>>>>>> 
>>>>>> returns no results.
>>>>>> 
>>>>>> Another possible example case could be when reading from an sqlite
>>>> source
>>>>>> containing the same data on an `employees` table
>>>>>> DESCRIBE `emploees`
>>>>>> 
>>>>>> and still get no information: while this command is not directly
>>>> supported
>>>>>> in SQLite, an equivalent one could be for instance:
>>>>>> PRAGMA table_info(`employees`);
>>>>>> 
>>>>>> but trying to execute it in Drill is not possible, as it is beyond the
>>>>>> supported standard SQL dialect.
>>>>>> 
>>>>>> Moreover using a query like:
>>>>>> SELECT *
>>>>>> FROM INFORMATION_SCHEMA.COLUMNS
>>>>>> WHERE (TABLE_NAME='employees_view');
>>>>>> 
>>>>>> on a view from the same data, seems to return the informations, so I
>>>>>> suppose there should be a way to pass those informations to an
>>>>>> internal *DatabaseMetaData
>>>>>> <https://docs.oracle.com/javase/8/docs/api/java/sql/
>>>> DatabaseMetaData.html>*
>>>>>> implementation.
>>>>>> I wonder if there is such a component designed to manage all the catalog
>>>>>> informations for different sources?
>>>>>> 
>>>>>> In this case it could adopt different strategies for retrieving
>>>> metadata,
>>>>>> depending on the case: for sqlite a different command / dialect could be
>>>>>> used, for CSV types could be guessed using simple heuristics, and so on.
>>>>>> Probably cases like JSON would be much more complex, anyway.
>>>>>> Once the metadata have been retrieved for a source, I suppose the
>>>> standard
>>>>>> SQL dialect should work as expected.
>>>>>> 
>>>>>> 
>>>>>> Are there any plans to add catalog metadata support for various sources?
>>>>>> Does anybody have some workaround? for example using views or similar
>>>>>> approaches?
>>>>>> 
>>>>>> 
>>>>>> thanks in advance, sorry if the message is too long :-)
>>>>>> Alfredo
>>>> 
>>>> 
>> 
>

Re: describe query support? (catalog metadata, etc)

Reply via email to