Hi Paul, One of the issues that I ran into was that while Drill does support LIMIT 0 queries, the RESTful interface does not. That would actually be a really helpful starting point, and one that I imagine would not be too difficult to implement. If you can give me some direction in terms of what files to look at, I can have a go at that. — C
> On Oct 25, 2017, at 1:13 AM, Paul Rogers <[email protected]> wrote: > > Hi All, > > Drill supports two forms of CSV files: those with headers and those without. > > When a file has headers (and Drill is configured to read them), you can get > the schema using the metadata API in xDBC or the native Drill client. Do a > LIMIT 0 and you’ll get back just the schema of Varchar columns inferred from > the CSV headers. > > On the other hand, if you read the CSV file without headers, then you already > know the schema: a single `columns` column that is a Varchar array — but I > suppose that is not terribly useful... > > What Drill does not do is return a table-like result with the metadata; like > what DESCRIBE would do. Instead, you have to use the xDBC metadata API to > retrieve the information. > > But, given that Drill does support LIMIT 0, perhaps some enterprising soul > might want to build the DESCRIBE support on top of it. You’d need some SQL > syntax (might already be in Calcite), then an operator that converts the > internal batch schema to a table of Java objects, like is done for system > options, etc. > > The next issue is that if your CSV file looks like this: > > fred,47,1960-05-06,14.56 > > It does not help much to find out that these are all Varchar columns. Would > be nice to have some light-weight way of defining the type (not a cast; the > cast loads the data as Varchar, then does a conversion, which can be slow.) > > But, Drill’s claim to fame is schema-on-read, so Drill has no way to specify > a schema at present. This would be another good weekend project. The storage > plugin would look for a schema file at plan time, load it, and pass the > schema information to the reader which would do the required conversions on > read. The result would be that CSV files behave more like Parquet files > (which do specify column type as well as name.) > > - Paul > >> On Oct 24, 2017, at 6:21 AM, Charles Givre <[email protected]> wrote: >> >> @Divya, >> I’m aware of that, but there is no way as it stands to discover the schema >> of a file, CSV or otherwise. I’m not talking about attempting to infer data >> types, just the simple number of columns, name of columns and what Drill >> thinks they are. I know that you can do this via a series of queries, but >> for files, there is no equivalent of DESCRIBE. >> That’s what I was getting at and I think that would be really useful to >> have. Also it wouldn’t really alter the character of Drill. >> — C >> >> >>> On Oct 24, 2017, at 02:02, Divya Gehlot <[email protected]> wrote: >>> >>> @Charles : when you read csv files can casting and get the data type as per >>> your requirements >>> >>> Hope I answer what you asked for :) >>> >>> HTH >>> Thanks, >>> Divya >>> >>> On 20 October 2017 at 01:42, Charles Givre <[email protected]> wrote: >>> >>>> Hi Alfredo, >>>> When I was trying to get Drill to work with various BI tools, all I really >>>> needed was a list of columns. Data types would be a big bonus, but Drill >>>> interprets CSV data as plain text anyway. It would be really useful for >>>> other file types where Drill does infer data types. >>>> >>>> >>>> — C >>>> >>>> >>>>> On Oct 19, 2017, at 6:13 AM, Alfredo Serafini <[email protected]> wrote: >>>>> >>>>> Hi thanks for the replies! >>>>> >>>>> @Chun yes using Views is an approach I considered, and I like it also >>>>> methodologically, in order to have some change to "prepare" the data >>>>> just a bit. I'm testing drill as a sort of data facade for tools which >>>>> handles mappings to other context, so this could be helpful for me. >>>>> >>>>> Anyway I have some concerns regardings metadata/catalog support for >>>>> views too: it seems that every view is saved on disk as a JSON file, >>>>> then experimenting the same issues. Are you suggesting saving views to >>>>> some kind of relational database storage for staging purposes? Is that >>>>> possible? >>>>> >>>>> Sorry for all the questions :-) >>>>> >>>>> >>>>> @Charles yes Metabase (or Tableau, Superset, and so on...) is another >>>>> use case in which it would be great to connect them to explore data >>>>> with the capabilities of drill, and even for an initial exploration of >>>>> data since sometimes reducing the initial analysis phase time could >>>>> help with development. >>>>> >>>>> For CSV it would be possible IMHO to guess types in a very basic way, >>>>> at least using basic types and map columns to a text/String when a >>>>> type can't be inferreed. It could be a starting point, and probably >>>>> the more confortable case where to start for the (partial) support of >>>>> catalog informations (JSON would be more complex, just to say). If >>>>> there are standard interfaces that can be extended/implemented for >>>>> filling them with those informations I'd like to do some >>>>> experimentation on that, if it's not too complex to follow, and if >>>>> someone can point me to a good place where to start for doing some >>>>> experiments of a possible implementation, for the CSV case. >>>>> >>>>> Thanks for the comments, I appreciate them >>>>> >>>>> Alfredo >>>>> >>>>> >>>>> >>>>> I’d like to second Alfredo’s request. I’ve been trying to get Drill >>>>> to work with some >>>>>> open source visualization tools such as SqlPad and Metabase and the >>>> issue I keep running into >>>>>> is that Drill doesn’t have a convenient way to describe how it >>>> interprets flat files. This >>>>>> is really frustrating for me since this is my main use of Drill! >>>>>> I wish the SELECT * FROM <data> LIMIT 0 worked in the RESTFul >>>> interface. In any event, >>>>>> would be very useful to have some way to get Drill to describe how it >>>> will interpret a flat >>>>>> file. >>>>>> — C >>>>> >>>>> >>>>> >>>>>> On Oct 18, 2017, at 15:20, Chun Chang <[email protected]> wrote: >>>>>> >>>>>> There were discussions on the need of building a catalog for drill. But >>>> I don't think >>>>> that's the focus right now. And I am not sure the community will ever >>>>> decide to go in that >>>>> direction. For now, you best bet is to create views on top of your >>>>> JSON/CSV data. >>>>>> >>>>>> ________________________________ >>>>>> From: Alfredo Serafini <[email protected]> >>>>>> Sent: Wednesday, October 18, 2017 8:31:15 AM >>>>>> To: [email protected] >>>>>> Subject: describe query support? (catalog metadata, etc) >>>>>> >>>>>> Hi I'm experimenting using Drill as a data virtualization component via >>>>>> JDBC and it generally works great for my needs. >>>>>> >>>>>> However some of the components connected via JDBC needs basic >>>>>> metadata/catalog informations, and they seems to be missing for JSON / >>>> CSV >>>>>> sources. >>>>>> >>>>>> For example the simple query >>>>>> >>>>>> DESCRIBE cp.`employee.json`; >>>>>> >>>>>> returns no results. >>>>>> >>>>>> Another possible example case could be when reading from an sqlite >>>> source >>>>>> containing the same data on an `employees` table >>>>>> DESCRIBE `emploees` >>>>>> >>>>>> and still get no information: while this command is not directly >>>> supported >>>>>> in SQLite, an equivalent one could be for instance: >>>>>> PRAGMA table_info(`employees`); >>>>>> >>>>>> but trying to execute it in Drill is not possible, as it is beyond the >>>>>> supported standard SQL dialect. >>>>>> >>>>>> Moreover using a query like: >>>>>> SELECT * >>>>>> FROM INFORMATION_SCHEMA.COLUMNS >>>>>> WHERE (TABLE_NAME='employees_view'); >>>>>> >>>>>> on a view from the same data, seems to return the informations, so I >>>>>> suppose there should be a way to pass those informations to an >>>>>> internal *DatabaseMetaData >>>>>> <https://docs.oracle.com/javase/8/docs/api/java/sql/ >>>> DatabaseMetaData.html>* >>>>>> implementation. >>>>>> I wonder if there is such a component designed to manage all the catalog >>>>>> informations for different sources? >>>>>> >>>>>> In this case it could adopt different strategies for retrieving >>>> metadata, >>>>>> depending on the case: for sqlite a different command / dialect could be >>>>>> used, for CSV types could be guessed using simple heuristics, and so on. >>>>>> Probably cases like JSON would be much more complex, anyway. >>>>>> Once the metadata have been retrieved for a source, I suppose the >>>> standard >>>>>> SQL dialect should work as expected. >>>>>> >>>>>> >>>>>> Are there any plans to add catalog metadata support for various sources? >>>>>> Does anybody have some workaround? for example using views or similar >>>>>> approaches? >>>>>> >>>>>> >>>>>> thanks in advance, sorry if the message is too long :-) >>>>>> Alfredo >>>> >>>> >> >
