Hi Paul, all, thanks for sharing information and ideas.
> But, given that Drill does support LIMIT 0, perhaps some enterprising soul
> might want to build the DESCRIBE support on top of it. You’d need some SQL
> syntax (might already be in Calcite), then an operator that converts the
> internal batch schema to a table of Java objects, like is done for system
> options, etc.

Yes, I agree this would be very useful!

> But, Drill’s claim to fame is schema-on-read, so Drill has no way to
> specify a schema at present. This would be another good weekend project.
> The storage plugin would look for a schema file at plan time, load it, and
> pass the schema information to the reader which would do the required
> conversions on read. The result would be that CSV files behave more like
> Parquet files (which do specify column type as well as name.)

Inferring basic types from some rows of data is possible in CSV, using some
simple heuristics: I have a small class that does that, and given an idea of
where to plug it in, it could be used as a starting point. A more solid
approach could be added later, covering many more cases, as Elasticsearch
does for example (when in schema-less mode), even for JSON.

Can anybody share a reference to the classes involved, where those naive
type inferences could be added? Is it at the Calcite level, maybe?

Thanks very much for all the references,
Alfredo

> ---------- Forwarded message ----------
> From: Paul Rogers <[email protected]>
> To: "[email protected]" <[email protected]>
> Cc:
> Bcc:
> Date: Wed, 25 Oct 2017 05:13:54 +0000
> Subject: Re: describe query support? (catalog metadata, etc)
>
> Hi All,
>
> Drill supports two forms of CSV files: those with headers and those
> without.
>
> When a file has headers (and Drill is configured to read them), you can
> get the schema using the metadata API in xDBC or the native Drill client.
> Do a LIMIT 0 and you’ll get back just the schema of Varchar columns
> inferred from the CSV headers.
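The LIMIT 0 trick Paul describes can be sketched in a few lines. The example below uses Python's standard sqlite3 module as a stand-in for a real Drill connection (with Drill you would go through its JDBC or ODBC driver instead; the table and columns here are invented for illustration). After executing the query, the cursor's description carries the column names even though no rows come back.

```python
import sqlite3

# Toy stand-in for a data source; with Drill this would be a JDBC/ODBC
# connection instead (the table and columns are illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, age INTEGER, hired TEXT)")

# LIMIT 0 returns no rows, but the cursor still exposes the result schema.
cur = conn.execute("SELECT * FROM employees LIMIT 0")
columns = [desc[0] for desc in cur.description]
print(columns)  # ['name', 'age', 'hired']
```

Through JDBC, the equivalent information comes from ResultSetMetaData on the empty result set, which is what the metadata API mentioned above builds on.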
>
> On the other hand, if you read the CSV file without headers, then you
> already know the schema: a single `columns` column that is a Varchar array
> — but I suppose that is not terribly useful...
>
> What Drill does not do is return a table-like result with the metadata,
> like what DESCRIBE would do. Instead, you have to use the xDBC metadata API
> to retrieve the information.
>
> But, given that Drill does support LIMIT 0, perhaps some enterprising soul
> might want to build the DESCRIBE support on top of it. You’d need some SQL
> syntax (might already be in Calcite), then an operator that converts the
> internal batch schema to a table of Java objects, like is done for system
> options, etc.
>
> The next issue is that if your CSV file looks like this:
>
> fred,47,1960-05-06,14.56
>
> It does not help much to find out that these are all Varchar columns.
> Would be nice to have some light-weight way of defining the type (not a
> cast; the cast loads the data as Varchar, then does a conversion, which can
> be slow.)
>
> But, Drill’s claim to fame is schema-on-read, so Drill has no way to
> specify a schema at present. This would be another good weekend project.
> The storage plugin would look for a schema file at plan time, load it, and
> pass the schema information to the reader which would do the required
> conversions on read. The result would be that CSV files behave more like
> Parquet files (which do specify column type as well as name.)
>
> - Paul
>
> > On Oct 24, 2017, at 6:21 AM, Charles Givre <[email protected]> wrote:
> >
> > @Divya,
> > I’m aware of that, but there is no way as it stands to discover the
> > schema of a file, CSV or otherwise. I’m not talking about attempting to
> > infer data types, just the simple number of columns, name of columns and
> > what Drill thinks they are. I know that you can do this via a series of
> > queries, but for files, there is no equivalent of DESCRIBE.
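Paul's schema-file idea can be shown in miniature. The sketch below is entirely hypothetical (the side-car schema shape, the names, and the converters are all invented for illustration): a declared schema drives typed conversion while reading, instead of loading everything as Varchar and casting afterwards.

```python
import csv
import io
from datetime import date

# Hypothetical side-car schema for a headerless CSV file; in Paul's sketch
# the storage plugin would load this at plan time and pass it to the reader.
SCHEMA = [("name", str), ("age", int),
          ("birth_date", date.fromisoformat), ("rate", float)]

def read_with_schema(text, schema):
    """Apply the declared converters on read, not as a cast afterwards."""
    for row in csv.reader(io.StringIO(text)):
        yield {col: convert(value)
               for (col, convert), value in zip(schema, row)}

rows = list(read_with_schema("fred,47,1960-05-06,14.56\n", SCHEMA))
print(rows[0])
# {'name': 'fred', 'age': 47, 'birth_date': datetime.date(1960, 5, 6), 'rate': 14.56}
```

In Drill itself this conversion would live inside the format reader and produce typed value vectors, but the division of labor is the same: schema at plan time, conversion at read time.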
> > That’s what I was getting at and I think that would be really useful to
> > have. Also it wouldn’t really alter the character of Drill.
> >
> > — C
> >
> >> On Oct 24, 2017, at 02:02, Divya Gehlot <[email protected]> wrote:
> >>
> >> @Charles: when you read CSV files you can cast and get the data types
> >> as per your requirements.
> >>
> >> Hope I answered what you asked for :)
> >>
> >> HTH
> >> Thanks,
> >> Divya
> >>
> >> On 20 October 2017 at 01:42, Charles Givre <[email protected]> wrote:
> >>
> >>> Hi Alfredo,
> >>> When I was trying to get Drill to work with various BI tools, all I
> >>> really needed was a list of columns. Data types would be a big bonus,
> >>> but Drill interprets CSV data as plain text anyway. It would be really
> >>> useful for other file types where Drill does infer data types.
> >>>
> >>> — C
> >>>
> >>>> On Oct 19, 2017, at 6:13 AM, Alfredo Serafini <[email protected]> wrote:
> >>>>
> >>>> Hi, thanks for the replies!
> >>>>
> >>>> @Chun yes, using views is an approach I considered, and I like it also
> >>>> methodologically, in order to have some chance to "prepare" the data
> >>>> just a bit. I'm testing Drill as a sort of data facade for tools which
> >>>> handle mappings to other contexts, so this could be helpful for me.
> >>>>
> >>>> Anyway, I have some concerns regarding metadata/catalog support for
> >>>> views too: it seems that every view is saved on disk as a JSON file,
> >>>> thus running into the same issues. Are you suggesting saving views to
> >>>> some kind of relational database storage for staging purposes? Is that
> >>>> possible?
> >>>>
> >>>> Sorry for all the questions :-)
> >>>>
> >>>> @Charles yes, Metabase (or Tableau, Superset, and so on...)
> >>>> is another use case in which it would be great to connect them to
> >>>> explore data with the capabilities of Drill, and even for an initial
> >>>> exploration of data, since sometimes reducing the initial analysis
> >>>> phase could help with development.
> >>>>
> >>>> For CSV it would be possible IMHO to guess types in a very basic way,
> >>>> at least using basic types and mapping columns to text/String when a
> >>>> type can't be inferred. It could be a starting point, and probably
> >>>> the more comfortable case from which to start for the (partial)
> >>>> support of catalog information (JSON would be more complex, just to
> >>>> say). If there are standard interfaces that can be extended or
> >>>> implemented for filling them with that information, I'd like to do
> >>>> some experimentation on that, if it's not too complex to follow, and
> >>>> if someone can point me to a good place to start for doing some
> >>>> experiments on a possible implementation, for the CSV case.
> >>>>
> >>>> Thanks for the comments, I appreciate them
> >>>>
> >>>> Alfredo
> >>>>
> >>>>> I’d like to second Alfredo’s request. I’ve been trying to get Drill
> >>>>> to work with some open source visualization tools such as SqlPad and
> >>>>> Metabase, and the issue I keep running into is that Drill doesn’t
> >>>>> have a convenient way to describe how it interprets flat files. This
> >>>>> is really frustrating for me since this is my main use of Drill!
> >>>>> I wish the SELECT * FROM <data> LIMIT 0 worked in the RESTful
> >>>>> interface. In any event, it would be very useful to have some way to
> >>>>> get Drill to describe how it will interpret a flat file.
> >>>>> — C
> >>>>
> >>>>> On Oct 18, 2017, at 15:20, Chun Chang <[email protected]> wrote:
> >>>>>
> >>>>> There were discussions on the need of building a catalog for Drill.
> >>>>> But I don't think that's the focus right now.
> >>>>> And I am not sure the community will ever decide to go in that
> >>>>> direction. For now, your best bet is to create views on top of your
> >>>>> JSON/CSV data.
> >>>>>
> >>>>> ________________________________
> >>>>> From: Alfredo Serafini <[email protected]>
> >>>>> Sent: Wednesday, October 18, 2017 8:31:15 AM
> >>>>> To: [email protected]
> >>>>> Subject: describe query support? (catalog metadata, etc)
> >>>>>
> >>>>> Hi, I'm experimenting with using Drill as a data virtualization
> >>>>> component via JDBC and it generally works great for my needs.
> >>>>>
> >>>>> However, some of the components connected via JDBC need basic
> >>>>> metadata/catalog information, which seems to be missing for JSON /
> >>>>> CSV sources.
> >>>>>
> >>>>> For example, the simple query
> >>>>>
> >>>>> DESCRIBE cp.`employee.json`;
> >>>>>
> >>>>> returns no results.
> >>>>>
> >>>>> Another possible example case could be reading from an SQLite source
> >>>>> containing the same data in an `employees` table:
> >>>>>
> >>>>> DESCRIBE `employees`
> >>>>>
> >>>>> and still getting no information. While this command is not directly
> >>>>> supported in SQLite, an equivalent one could be, for instance:
> >>>>>
> >>>>> PRAGMA table_info(`employees`);
> >>>>>
> >>>>> but trying to execute it in Drill is not possible, as it is beyond
> >>>>> the supported standard SQL dialect.
> >>>>>
> >>>>> Moreover, using a query like:
> >>>>>
> >>>>> SELECT *
> >>>>> FROM INFORMATION_SCHEMA.COLUMNS
> >>>>> WHERE (TABLE_NAME='employees_view');
> >>>>>
> >>>>> on a view of the same data seems to return the information, so I
> >>>>> suppose there should be a way to pass that information to an internal
> >>>>> *DatabaseMetaData
> >>>>> <https://docs.oracle.com/javase/8/docs/api/java/sql/DatabaseMetaData.html>*
> >>>>> implementation.
> >>>>> I wonder if there is such a component designed to manage all the
> >>>>> catalog information for different sources?
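For reference, the PRAGMA mentioned above is easy to try directly against SQLite from Python (a minimal sketch; the employees table is created on the fly for illustration). It returns one row per column, from which a name-to-type mapping falls out directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, age INTEGER, salary REAL)")

# PRAGMA table_info yields one row per column:
# (cid, name, type, notnull, dflt_value, pk)
schema = {row[1]: row[2]
          for row in conn.execute("PRAGMA table_info(employees)")}
print(schema)  # {'name': 'TEXT', 'age': 'INTEGER', 'salary': 'REAL'}
```

Since Drill does not pass the PRAGMA through, this kind of per-source probing would presumably have to live inside the storage plugin rather than in user queries.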
> >>>>>
> >>>>> In this case it could adopt different strategies for retrieving
> >>>>> metadata, depending on the case: for SQLite a different command /
> >>>>> dialect could be used, for CSV types could be guessed using simple
> >>>>> heuristics, and so on. Cases like JSON would probably be much more
> >>>>> complex, anyway.
> >>>>> Once the metadata has been retrieved for a source, I suppose the
> >>>>> standard SQL dialect should work as expected.
> >>>>>
> >>>>> Are there any plans to add catalog metadata support for various
> >>>>> sources? Does anybody have some workaround? For example using views
> >>>>> or similar approaches?
> >>>>>
> >>>>> Thanks in advance, sorry if the message is too long :-)
> >>>>> Alfredo
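The naive CSV type inference discussed in this thread could start from a per-value heuristic along these lines. This is only a sketch: the type names and the precedence order (integer before double before date) are illustrative choices, not anything Drill actually does.

```python
from datetime import date

def infer_type(value: str) -> str:
    """Guess a basic type for one CSV field, falling back to VARCHAR."""
    v = value.strip()
    try:
        int(v)
        return "INTEGER"
    except ValueError:
        pass
    try:
        float(v)
        return "DOUBLE"
    except ValueError:
        pass
    try:
        date.fromisoformat(v)
        return "DATE"
    except ValueError:
        pass
    if v.lower() in ("true", "false"):
        return "BOOLEAN"
    return "VARCHAR"

def infer_row_types(row):
    """Infer one type per column from a single sample row."""
    return [infer_type(field) for field in row]

# Paul's example row from earlier in the thread:
print(infer_row_types(["fred", "47", "1960-05-06", "14.56"]))
# ['VARCHAR', 'INTEGER', 'DATE', 'DOUBLE']
```

A more solid implementation would sample several rows and widen conflicting guesses (e.g. INTEGER vs DOUBLE to DOUBLE, anything vs VARCHAR to VARCHAR), which is roughly what Alfredo alludes to with Elasticsearch's schema-less mapping inference.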
