Ok, as you can see I am reviving old threads :) I am back to working with Carvel (now called Superset) with Charles, and we've moved to a python interface based on the rest API (much cleaner easier to work with, and working better out of the box!)
That said, the issue of typed description of tables is still a challenge. For view that are typed, great, this is easy. The standard "DESCRIBE" table (view) produces the results in a format that makes sense to the dbapi and to caravel. I am having great success on tables that I create a view for. However, I "feel" like DESCRIBE table "should" be able to handle some other situations that given a few tunables would be better than the limit 0 method. So to demonstrate, the restapi providers no data on DESCRIBE for a parquet folder or a LIMIT 0 (See be low) the Limit 0 makes it so it returns nothing... (see my results below) Thus, while ODBC may provide more with Limit 0, instead of hacking Rest to use the weird limit 0 like ODBC, why don't we fix DESCRIBE with tunables? I.e. DESCRIBE parquet_folder... read the first parquet file in there, get the metadata, return the columns and types.... if other files change schema... well, the query will break too? Perhaps a read of cached meta data? Perhaps a tunable that would say read X files ( -1 being all) to return the schema of the folder. Also, DESCRIBE csv folder: Well, if columns/headers are setup, then just come back with column[0] to columns[1], if header reading for decribe is turned on, then have describe read the first line and return the data... Types? ANY... we can default there, just knowing the column names wold be a huge win. Same for other formats... Potentially we could allow each storage plugin to overide DESCRIBE table that returns the basics drill need to return... name, type (ANY or ...) and then other fields people who know data better than I do could handle... The reason this is frustrating in looking at superset (and I would imagine other tools) is that we are now providing multiple behaviors dependent on the access method to drill... I.e. if your initial response to my post is "John is silly, just tell him to use LIMIT 0 with a ODBC" then explain to me how JDBC works? How Rest works? How it differs? Why not have the interface be the same, i.e. DESCRIBE table which is commonly know, and provide overidable data based on the storage plugin... perhaps some JDBC connections or Mongo may provide more columns, that's fine! What ever is querying drill may be able to handle those, but what is the basic information required? By putting it into a limit 0 we have to rely on metadata which isn't consistent or exposed via all Access methods in Drill.... so lets' be consistent and produce an interface for describe! I feel like I am rallying troops for some sort of revolution, but in reality, just trying to sort out my thoughts after living in SQLAlchemy for too long! Also, if you are interested in the work with superset, I am mid doc update. I used to use pyodbc and the odbc interface, but Charles pushed me to look at the rest interface, and using some of his code, and code from drillpy and pydrill, I think this is the best way going forward (and I've gotten the most things in superset to work this way too!) https://github.com/JohnOmernik/sqlalchemy-drill DESCRIBE parquet_folder 200 { "columns" : [ ], "rows" : [ { } ] } select * from parquet_folder limit 0 200 { "columns" : [ ], "rows" : [ { } ] } select * from parquet_folder limit 1 200 { "columns": ["all", "my","columns"], "rows": [{"all":1, "my":2, "columns":"fun!"}] } On Mon, May 16, 2016 at 7:25 PM, Andries Engelbrecht < [email protected]> wrote: > John, > > Using a simple tool like squirrel and running a select * limit 0 against > parquet it is able to retrieve the column names, and if you look at the > returned metadata it identified the data type. Obviously schema change will > be a challenge to deal with with limit 0 queries. This is where views can > be used to force data types for the end tool, skip rows, substitute values, > skip added columns, etc, which makes it much better for most tools to work > with the data. > > A function like typeof does require a row to be returned to identify the > data type. > > > Ideally it will be great to do initial schema discovery with a tool, > associate the metadata with the data sources and then make it available for > wider use. A typeof histogram will be brilliant on data structures with > schema changes, that way it will be possible to see how many records are > affected of a certain type/change/etc. A endless list of possibilities. It > should be feasible to utilize Drill as the execution engine with a smart > tool on top of it to process. > > --Andries > > > > On May 16, 2016, at 4:08 PM, John Omernik <[email protected]> wrote: > > > > So how does a limit0 query return the type if I may ask? I can use > limit 0 > > queries with SqlAlchemy and Caravel, but in looking at it, I wasn't sure > > how where the types were returned in the results (unless it's in the > > underlying metadata not displayed to the user). I can dig more if you > tell > > me it's there :) > > > > On Mon, May 16, 2016 at 4:51 PM, Neeraja Rentachintala < > > [email protected]> wrote: > > > >> Both are options (and thats how the bI tools work with Drill today) > >> > >> - Views with explicit Casts - Will return schema definitions as part > show > >> schemas/describe table/show columns queries. > >> - Limit 0 queries - This will be good as well if we can modify Caravel > to > >> issue such queries (this is what Tableau does) > >> > >> For now, I think returning metadata to Caravel using the above options > will > >> be the solution. The ideal approach would be actually to have a data > >> exploration experience on raw data (without curation) within Caravel > itself > >> to create this metadata as needed. > >> > >> -Neeraja > >> > >> On Mon, May 16, 2016 at 2:45 PM, Ted Dunning <[email protected]> > >> wrote: > >> > >>> As you suggest, views are a critical way to lock down that kind of > >>> information. > >>> > >>> Select with limit 0 is often used for meta-data exploration. This is > more > >>> robust than asking about tables since not everything is necessarily > >> really > >>> in a single table. > >>> > >>> On Mon, May 16, 2016 at 2:12 PM, John Omernik <[email protected]> > wrote: > >>> > >>>> Hey all, as part of my exploration of Caravel, I realized knowing the > >>>> types of columns can be valuable... I can say create a view of a > >>> directory > >>>> of parquet allowing the "show tables" to work well, however, the type > >> for > >>>> every column is "ANY" which may work (need to tweak some things) but I > >> am > >>>> guessing may make certain down stream things in Caravel more > difficult. > >>>> > >>>> So, just thinking aloud here, would it be possible to "cast" in Views > >> to > >>>> allow the view definition to pass along type information? Even if it > >>> means > >>>> a more verbose view definition, it would be done once, and then down > >>> stream > >>>> tools like Caravel would know the types... > >>>> > >>>> Thoughts? > >>>> > >>>> John > >>>> > >>> > >> > >
