Re: Drill Views and Typed Columns
Ok, as you can see I am reviving old threads :) I am back to working with Carvel (now called Superset) with Charles, and we've moved to a python interface based on the rest API (much cleaner easier to work with, and working better out of the box!) That said, the issue of typed description of tables is still a challenge. For view that are typed, great, this is easy. The standard "DESCRIBE" table (view) produces the results in a format that makes sense to the dbapi and to caravel. I am having great success on tables that I create a view for. However, I "feel" like DESCRIBE table "should" be able to handle some other situations that given a few tunables would be better than the limit 0 method. So to demonstrate, the restapi providers no data on DESCRIBE for a parquet folder or a LIMIT 0 (See be low) the Limit 0 makes it so it returns nothing... (see my results below) Thus, while ODBC may provide more with Limit 0, instead of hacking Rest to use the weird limit 0 like ODBC, why don't we fix DESCRIBE with tunables? I.e. DESCRIBE parquet_folder... read the first parquet file in there, get the metadata, return the columns and types if other files change schema... well, the query will break too? Perhaps a read of cached meta data? Perhaps a tunable that would say read X files ( -1 being all) to return the schema of the folder. Also, DESCRIBE csv folder: Well, if columns/headers are setup, then just come back with column[0] to columns[1], if header reading for decribe is turned on, then have describe read the first line and return the data... Types? ANY... we can default there, just knowing the column names wold be a huge win. Same for other formats... Potentially we could allow each storage plugin to overide DESCRIBE table that returns the basics drill need to return... name, type (ANY or ...) and then other fields people who know data better than I do could handle... The reason this is frustrating in looking at superset (and I would imagine other tools) is that we are now providing multiple behaviors dependent on the access method to drill... I.e. if your initial response to my post is "John is silly, just tell him to use LIMIT 0 with a ODBC" then explain to me how JDBC works? How Rest works? How it differs? Why not have the interface be the same, i.e. DESCRIBE table which is commonly know, and provide overidable data based on the storage plugin... perhaps some JDBC connections or Mongo may provide more columns, that's fine! What ever is querying drill may be able to handle those, but what is the basic information required? By putting it into a limit 0 we have to rely on metadata which isn't consistent or exposed via all Access methods in Drill so lets' be consistent and produce an interface for describe! I feel like I am rallying troops for some sort of revolution, but in reality, just trying to sort out my thoughts after living in SQLAlchemy for too long! Also, if you are interested in the work with superset, I am mid doc update. I used to use pyodbc and the odbc interface, but Charles pushed me to look at the rest interface, and using some of his code, and code from drillpy and pydrill, I think this is the best way going forward (and I've gotten the most things in superset to work this way too!) https://github.com/JohnOmernik/sqlalchemy-drill DESCRIBE parquet_folder 200 { "columns" : [ ], "rows" : [ { } ] } select * from parquet_folder limit 0 200 { "columns" : [ ], "rows" : [ { } ] } select * from parquet_folder limit 1 200 { "columns": ["all", "my","columns"], "rows": [{"all":1, "my":2, "columns":"fun!"}] } On Mon, May 16, 2016 at 7:25 PM, Andries Engelbrecht < aengelbre...@maprtech.com> wrote: > John, > > Using a simple tool like squirrel and running a select * limit 0 against > parquet it is able to retrieve the column names, and if you look at the > returned metadata it identified the data type. Obviously schema change will > be a challenge to deal with with limit 0 queries. This is where views can > be used to force data types for the end tool, skip rows, substitute values, > skip added columns, etc, which makes it much better for most tools to work > with the data. > > A function like typeof does require a row to be returned to identify the > data type. > > > Ideally it will be great to do initial schema discovery with a tool, > associate the metadata with the data sources and then make it available for > wider use. A typeof histogram will be brilliant on data structures with > schema changes, that way it will be possible to see how many records are > affected of a certain type/change/etc. A endless list of possibilities. It > should be feasible to utilize Drill as the execution engine with a smart > tool on top of it to process. > > --Andries > > > > On May 16, 2016, at 4:08 PM, John Omernikwrote: > > > > So how does a limit0 query return the type if I may ask? I can use > limit 0 > > queries with SqlAlchemy and Caravel, but in looking
Re: Drill Views and Typed Columns
John, Using a simple tool like squirrel and running a select * limit 0 against parquet it is able to retrieve the column names, and if you look at the returned metadata it identified the data type. Obviously schema change will be a challenge to deal with with limit 0 queries. This is where views can be used to force data types for the end tool, skip rows, substitute values, skip added columns, etc, which makes it much better for most tools to work with the data. A function like typeof does require a row to be returned to identify the data type. Ideally it will be great to do initial schema discovery with a tool, associate the metadata with the data sources and then make it available for wider use. A typeof histogram will be brilliant on data structures with schema changes, that way it will be possible to see how many records are affected of a certain type/change/etc. A endless list of possibilities. It should be feasible to utilize Drill as the execution engine with a smart tool on top of it to process. --Andries > On May 16, 2016, at 4:08 PM, John Omernikwrote: > > So how does a limit0 query return the type if I may ask? I can use limit 0 > queries with SqlAlchemy and Caravel, but in looking at it, I wasn't sure > how where the types were returned in the results (unless it's in the > underlying metadata not displayed to the user). I can dig more if you tell > me it's there :) > > On Mon, May 16, 2016 at 4:51 PM, Neeraja Rentachintala < > nrentachint...@maprtech.com> wrote: > >> Both are options (and thats how the bI tools work with Drill today) >> >> - Views with explicit Casts - Will return schema definitions as part show >> schemas/describe table/show columns queries. >> - Limit 0 queries - This will be good as well if we can modify Caravel to >> issue such queries (this is what Tableau does) >> >> For now, I think returning metadata to Caravel using the above options will >> be the solution. The ideal approach would be actually to have a data >> exploration experience on raw data (without curation) within Caravel itself >> to create this metadata as needed. >> >> -Neeraja >> >> On Mon, May 16, 2016 at 2:45 PM, Ted Dunning >> wrote: >> >>> As you suggest, views are a critical way to lock down that kind of >>> information. >>> >>> Select with limit 0 is often used for meta-data exploration. This is more >>> robust than asking about tables since not everything is necessarily >> really >>> in a single table. >>> >>> On Mon, May 16, 2016 at 2:12 PM, John Omernik wrote: >>> Hey all, as part of my exploration of Caravel, I realized knowing the types of columns can be valuable... I can say create a view of a >>> directory of parquet allowing the "show tables" to work well, however, the type >> for every column is "ANY" which may work (need to tweak some things) but I >> am guessing may make certain down stream things in Caravel more difficult. So, just thinking aloud here, would it be possible to "cast" in Views >> to allow the view definition to pass along type information? Even if it >>> means a more verbose view definition, it would be done once, and then down >>> stream tools like Caravel would know the types... Thoughts? John >>> >>
Re: Drill Views and Typed Columns
So how does a limit0 query return the type if I may ask? I can use limit 0 queries with SqlAlchemy and Caravel, but in looking at it, I wasn't sure how where the types were returned in the results (unless it's in the underlying metadata not displayed to the user). I can dig more if you tell me it's there :) On Mon, May 16, 2016 at 4:51 PM, Neeraja Rentachintala < nrentachint...@maprtech.com> wrote: > Both are options (and thats how the bI tools work with Drill today) > > - Views with explicit Casts - Will return schema definitions as part show > schemas/describe table/show columns queries. > - Limit 0 queries - This will be good as well if we can modify Caravel to > issue such queries (this is what Tableau does) > > For now, I think returning metadata to Caravel using the above options will > be the solution. The ideal approach would be actually to have a data > exploration experience on raw data (without curation) within Caravel itself > to create this metadata as needed. > > -Neeraja > > On Mon, May 16, 2016 at 2:45 PM, Ted Dunning> wrote: > > > As you suggest, views are a critical way to lock down that kind of > > information. > > > > Select with limit 0 is often used for meta-data exploration. This is more > > robust than asking about tables since not everything is necessarily > really > > in a single table. > > > > On Mon, May 16, 2016 at 2:12 PM, John Omernik wrote: > > > > > Hey all, as part of my exploration of Caravel, I realized knowing the > > > types of columns can be valuable... I can say create a view of a > > directory > > > of parquet allowing the "show tables" to work well, however, the type > for > > > every column is "ANY" which may work (need to tweak some things) but I > am > > > guessing may make certain down stream things in Caravel more difficult. > > > > > > So, just thinking aloud here, would it be possible to "cast" in Views > to > > > allow the view definition to pass along type information? Even if it > > means > > > a more verbose view definition, it would be done once, and then down > > stream > > > tools like Caravel would know the types... > > > > > > Thoughts? > > > > > > John > > > > > >
Re: Drill Views and Typed Columns
Both are options (and thats how the bI tools work with Drill today) - Views with explicit Casts - Will return schema definitions as part show schemas/describe table/show columns queries. - Limit 0 queries - This will be good as well if we can modify Caravel to issue such queries (this is what Tableau does) For now, I think returning metadata to Caravel using the above options will be the solution. The ideal approach would be actually to have a data exploration experience on raw data (without curation) within Caravel itself to create this metadata as needed. -Neeraja On Mon, May 16, 2016 at 2:45 PM, Ted Dunningwrote: > As you suggest, views are a critical way to lock down that kind of > information. > > Select with limit 0 is often used for meta-data exploration. This is more > robust than asking about tables since not everything is necessarily really > in a single table. > > On Mon, May 16, 2016 at 2:12 PM, John Omernik wrote: > > > Hey all, as part of my exploration of Caravel, I realized knowing the > > types of columns can be valuable... I can say create a view of a > directory > > of parquet allowing the "show tables" to work well, however, the type for > > every column is "ANY" which may work (need to tweak some things) but I am > > guessing may make certain down stream things in Caravel more difficult. > > > > So, just thinking aloud here, would it be possible to "cast" in Views to > > allow the view definition to pass along type information? Even if it > means > > a more verbose view definition, it would be done once, and then down > stream > > tools like Caravel would know the types... > > > > Thoughts? > > > > John > > >
Re: Drill Views and Typed Columns
As you suggest, views are a critical way to lock down that kind of information. Select with limit 0 is often used for meta-data exploration. This is more robust than asking about tables since not everything is necessarily really in a single table. On Mon, May 16, 2016 at 2:12 PM, John Omernikwrote: > Hey all, as part of my exploration of Caravel, I realized knowing the > types of columns can be valuable... I can say create a view of a directory > of parquet allowing the "show tables" to work well, however, the type for > every column is "ANY" which may work (need to tweak some things) but I am > guessing may make certain down stream things in Caravel more difficult. > > So, just thinking aloud here, would it be possible to "cast" in Views to > allow the view definition to pass along type information? Even if it means > a more verbose view definition, it would be done once, and then down stream > tools like Caravel would know the types... > > Thoughts? > > John >
Drill Views and Typed Columns
Hey all, as part of my exploration of Caravel, I realized knowing the types of columns can be valuable... I can say create a view of a directory of parquet allowing the "show tables" to work well, however, the type for every column is "ANY" which may work (need to tweak some things) but I am guessing may make certain down stream things in Caravel more difficult. So, just thinking aloud here, would it be possible to "cast" in Views to allow the view definition to pass along type information? Even if it means a more verbose view definition, it would be done once, and then down stream tools like Caravel would know the types... Thoughts? John