Re: Drill Views and Typed Columns

John Omernik Wed, 01 Mar 2017 10:23:52 -0800

Ok, as you can see I am reviving old threads :)
I am back to working with Carvel (now called Superset) with Charles, and
we've moved to a python interface based on the rest API (much cleaner
easier to work with, and working better out of the box!)

That said, the issue of typed description of tables is still a challenge.

For view that are typed, great, this is easy. The standard "DESCRIBE" table
(view) produces the results in a format that makes sense to the dbapi and
to caravel. I am having great success on tables that I create a view for.

However, I "feel" like DESCRIBE table "should" be able to handle some other
situations that given a few tunables would be better than the limit 0
method.

So to demonstrate, the restapi providers no data on DESCRIBE for a parquet
folder or a LIMIT 0 (See be low) the Limit 0 makes it so it returns
nothing... (see my results below)

Thus, while ODBC may provide more with Limit 0, instead of hacking Rest to
use the weird limit 0 like ODBC, why don't we fix DESCRIBE with tunables?

I.e.

DESCRIBE parquet_folder... read the first parquet file in there, get the
metadata, return the columns and types.... if other files change schema...
well, the query will break too?  Perhaps a read of cached meta data?
Perhaps a tunable that would say read X files ( -1 being all) to return the
schema of the folder.

Also, DESCRIBE csv folder: Well, if columns/headers are setup, then just
come back with column[0] to columns[1], if header reading for decribe is
turned on, then have describe read the first line and return the data...
Types? ANY... we can default there, just knowing the column names wold be a
huge win.  Same for other formats...

Potentially we could allow each storage plugin to overide DESCRIBE table
that returns the basics drill need to return... name, type (ANY or ...)
 and then other fields people who know data better than I do could
handle...

The reason this is frustrating in looking at superset (and I would imagine
other tools) is that we are now providing multiple behaviors dependent on
the access method to drill... I.e. if your initial response to my post is
"John is silly, just tell him to use LIMIT 0 with a ODBC" then explain to
me how JDBC works? How Rest works? How it differs?  Why not have the
interface be the same, i.e. DESCRIBE table which is commonly know, and
provide overidable data based on the storage plugin... perhaps some JDBC
connections or Mongo may provide more columns, that's fine!  What ever is
querying drill may be able to handle those, but what is the basic
information required?  By putting it into a limit 0 we have to rely on
metadata which isn't consistent or exposed via all Access methods in
Drill.... so lets' be consistent and produce an interface for describe!

I feel like I am rallying troops for some sort of revolution, but in
reality, just trying to sort out my thoughts after living in SQLAlchemy for
too long!

Also, if you are interested in the work with superset, I am mid doc update.
I used to use pyodbc and the odbc interface, but Charles pushed me to look
at the rest interface, and using some of his code, and code from drillpy
and pydrill, I think this is the best way going forward (and I've gotten
the most things in superset to work this way too!)

https://github.com/JohnOmernik/sqlalchemy-drill

DESCRIBE parquet_folder
200
{
  "columns" : [ ],
  "rows" : [ { } ]
}

select * from parquet_folder limit 0
200
{
  "columns" : [ ],
  "rows" : [ { } ]
}

select * from parquet_folder limit 1
200
{
"columns": ["all", "my","columns"],
"rows": [{"all":1, "my":2, "columns":"fun!"}]
}

On Mon, May 16, 2016 at 7:25 PM, Andries Engelbrecht <
[email protected]> wrote:

> John,
>
> Using a simple tool like squirrel and running a select * limit 0 against
> parquet it is able to retrieve the column names, and if you look at the
> returned metadata it identified the data type. Obviously schema change will
> be a challenge to deal with with limit 0 queries. This is where views can
> be used to force data types for the end tool, skip rows, substitute values,
> skip added columns, etc, which makes it much better for most tools to work
> with the data.
>
> A function like typeof does require a row to be returned to identify the
> data type.
>
>
> Ideally it will be great to do initial schema discovery with a tool,
> associate the metadata with the data sources and then make it available for
> wider use. A typeof histogram will be brilliant on data structures with
> schema changes, that way it will be possible to see how many records are
> affected of a certain type/change/etc. A endless list of possibilities. It
> should be feasible to utilize Drill as the execution engine with a smart
> tool on top of it to process.
>
> --Andries
>
>
> > On May 16, 2016, at 4:08 PM, John Omernik <[email protected]> wrote:
> >
> > So how does a limit0 query return the type if I may ask?  I can use
> limit 0
> > queries with SqlAlchemy and Caravel, but in looking at it, I wasn't sure
> > how where the types were returned in the results (unless it's in the
> > underlying metadata not displayed to the user).  I can dig more if you
> tell
> > me it's there :)
> >
> > On Mon, May 16, 2016 at 4:51 PM, Neeraja Rentachintala <
> > [email protected]> wrote:
> >
> >> Both are options (and thats how the bI tools work with Drill today)
> >>
> >> - Views with explicit Casts - Will return schema definitions as part
> show
> >> schemas/describe table/show columns queries.
> >> - Limit 0 queries - This will be good as well if we can modify Caravel
> to
> >> issue such queries (this is what Tableau does)
> >>
> >> For now, I think returning metadata to Caravel using the above options
> will
> >> be the solution. The ideal approach would be actually to have a  data
> >> exploration experience on raw data (without curation) within Caravel
> itself
> >> to create this metadata as needed.
> >>
> >> -Neeraja
> >>
> >> On Mon, May 16, 2016 at 2:45 PM, Ted Dunning <[email protected]>
> >> wrote:
> >>
> >>> As you suggest, views are a critical way to lock down that kind of
> >>> information.
> >>>
> >>> Select with limit 0 is often used for meta-data exploration. This is
> more
> >>> robust than asking about tables since not everything is necessarily
> >> really
> >>> in a single table.
> >>>
> >>> On Mon, May 16, 2016 at 2:12 PM, John Omernik <[email protected]>
> wrote:
> >>>
> >>>> Hey all, as part of my exploration of Caravel,  I realized knowing the
> >>>> types of columns can be valuable... I can say create a view of a
> >>> directory
> >>>> of parquet allowing the "show tables" to work well, however, the type
> >> for
> >>>> every column is "ANY" which may work (need to tweak some things) but I
> >> am
> >>>> guessing may make certain down stream things in Caravel more
> difficult.
> >>>>
> >>>> So, just thinking aloud here, would it be possible to "cast" in Views
> >> to
> >>>> allow the view definition to pass along type information?  Even if it
> >>> means
> >>>> a more verbose view definition, it would be done once, and then down
> >>> stream
> >>>> tools like Caravel would know the types...
> >>>>
> >>>> Thoughts?
> >>>>
> >>>> John
> >>>>
> >>>
> >>
>
>

Re: Drill Views and Typed Columns

Reply via email to