Re: Estimating cost for fs tables, table names after joins and executing logical plans

Jacques Nadeau Sat, 23 May 2015 12:38:25 -0700

One other note:

The default explain plan is a physical plan.  If you want to get a logical
plan,  you need to do EXPLAIN PLAN WITHOUT IMPLEMENTATION FOR ...


If you're still getting the error around plan consumption, please file a
jira.
On May 23, 2015 10:28 AM, "Aman Sinha" <[email protected]> wrote:

> 1.  Yes, certainly the cost estimation is applied for file system sources
> as well.  For Parquet the row count is
>       extracted from the metadata available in the files.  For text (csv,
> json), this is estimated based on the size
>       of the files and a configurable *store.text.estimated_row_size_bytes*
> parameter.   The default value is 100
>       bytes per row, but you can change it.
>
> 2.  There are 2 things in your query that need special handling in Drill
> because it does not rely on a centralized
>       schema.
>       First, the presence of '*' wildcard in the SELECT list - the '*' is
> expanded at run-time unlike the schema-based
>      systems where it is expanded at the initial query planning time.
> Second, whenever there is a join between 2
>       or more tables, the columns in the query need to be qualified by the
> table name because a column  could
>       potentially be present in either of the tables.
>       The following query which explicitly specifies the qualified columns
> in SELECT clause should work:
>        select r_regionkey from (
>            select r.r_regionkey, n.n_regionkey from region r join nation n
> on r.r_regionkey = n.n_regionkey)
>
>  3.  You might want to file a JIRA with the exact reproducible test case.
>  4.   I believe the link you sent for the plan document is quite old.  The
> best place for the latest documentation is
>        drill.apache.org .
>        Non-equality joins are supported in 2 situations:
>         a)  if there is another equality join between the same 2 tables.
>  e.g
>            select n1.n_nationkey from nation n1, nation n2
>                where n1.n_nationkey = n2.n_nationkey and n1.n_regionkey <
> n2.n_regionkey
>         b) if one side of the join is a scalar subquery:
>             select n1.n_nationkey from nation n1, (select min(n_regionkey)
> as min_regionkey from nation n) n2
>                where n1.n_regionkey = n2.min_regionkey;
>
>          Pure cartesian joins are currently not supported.
>
> Aman
>
>
> On Sat, May 23, 2015 at 8:13 AM, Piotr Sokólski <[email protected]>
> wrote:
>
> > Hi, I’ve been playing a bit with v1.0.0 and stumbled upon a few
> > questions/issues:
> >
> > 1. For query cost estimation one usually needs some additional
> information
> > about a table such as the number of rows. Is the cost estimation
> > implemented for fs sources as well? If yes, how is the metadata extracted
> > and cached? From my understanding some formats like parquet store it in
> the
> > file footer, but what about json or csv files? Can this information be
> > queried/retrieved somehow by the user?
> >
> > 2. I’ve been working with the following query:
> >
> > $q = select * from region join nation on region.R_REGIONKEY =
> > nation.N_REGIONKEY;
> >
> > where region and nation are the sample data files imported into a dfs.tmp
> > schema.
> > running queries like
> >
> > select R_REGIONKEY from ($q);
> >
> > results in an error "Column 'R_REGIONKEY' is ambiguous”. However queries
> > like select R_REGIONKEY from (SELECT * FROM region); work fine, as well
> as
> > saving the result of the join with CREATE TABLE and then replacing $q
> with
> > the saved table’s name. Why is that and what are the rules for renaming
> > columns in join queries?
> >
> > 3. I’ve been trying to execute a logical plan using the web interface. It
> > works fine with a simple scan - project query, but when trying to use the
> > output of EXPLAIN … FOR $q (with resultMode changed to “EXEC”) it throws
> > the following error:
> >
> > SYSTEM ERROR: java.lang.IllegalArgumentException: Conflicting
> > property-based creators: already had [constructor for
> > org.apache.drill.common.logical.data.Join,  ...
> >
> > the whole logical query and full error message are at
> > https://gist.github.com/pyetras/bf625b6697de62284996
> >
> > 4. What are the supported conditions for joins? The sql interface seems
> to
> > support only (e1 == e2 [AND])*, but the logical operator reference at
> >
> https://docs.google.com/document/d/1QTL8warUYS2KjldQrGUse7zp8eA72VKtLOHwfXy6c7I/mobilebasic?pli=1#cmnt7
> > mentions other relations and also cartesian joins. Are those simply not
> > implemented for the sql parser or not supported in Drill at all?
> >
> > Sorry for the long read and thanks for your assistance,
> >
> > --
> > Piotr Sokólski
> >
> >
>

Re: Estimating cost for fs tables, table names after joins and executing logical plans

Reply via email to