1. Yes, cost estimation is applied to file system sources as well. For
Parquet, the row count is extracted from the metadata available in the
files. For text formats (csv, json), the row count is estimated from the
total size of the files and a configurable
*store.text.estimated_row_size_bytes* parameter. The default is 100
bytes per row, but you can change it.
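For example, something along these lines should change it for the
current session (ALTER SYSTEM would make it cluster-wide), and you can
inspect the current value via sys.options (the value 200 is just an
illustration - adjust it to match your data):

alter session set `store.text.estimated_row_size_bytes` = 200;
select * from sys.options where name like 'store.text%';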
2. There are 2 things in your query that need special handling in Drill
because it does not rely on a centralized schema.
First, the presence of the '*' wildcard in the SELECT list: the '*' is
expanded at run-time, unlike in schema-based systems where it is
expanded at initial query planning time.
Second, whenever there is a join between 2 or more tables, the columns
in the query need to be qualified by the table name (or alias) because a
column could potentially be present in more than one of the tables.
The following query, which explicitly specifies the qualified columns
in the SELECT clause, should work:
select r_regionkey from (
  select r.r_regionkey, n.n_regionkey from region r join nation n
  on r.r_regionkey = n.n_regionkey);
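If you prefer the outer query to use an unqualified name of your
choosing, you could also alias the column inside the subquery; a sketch
(the alias region_key is just an example name):

select region_key from (
  select r.r_regionkey as region_key, n.n_regionkey
  from region r join nation n on r.r_regionkey = n.n_regionkey);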
3. You might want to file a JIRA with an exact, reproducible test case.
4. I believe the link you sent for the plan document is quite old. The
best place for the latest documentation is drill.apache.org.
Non-equality joins are supported in 2 situations:
a) if there is another equality join between the same 2 tables, e.g.:

select n1.n_nationkey from nation n1, nation n2
where n1.n_nationkey = n2.n_nationkey
  and n1.n_regionkey < n2.n_regionkey;

b) if one side of the join is a scalar subquery:

select n1.n_nationkey from nation n1,
  (select min(n_regionkey) as min_regionkey from nation n) n2
where n1.n_regionkey = n2.min_regionkey;

Pure cartesian joins are currently not supported.
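For instance, a query with no join condition at all, along the lines of
the one below (shown only as an illustration), would fall into that
unsupported category:

select n1.n_nationkey, n2.n_nationkey from nation n1, nation n2;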
Aman
On Sat, May 23, 2015 at 8:13 AM, Piotr Sokólski <[email protected]>
wrote:
> Hi, I’ve been playing a bit with v1.0.0 and stumbled upon a few
> questions/issues:
>
> 1. For query cost estimation one usually needs some additional information
> about a table such as the number of rows. Is the cost estimation
> implemented for fs sources as well? If yes, how is the metadata extracted
> and cached? From my understanding some formats like parquet store it in the
> file footer, but what about json or csv files? Can this information be
> queried/retrieved somehow by the user?
>
> 2. I’ve been working with the following query:
>
> $q = select * from region join nation on region.R_REGIONKEY =
> nation.N_REGIONKEY;
>
> where region and nation are the sample data files imported into a dfs.tmp
> schema.
> running queries like
>
> select R_REGIONKEY from ($q);
>
> results in an error "Column 'R_REGIONKEY' is ambiguous". However queries
> like select R_REGIONKEY from (SELECT * FROM region); work fine, as well as
> saving the result of the join with CREATE TABLE and then replacing $q with
> the saved table’s name. Why is that and what are the rules for renaming
> columns in join queries?
>
> 3. I’ve been trying to execute a logical plan using the web interface. It
> works fine with a simple scan - project query, but when trying to use the
> output of EXPLAIN … FOR $q (with resultMode changed to “EXEC”) it throws
> the following error:
>
> SYSTEM ERROR: java.lang.IllegalArgumentException: Conflicting
> property-based creators: already had [constructor for
> org.apache.drill.common.logical.data.Join, ...
>
> the whole logical query and full error message are at
> https://gist.github.com/pyetras/bf625b6697de62284996
>
> 4. What are the supported conditions for joins? The sql interface seems to
> support only (e1 == e2 [AND])*, but the logical operator reference at
> https://docs.google.com/document/d/1QTL8warUYS2KjldQrGUse7zp8eA72VKtLOHwfXy6c7I/mobilebasic?pli=1#cmnt7
> mentions other relations and also cartesian joins. Are those simply not
> implemented for the sql parser or not supported in Drill at all?
>
> Sorry for the long read and thanks for your assistance,
>
> --
> Piotr Sokólski
>
>