Re: Drill selects column with the same name of a different table

2017-08-21 Thread Zelaine Fong
There was a similar (but not identical) issue reported as 
https://issues.apache.org/jira/browse/DRILL-4374.  In that case, an error was 
returned, but the root cause seemed to stem from using the Postgres JDBC 
storage plugin to access tables with the same column name.  From the comment 
thread on the bug, it looks like the bug wasn’t reproducible.  So, if you have 
a consistent repro with schema and data, it would be good to log a new bug with 
that repro info.

-- Zelaine

On 8/21/17, 3:22 PM, "Kunal Khatua"  wrote:

Could you share the profile (the *.sys.drill file or the
http://<drillbit-host>:8047/profiles/<query-id>.json)?

This might be a bug with the JDBC Storage plugin. 

A quick way to validate this would be to store the same data as 2
text/parquet tables and have Drill read from those. If we don't see the issue
there, then it is most likely a storage plugin bug, for which you should file a JIRA.
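
For example, a rough validation sketch (hypothetical names: a writable
`dfs.tmp` workspace and a Postgres storage plugin called `pg`; the column
lists come from the tables described below):

-- Copy both tables out of Postgres into Parquet via CTAS
CREATE TABLE dfs.tmp.`department_pq` AS
  SELECT id, name FROM pg.public.department;
CREATE TABLE dfs.tmp.`employee_pq` AS
  SELECT id, name, dept, salary FROM pg.public.employee;

-- Re-run the failing join against the Parquet copies only
SELECT d.name, e.salary
FROM dfs.tmp.`employee_pq` e
INNER JOIN dfs.tmp.`department_pq` d ON e.dept = d.id
WHERE d.name = 'A';

If the Parquet copies return the right column, the JDBC storage plugin is the
likely culprit.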



-Original Message-
From: Hielke Hoeve [mailto:hielke.ho...@topicus.nl] 
Sent: Monday, August 21, 2017 7:05 AM
To: user@drill.apache.org
Subject: Drill selects column with the same name of a different table 

Hi everyone. I just started using Drill/Zeppelin and I’m facing a strange 
problem. I have a PostgreSQL database linked to a Drill instance. Whenever I am 
trying to join 2 tables which both have a column name and whenever I want to 
select this name Drill selects the wrong name column. What am I doing wrong?

Given the following 2 tables:

Department
| id | name |
|----|------|
| 1  |  A   |
| 2  |  B   |

Employee
| id | name | dept | salary |
|----|------|------|--------|
| 1  |  U   |  1   |  100   |
| 2  |  V   |  1   |   75   |
| 3  |  W   |  1   |  120   |
| 4  |  X   |  2   |   95   |
| 5  |  Y   |  2   |  140   |
| 6  |  Z   |  2   |   55   |

Running

select employee.name, employee.salary
from employee
inner join department on employee.dept = department.id where 
department.name = 'A'

returns

| name | salary |
|------|--------|
|  A   |  100   |
|  A   |   75   |
|  A   |  120   |

Running

select dept.name, employee.salary
from employee
inner join department on employee.dept = department.id where 
department.name = 'A'

returns

| name | salary |
|------|--------|
| null |  100   |
| null |   75   |
| null |  120   |


What does work, but seems very silly to me, is:

select dept.name, employee.salary
from employee
inner join (select id, name as deptname from department) as department on 
employee.dept = department.id where department.deptname = 'A'
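
A slightly less awkward variant (untested against this particular bug) keeps
the plain join and only aliases the column in the select list:

select department.name as deptname, employee.salary
from employee
inner join department on employee.dept = department.id
where department.name = 'A'

If that still returns the wrong values, it points at the same underlying
plugin issue rather than anything in the subquery rewrite.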






Re: Running cartesian joins on Drill

2017-05-11 Thread Zelaine Fong
I’m not sure why it isn’t working for you.  Using Drill 1.10, here’s my output:

0: jdbc:drill:zk=local> alter session set 
`planner.enable_nljoin_for_scalar_only` = false;
+-------+--------------------------------------------------+
|  ok   |                     summary                      |
+-------+--------------------------------------------------+
| true  | planner.enable_nljoin_for_scalar_only updated.   |
+-------+--------------------------------------------------+
1 row selected (0.137 seconds)
0: jdbc:drill:zk=local> explain plan for select * from 
dfs.`/Users/zfong/foo.csv` t1, dfs.`/Users/zfong/foo.csv` t2;
+------+------+
| text | json |
+------+------+
| 00-00    Screen
00-01      ProjectAllowDup(*=[$0], *0=[$1])
00-02        NestedLoopJoin(condition=[true], joinType=[inner])
00-04          Project(T2¦¦*=[$0])
00-06            Scan(groupscan=[EasyGroupScan
[selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`],
files=[file:/Users/zfong/foo.csv]]])
00-03          Project(T3¦¦*=[$0])
00-05            Scan(groupscan=[EasyGroupScan
[selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`],
files=[file:/Users/zfong/foo.csv]]])

-- Zelaine

On 5/11/17, 3:17 PM, "Muhammad Gelbana" <m.gelb...@gmail.com> wrote:

​But the query I provided failed to be planned because it's a cartesian
join, although I've set the option you mentioned to false. Is there a
reason why Drill's rules wouldn't physically implement the logical join in my
query as a nested loop join ?

*-*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana

On Thu, May 11, 2017 at 5:05 PM, Zelaine Fong <zf...@mapr.com> wrote:

> Provided `planner.enable_nljoin_for_scalar_only` is set to false, even
> without an explicit join condition, the query should use the Cartesian
> join/nested loop join.
>
> -- Zelaine
>
> On 5/11/17, 4:20 AM, "Anup Tiwari" <anup.tiw...@games24x7.com> wrote:
>
> Hi,
>
> I have one question here.. so if we have to use Cartesian join in Drill
> then do we have to follow some workaround like Shadi mentioned: adding a
> dummy column on the fly that has the value 1 in both tables and then join
> on that column leading to having a match of every row of the first table
> with every row of the second table, hence do a Cartesian product?
> OR
> If we just don't specify a join condition like:
> select a.*, b.* from tt1 as a, tt2 b; then will it internally treat this
> query as a Cartesian join?
>
> Regards,
> *Anup Tiwari*
>
> On Mon, May 8, 2017 at 10:00 PM, Zelaine Fong <zf...@mapr.com> wrote:
>
> > Cartesian joins in Drill are implemented as nested loop joins, and I
> think
> > you should see that reflected in the resultant query plan when you
> run
> > explain plan on the query.
> >
> > Yes, Cartesian joins/nested loop joins are expensive because you’re
> > effectively doing an MxN read of your tables.  There are more
> efficient
> > ways of processing a nested loop join, e.g., by creating an index on
> the
> > larger table in the join and then using that index to do lookups
> into that
> > table.  That way, the nested loop join cost is the cost of creating
> the
> > index + M, where M is the number of rows in the smaller table and
> assuming
> > the lookup cost into the index does minimize the amount of data read
> of the
> > second table.  Drill currently doesn’t do this.
> >
> > -- Zelaine
> >
> > On 5/8/17, 9:09 AM, "Muhammad Gelbana" <m.gelb...@gmail.com> wrote:
> >
> > I believe clhubert is referring to this discussion
> > <http://drill-user.incubator.apache.narkive.com/TIXWiTY4/
> > cartesian-product-in-apache-drill#post1>
> > .
> >
> > So why Drill doesn't transform this query into a nested join query ?
> > Simply because there is no Calcite rule to transform it into a nested
> > loop join ? Is it not technically possible to write such Rule or is it
> > feasible so I may take on this challenge ?
> >
> > Also pardon me for repeating my question but I fail to find an answer in
> > your replies, why doesn't Drill just run a cartesian join ?

Re: Running cartesian joins on Drill

2017-05-11 Thread Zelaine Fong
Provided `planner.enable_nljoin_for_scalar_only` is set to false, even without 
an explicit join condition, the query should use the Cartesian join/nested loop 
join.
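
Putting both pieces together, using the table names from your question, a
session would look like this (a sketch; tt1/tt2 stand in for your tables):

alter session set `planner.enable_nljoin_for_scalar_only` = false;
-- No join condition, so the only viable plan is a Cartesian/nested loop join:
select a.*, b.* from tt1 as a, tt2 b;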

-- Zelaine

On 5/11/17, 4:20 AM, "Anup Tiwari" <anup.tiw...@games24x7.com> wrote:

Hi,

I have one question here.. so if we have to use Cartesian join in Drill
then do we have to follow some workaround like Shadi mentioned: adding a
dummy column on the fly that has the value 1 in both tables and then join
on that column leading to having a match of every row of the first table
with every row of the second table, hence do a Cartesian product?
OR
If we just don't specify a join condition like:
select a.*, b.* from tt1 as a, tt2 b; then will it internally treat this
query as a Cartesian join?

Regards,
*Anup Tiwari*

On Mon, May 8, 2017 at 10:00 PM, Zelaine Fong <zf...@mapr.com> wrote:

> Cartesian joins in Drill are implemented as nested loop joins, and I think
> you should see that reflected in the resultant query plan when you run
> explain plan on the query.
>
> Yes, Cartesian joins/nested loop joins are expensive because you’re
> effectively doing an MxN read of your tables.  There are more efficient
> ways of processing a nested loop join, e.g., by creating an index on the
> larger table in the join and then using that index to do lookups into that
> table.  That way, the nested loop join cost is the cost of creating the
> index + M, where M is the number of rows in the smaller table and assuming
> the lookup cost into the index does minimize the amount of data read of the
> second table.  Drill currently doesn’t do this.
>
> -- Zelaine
>
> On 5/8/17, 9:09 AM, "Muhammad Gelbana" <m.gelb...@gmail.com> wrote:
>
> I believe clhubert is referring to this discussion
> <http://drill-user.incubator.apache.narkive.com/TIXWiTY4/
> cartesian-product-in-apache-drill#post1>
> .
>
> So why Drill doesn't transform this query into a nested join query ? Simply
> because there is no Calcite rule to transform it into a nested loop join ?
> Is it not technically possible to write such Rule or is it feasible so I
> may take on this challenge ?
>
> Also pardon me for repeating my question but I fail to find an answer
> in
> your replies, why doesn't Drill just run a cartesian join ? Because
> it's
> expensive regarding resources (i.e. CPU\Network\RAM) ?
>
> Thanks a lot Shadi for the query, it works for me.
>
> *-*
> *Muhammad Gelbana*
> http://www.linkedin.com/in/mgelbana
>
> On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa <khal...@cs.queensu.ca>
> wrote:
>
> > Hi Muhammad,
> >
> > I did the following as a workaround to have Cartesian product. The basic
> > idea is to create a dummy column on the fly that has the value 1 in both
> > tables and then join on that column leading to having a match of every row
> > of the first table with every row of the second table, hence do a Cartesian
> > product. This might not be the most efficient way but it will do the job.
> >
> > *Original Query:*
> > SELECT * FROM
> > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
> > 2147483647) `t0`
> > INNER JOIN
> > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
> > 2147483647) `t1`
> > ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)
> > LIMIT 2147483647
> >
> > *Workaround (add columns **d1a381f3g73 and **d1a381f3g74 to tables one
> > and two, respectively. Names don't really matter, just need to be unique):*
> > SELECT * FROM
> > ( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM
> > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0`
> > INNER JOIN
> > ( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM
> > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1`
> > ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)
> > WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`
> > LIMIT 2147483647
> >
> > Regards
> >
> >
> > *Shadi Khalifa, PhD*
> > Postdoctoral Fellow
> > Cogn

Re: querying from multiple directories in S3

2017-05-10 Thread Zelaine Fong
Drill will only scan the files in the 2017/03 directory.  See 
https://drill.apache.org/docs/how-to-partition-data/, which describes an 
example very similar to your use case.
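
For the non-UTC day that started the thread, the usual approach is to OR
together the directory predicates for the two UTC days the local day spans.
A sketch for one UTC-4 day, assuming the root/YYYY/MM/DD/HH layout described
earlier (directory values are strings here; adjust to how yours are named):

SELECT count(1)
FROM root
WHERE (dir0 = '2017' AND dir1 = '03' AND dir2 = '15' AND dir3 >= '04')
   OR (dir0 = '2017' AND dir1 = '03' AND dir2 = '16' AND dir3 < '04');

Each OR branch names a single day directory, so partition pruning can limit
the scan to the matching subdirectories.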

-- Zelaine

On 5/10/17, 10:32 AM, "Wesley Chow"  wrote:

Suppose that I have a directory structure in S3 like so:

root/YYYY/MM/{lots of files}

Where YYYY and MM are year and month numbers. If I run a query like:

SELECT count(1) FROM root WHERE dir0='2017' AND dir1='03';

Does Drill do a scan to find all files in root, thus picking up files from
2016, and then filter them down to ones matching dir0='2017' and dir1='03'
before reading the data? That's what I meant by "scan all the files." Or
does Drill know that it only has to do a scan of files in the 2017/03
directory?

Wes


On Wed, May 10, 2017 at 12:15 PM, Chunhui Shi  wrote:

> I think what Charles meant was "WHERE (dir2 = 15 AND dir3 < 20) OR (dir2 =
> 14 AND dir3 > 4)",  and of course you need to add dir0 and dir1 for year
> and month.
>
>
> And what do you mean by "scan all the files on every query", scan all the
> files of one day data, I thought this was your purpose?
>
> 
> From: Wesley Chow 
> Sent: Wednesday, May 10, 2017 9:04:12 AM
> To: user@drill.apache.org
> Subject: Re: querying from multiple directories in S3
>
> I don't think so, because doesn't AND commute, which would mean dir2 = 15
> AND dir2=14 would always be false?
>
> Even if there is some comparison that works, isn't there still an issue
> that the S3 file source has to scan all the files on every query?
>
> Wes
>
> On Wed, May 10, 2017 at 8:15 AM, Charles Givre  wrote:
>
> > Hi Wes,
> > Are you putting the dirX fields in the WHERE clause?
> > IE  Couldn't you do soemthing like:
> >
> > SELECT  
> > FROM s3.data
> > WHERE (dir2 = 15 AND dir3 < 20) AND (dir2 = 14 AND dir3 > 4)
> >
> > In theory this could work for UTC -4.  It’s ugly… but I think it would
> > work.
> > — C
> >
> >
> >
> > > On May 9, 2017, at 10:06, Wesley Chow  wrote:
> > >
> > > What is the recommended way to issue a query against a large number of
> > > tables in S3? At the moment I'm aliasing the table as a giant UNION ALL,
> > > but is there a better way to do this?
> > >
> > > Our data is stored as a time hierarchy, like /YYYY/MM/DD/HH/MM in UTC,
> > > but unfortunately I can't simply run the query recursively on an entire
> > > day of data. I usually need a day of data in a non-UTC time zone. Is
> > > there some elegant way to grab that data using the dir0, dir1 magic
> > > columns?
> > >
> > > Thanks,
> > > Wes
> >
> >
>




Re: Running cartesian joins on Drill

2017-05-08 Thread Zelaine Fong
Cartesian joins in Drill are implemented as nested loop joins, and I think you 
should see that reflected in the resultant query plan when you run explain plan 
on the query.

Yes, Cartesian joins/nested loop joins are expensive because you’re effectively 
doing an MxN read of your tables.  There are more efficient ways of processing 
a nested loop join, e.g., by creating an index on the larger table in the join 
and then using that index to do lookups into that table.  That way, the nested 
loop join cost is the cost of creating the index + M, where M is the number of 
rows in the smaller table and assuming the lookup cost into the index does 
minimize the amount of data read of the second table.  Drill currently doesn’t 
do this.

-- Zelaine

On 5/8/17, 9:09 AM, "Muhammad Gelbana"  wrote:

I believe clhubert is referring to this discussion
<http://drill-user.incubator.apache.narkive.com/TIXWiTY4/
cartesian-product-in-apache-drill#post1>
.

So why Drill doesn't transform this query into a nested join query ? Simply
because there is no Calcite rule to transform it into a nested loop join ?
Is it not technically possible to write such Rule or is it feasible so I
may take on this challenge ?

Also pardon me for repeating my question but I fail to find an answer in
your replies, why doesn't Drill just run a cartesian join ? Because it's
expensive regarding resources (i.e. CPU\Network\RAM) ?

Thanks a lot Shadi for the query, it works for me.

*-*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana

On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa  wrote:

> Hi Muhammad,
>
> I did the following as a workaround to have Cartesian product. The basic
> idea is to create a dummy column on the fly that has the value 1 in both
> tables and then join on that column leading to having a match of every row
> of the first table with every row of the second table, hence do a Cartesian
> product. This might not be the most efficient way but it will do the job.
>
> *Original Query:*
> SELECT * FROM
> ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
> 2147483647) `t0`
> INNER JOIN
> ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
> 2147483647) `t1`
> ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)
> LIMIT 2147483647
>
> *Workaround (add columns **d1a381f3g73 and **d1a381f3g74 to tables one
> and two, respectively. Names don't really matter, just need to be unique):*
> SELECT * FROM
> ( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM
> `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0`
> INNER JOIN
> ( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM
> `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1`
> ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)
> WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`
> LIMIT 2147483647
>
> Regards
>
>
> *Shadi Khalifa, PhD*
> Postdoctoral Fellow
> Cognitive Analytics Development Hub
> Centre for Advanced Computing
> Queen’s University
> (613) 533-6000 x78347
> http://cac.queensu.ca
>
> I'm just a neuron in the society collective brain
>
> *Join us for HPCS in June 2017! Register at:*  *http://2017.hpcs.ca/
> *
>
> P Please consider your environmental responsibility before printing this
> e-mail
>
>
> *The information transmitted is intended only for the person or entity to
> which it is addressed and may contain confidential material. Any review or
> dissemination of this information by persons other than the intended
> recipient is prohibited. If you received this in error, please contact the
> sender and delete the material from any computer. Thank you.*
>
>
>
> On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana 
> wrote:
>
>
> Here it is:
>
> SELECT * FROM (SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
> LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
> `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
> `t0`.`UserID` IS NOT DISTINCT FROM
> `t1`.`UserID`) LIMIT 2147483647
>
> I debugged Drill code and found it decomposes *IS NOT DISTINCT FROM* into
> *`t0`.`UserID` = `t1`.`UserID` OR (`t0`.`UserID` IS NULL && `t1`.`UserID`
> IS NULL)* while checking if the query is a cartesian join, and when the
> check returns true, it throws an exception saying: *This query cannot be
> planned possibly due to either a cartesian join or an inequality join*
>
>
> 

Re: External Sort - Unable to Allocate Buffer error

2017-05-01 Thread Zelaine Fong
Nate,

The Jira you’ve referenced relates to the new external sort, which is not 
enabled by default, as it is still going through some additional testing.  If 
you’d like to try it to see if it resolves your problem, you’ll need to 
set “sort.external.disable_managed” as follows in your drill-override.conf
file:

drill.exec: {
  cluster-id: "drillbits1",
  zk.connect: "localhost:2181",
  sort.external.disable_managed: false
}

and run the following query:

ALTER SESSION SET `exec.sort.disable_managed` = false;
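
If you want to confirm the option took effect before re-running the query,
you can check it in sys.options (optional sanity check):

SELECT * FROM sys.options WHERE name LIKE '%disable_managed%';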

-- Zelaine

On 5/1/17, 7:44 AM, "Nate Butler"  wrote:

We keep running into this issue when trying to issue a query with hashagg
disabled. When I look at system memory usage though, drill doesn't seem to
be using much of it but still hits this error.

Our environment:

- 1 r3.8xl
- 1 drillbit version 1.10.0 configured with 4GB of Heap and 230G of Direct
- Data stored on S3 is compressed CSV

I've tried increasing planner.memory.max_query_memory_per_node to 230G and
lowered planner.width.max_per_query to 1 and it still fails.

We've applied the patch from this bug in the hopes that it would resolve
the issue but it hasn't:

https://issues.apache.org/jira/browse/DRILL-5226

Stack Trace:

  (org.apache.drill.exec.exception.OutOfMemoryException) Unable to allocate
buffer of size 16777216 due to memory limit. Current allocation: 8445952
org.apache.drill.exec.memory.BaseAllocator.buffer():220
org.apache.drill.exec.memory.BaseAllocator.buffer():195
org.apache.drill.exec.vector.VarCharVector.reAlloc():425
org.apache.drill.exec.vector.VarCharVector.copyFromSafe():278
org.apache.drill.exec.vector.NullableVarCharVector.copyFromSafe():379

org.apache.drill.exec.test.generated.PriorityQueueCopierGen328.doCopy():22
org.apache.drill.exec.test.generated.PriorityQueueCopierGen328.next():75


org.apache.drill.exec.physical.impl.xsort.ExternalSortBatch.mergeAndSpill():602

org.apache.drill.exec.physical.impl.xsort.ExternalSortBatch.innerNext():428
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109


org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch.innerNext():137
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.physical.impl.BaseRootExec.next():104


org.apache.drill.exec.physical.impl.partitionsender.PartitionSenderRootExec.innerNext():144
org.apache.drill.exec.physical.impl.BaseRootExec.next():94
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
java.security.AccessController.doPrivileged():-2
javax.security.auth.Subject.doAs():422
org.apache.hadoop.security.UserGroupInformation.doAs():1657
org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
org.apache.drill.common.SelfCleaningRunnable.run():38
java.util.concurrent.ThreadPoolExecutor.runWorker():1142
java.util.concurrent.ThreadPoolExecutor$Worker.run():617
java.lang.Thread.run():745 (state=,code=0)

Is there something I'm missing here? Any help/direction would be
appreciated.

Thanks,
Nate




Re: JDBC SQL parse error on CTAS

2017-04-25 Thread Zelaine Fong
Yes. Because Drill is “schema free”, the LIMIT 0 is issued internally by
Drill as part of the prepare so that Drill can return schema information for
the query without actually fetching any data.

If you can avoid the prepare of the CTAS statement, that would avoid the 
problem.
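
As an illustration (the table and column here are hypothetical), the
prepare-time rewrite versus the direct statement looks like:

-- What the prepare generates, which fails to parse:
--   SELECT * FROM (CREATE TABLE dfs.tmp.`foo` AS SELECT 1 AS x FROM (VALUES(1))) LIMIT 0
-- Executing the CTAS directly, outside a prepared statement, avoids the wrapper:
CREATE TABLE dfs.tmp.`foo` AS SELECT 1 AS x FROM (VALUES(1));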

-- Zelaine

On 4/25/17, 3:44 PM, "Wesley Chow" <w...@chartbeat.com> wrote:

Is the point in the LIMIT 0 to trick the planner into not executing the
statement in some way? When I apply the patch to not wrap CTAS with a
LIMIT 0 query I get "table exists" errors, which implies that the query is
getting executed twice. Just trying to figure this out so I know if I
should fix this on the Drill side or work around it by avoiding prepared
statements...

Thanks,
Wes


On Mon, Apr 24, 2017 at 9:38 PM, Wesley Chow <w...@chartbeat.com> wrote:

> Yes that sounds like it. I'm looking at this patch for the corresponding
> issue:
>
> https://github.com/apache/drill/pull/698/files#diff-
> 6b274c93088e03d7ee1d7d266fcf4ac8R133
>
> And I don't understand why the issue says that this patch only handles the
> "show schema" case, but it looks like it handles any case that isn't
> SELECT. That said, for it to be completely correct it should also handle
> UPDATEs, right? I'm admittedly unclear on what the purpose of the "LIMIT 0"
> is...
>
> I'll try applying this patch to see if it fixes my issue.
    >
> Thanks,
> Wes
>
>
> On Mon, Apr 24, 2017 at 5:49 PM, Zelaine Fong <zf...@mapr.com> wrote:
>
>> I suspect you’ve hit https://issues.apache.org/jira/browse/DRILL-5136.
>> See the comment trail which mentions the problem with CTAS.
>>
>> -- Zelaine
>>
>> On 4/24/17, 2:45 PM, "Kunal Khatua" <kkha...@mapr.com> wrote:
>>
>> I'm not sure if you can do a
>>
>> select * from (CTAS ..)
>>
>> The output of CTAS really is only a metric of the rows that got
>> written by the various fragments. Are you able to run the query in SQLLine?
>>
>> The parse error does a pretty good job of pointing out where the
>> query appears to have broken the parser's semantics.  So it is possible
>> that you need some escape characters to handle the query, without which,
>> Python might be altering the query.
>>
>>
>> Kunal
>>
>> 
>> From: Wesley Chow <w...@chartbeat.com>
>> Sent: Monday, April 24, 2017 2:33:14 PM
>> To: user@drill.apache.org
>> Subject: JDBC SQL parse error on CTAS
>>
>> I'm seeing a funny issue where issuing SELECT statements through the
>> JDBC
>> driver is fine, but when I prepend them with CREATE TABLE AS, I start
>> getting parse errors. If I copy+paste the query string into sqlline
>> the
>> CTAS works. When the thing breaks, it gives me an exception with
>> message
>> like:
>>
>> SQL Query SELECT * FROM (CREATE TABLE foo ... ) LIMIT 0
>>
>> And complains about a parse error at the CREATE TABLE step. Now this
>> is
>> weird to me, because it seems like the driver is automatically
>> wrapping the
>> CTAS with a "SELECT * FROM ... LIMIT 0"?
>>
>> An additional complication is that this isn't straight up JDBC. This
>> is
>> Python calling out to the JDBC driver, so admittedly I might be
>> throwing a
>> wrench in there somewhere. But it doesn't make sense to me that
>> either the
>> Python layer or JDBC would modify the query in that way. I've checked
>> the
>> Python wrapper source and there I couldn't find anything to indicate
>> it
>> might be changing the query. I also briefly sifted through the JDBC
>> code
>> and couldn't find anything as well.
>>
>> Any clues as to what might be going on?
>>
>> Wes
>>
>>
>>
>




Re: JDBC SQL parse error on CTAS

2017-04-24 Thread Zelaine Fong
I suspect you’ve hit https://issues.apache.org/jira/browse/DRILL-5136.  See the 
comment trail which mentions the problem with CTAS.

-- Zelaine

On 4/24/17, 2:45 PM, "Kunal Khatua"  wrote:

I'm not sure if you can do a

select * from (CTAS ..)

The output of CTAS really is only a metric of the rows that got written by 
the various fragments. Are you able to run the query in SQLLine?

The parse error does a pretty good job of pointing out where the query 
appears to have broken the parser's semantics.  So it is possible that you need 
some escape characters to handle the query, without which, Python might be 
altering the query.


Kunal


From: Wesley Chow 
Sent: Monday, April 24, 2017 2:33:14 PM
To: user@drill.apache.org
Subject: JDBC SQL parse error on CTAS

I'm seeing a funny issue where issuing SELECT statements through the JDBC
driver is fine, but when I prepend them with CREATE TABLE AS, I start
getting parse errors. If I copy+paste the query string into sqlline the
CTAS works. When the thing breaks, it gives me an exception with message
like:

SQL Query SELECT * FROM (CREATE TABLE foo ... ) LIMIT 0

And complains about a parse error at the CREATE TABLE step. Now this is
weird to me, because it seems like the driver is automatically wrapping the
CTAS with a "SELECT * FROM ... LIMIT 0"?

An additional complication is that this isn't straight up JDBC. This is
Python calling out to the JDBC driver, so admittedly I might be throwing a
wrench in there somewhere. But it doesn't make sense to me that either the
Python layer or JDBC would modify the query in that way. I've checked the
Python wrapper source and there I couldn't find anything to indicate it
might be changing the query. I also briefly sifted through the JDBC code
and couldn't find anything as well.

Any clues as to what might be going on?

Wes




Re: Minimise query plan time for dfs plugin for local file system on tsv file

2017-03-09 Thread Zelaine Fong
I believe Chunhui has a fix for DRILL-5089 but is waiting on the Calcite 
portions of the fix to be accepted.

-- Zelaine

On 3/9/17, 7:47 AM, "Jinfeng Ni" <j...@apache.org> wrote:

DRILL-5089 has hit many drill users for very long time; it was
reported several times on this drill user list. It would make sense to
bump it's priority and get it fixed in the next Drill release.


On Wed, Mar 8, 2017 at 8:45 AM, Zelaine Fong <zf...@mapr.com> wrote:
> The slowness you’re seeing when you have additional storage plugins that 
are not being used by your query is due to DRILL-5089.
>
> -- Zelaine
>
> On 3/7/17, 11:20 PM, "PROJJWAL SAHA" <proj.s...@gmail.com> wrote:
>
> Hi Kunal,
>
> Good catch ! Thanks for this pointer.
> I enabled logging at org.apache.drill level and I found -
>
> 2017-03-08 01:39:01,822 [274058fa-79df-9c74-3219-3fcb83a04605:foreman] INFO
>  o.a.drill.exec.work.foreman.Foreman - Query text for query id
> 274058fa-79df-9c74-3219-3fcb83a04605: explain plan for select * from
> dfs.root.`/scratch/localdisk/drill/testdata/Cust_1G_20_tsv` where
> ORDER_ID='41' and CUSTOMER_ID='568'
> *2017-03-08 01:39:01,823 [274058fa-79df-9c74-3219-3fcb83a04605:foreman]
> DEBUG o.a.d.e.s.h.HBaseStoragePluginConfig - Initializing HBase
> StoragePlugin configuration with zookeeper quorum 'localhost', port '2181'.*
> *2017-03-08 01:39:16,038 [274058fa-79df-9c74-3219-3fcb83a04605:foreman]
> DEBUG o.a.drill.exec.store.SchemaFactory - Took 14214 ms to register
> schemas.*
>
>
> i am not sure why the hbase storage plugin comes in play as it is 
disabled.
> i then disabled all the other active plugins that i had and just kept 
the
> dfs plugin.
>
> the planning time is now reduced to 0.9 secs
> and the query time for 1GB partitioned tsv data is taking 3.63 secs
>
> is that a reasonable behaviour ?
>
> Regards,
> Projjwal
>
> On Wed, Mar 8, 2017 at 12:11 AM, Kunal Khatua <kkha...@mapr.com> 
wrote:
>
> >
> > Looking at the 1st two lines of the log shows that the bulk of time was
> > lost before the query even went into the real planning stage of the query:
> >
> >
> > 2017-03-07 06:27:28,074 [274166de-f543-3fa7-ef9e-8e9e87d5d6a0:foreman]
> > INFO  o.a.drill.exec.work.foreman.Foreman - Query text for query id
> > 274166de-f543-3fa7-ef9e-8e9e87d5d6a0: select columns[0] from
> > dfs.root.`/scratch/localdisk/drill/testdata/Cust_1G_20_tsv` where
> > columns[0] ='41' and columns[3] ='568'
> > 2017-03-07 06:28:00,775 [274166de-f543-3fa7-ef9e-8e9e87d5d6a0:foreman]
> > INFO  o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses()
> > took 0 ms, numFiles: 1
> >
> >
> > More than 30 secs is unaccounted for. Can you turn on the root logger to
> > be at the debug level and retry the explain plan?
> >
> >
> > Kunal Khatua
> >
> >
> > 
> > From: rahul challapalli <challapallira...@gmail.com>
> > Sent: Tuesday, March 7, 2017 5:24:43 AM
> > To: user
> > Subject: Re: Minimise query plan time for dfs plugin for local file system
> > on tsv file
> >
> > I did not get a chance to review the log file.
> >
> > However the next thing I would try is to make your cluster a single node
> > cluster first and then run the same explain plan query separately on each
> > individual file.
> >
> >
> >
> > On Mar 7, 2017 5:09 AM, "PROJJWAL SAHA" <proj.s...@gmail.com> wrote:
> >
> > > Hi Rahul,
> > >
> > > thanks for your suggestions. However, I am still not able to see any
> > > reduction in query planning time
> > > by explicit column names, removing extract headers and using
> > columns[index]
> > >
> > > As I said, I ran explain plan and its taking 30+ secs for me.
> > > My data is 1 GB tsv split into 20 files of 5 MB each.
> > > Each 5MB file has close to 50k records
> > > Its a 5 node cluster, and width per node is 4
  

Re: Minimise query plan time for dfs plugin for local file system on tsv file

2017-03-08 Thread Zelaine Fong
The slowness you’re seeing when you have additional storage plugins that are 
not being used by your query is due to DRILL-5089.

-- Zelaine

On 3/7/17, 11:20 PM, "PROJJWAL SAHA"  wrote:

Hi Kunal,

Good catch ! Thanks for this pointer.
I enabled logging at org.apache.drill level and I found -

2017-03-08 01:39:01,822 [274058fa-79df-9c74-3219-3fcb83a04605:foreman] INFO
 o.a.drill.exec.work.foreman.Foreman - Query text for query id
274058fa-79df-9c74-3219-3fcb83a04605: explain plan for select * from
dfs.root.`/scratch/localdisk/drill/testdata/Cust_1G_20_tsv` where
ORDER_ID='41' and CUSTOMER_ID='568'
*2017-03-08 01:39:01,823 [274058fa-79df-9c74-3219-3fcb83a04605:foreman]
DEBUG o.a.d.e.s.h.HBaseStoragePluginConfig - Initializing HBase
StoragePlugin configuration with zookeeper quorum 'localhost', port '2181'.*
*2017-03-08 01:39:16,038 [274058fa-79df-9c74-3219-3fcb83a04605:foreman]
DEBUG o.a.drill.exec.store.SchemaFactory - Took 14214 ms to register
schemas.*


i am not sure why the hbase storage plugin comes in play as it is disabled.
i then disabled all the other active plugins that i had and just kept the
dfs plugin.

the planning time is now reduced to 0.9 secs
and the query time for 1GB partitioned tsv data is taking 3.63 secs

is that a reasonable behaviour ?

Regards,
Projjwal

On Wed, Mar 8, 2017 at 12:11 AM, Kunal Khatua  wrote:

>
> Looking at the 1st two lines of the log shows that the bulk of time was
> lost before the query even went into the real planning stage of the query:
>
>
> 2017-03-07 06:27:28,074 [274166de-f543-3fa7-ef9e-8e9e87d5d6a0:foreman]
> INFO  o.a.drill.exec.work.foreman.Foreman - Query text for query id
> 274166de-f543-3fa7-ef9e-8e9e87d5d6a0: select columns[0] from
> dfs.root.`/scratch/localdisk/drill/testdata/Cust_1G_20_tsv` where
> columns[0] ='41' and columns[3] ='568'
> 2017-03-07 06:28:00,775 [274166de-f543-3fa7-ef9e-8e9e87d5d6a0:foreman]
> INFO  o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses()
> took 0 ms, numFiles: 1
>
>
> More than 30 secs is unaccounted for. Can you turn on the root logger to
> be at the debug level and retry the explain plan?
>
>
> Kunal Khatua
>
>
> 
> From: rahul challapalli 
> Sent: Tuesday, March 7, 2017 5:24:43 AM
> To: user
> Subject: Re: Minimise query plan time for dfs plugin for local file system
> on tsv file
>
> I did not get a chance to review the log file.
>
> However the next thing I would try is to make your cluster a single node
> cluster first and then run the same explain plan query separately on each
> individual file.
>
>
>
> On Mar 7, 2017 5:09 AM, "PROJJWAL SAHA"  wrote:
>
> > Hi Rahul,
> >
> > thanks for your suggestions. However, I am still not able to see any
> > reduction in query planning time
> > by explicit column names, removing extract headers and using
> columns[index]
> >
> > As I said, I ran explain plan and its taking 30+ secs for me.
> > My data is 1 GB tsv split into 20 files of 5 MB each.
> > Each 5MB file has close to 50k records
> > Its a 5 node cluster, and width per node is 4
> > Therefore, total number of minor fragments for one major fragment is 20
> > I have copied the source directory in all the drillbit nodes
> >
> > can you tell me a reasonable time estimate which I can expect drill to
> > return result for query for the above described scenario.
> > Query is - select columns[0] from dfs.root.`/scratch/localdisk/
> drill/testdata/Cust_1G_20_tsv`
> > where columns[0] ='41' and columns[3] ='568'
> >
> > attached is the json profile
> > and the drillbit.log
> >
> > I also have the tracing enabled.
> > org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler
> > org.apache.drill.exec.work.foreman.Foreman
> > however i see the duration of various steps in the order of ms in the
> logs.
> > i am not sure where planning time of the order of 30 secs is consumed.
> >
> > Please help
> >
> > Regards,
> > Projjwal
> >
> >
> >
> >
> >
> >
> >
> > On Mon, Mar 6, 2017 at 11:23 PM, rahul challapalli <
> > challapallira...@gmail.com> wrote:
> >
> >> You can try the below things. For each of the below check the planning
> >> time
> >> individually
> >>
> >> 1. Run explain plan for a simple "select * from `
> >> /scratch/localdisk/drill/testdata/Cust_1G_tsv`"
> >> 2. Replace the '*' in your query with explicit column names
> >> 3. Remove the extract header from your storage plugin configuration and
> 

Re: Storage Plugin for accessing Hive ORC Table from Drill

2017-01-20 Thread Zelaine Fong
The stack trace shows the following:

Caused by: org.apache.drill.common.exceptions.DrillRuntimeException:
java.io.IOException: Failed to get numRows from HiveTable

The Drill optimizer is trying to read rowcount information from Hive. Could 
this be a problem in your Hive metastore?
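
If the statistics are simply missing or stale, one thing worth trying from
the Hive side (Hive SQL, run in beeline; the partition spec is taken from the
DDL below) is to recompute them:

ANALYZE TABLE logindetails_all PARTITION (login_date) COMPUTE STATISTICS;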

Has anyone else seen this before?  

-- Zelaine

On 1/20/17, 7:35 AM, "Andries Engelbrecht"  wrote:

What version of Hive are you using?


--Andries


From: Anup Tiwari 
Sent: Friday, January 20, 2017 3:00:43 AM
To: user@drill.apache.org; d...@drill.apache.org
Subject: Re: Storage Plugin for accessing Hive ORC Table from Drill

Hi,

Please find below Create Table Statement and subsequent Drill Error :-

*Table Structure :*

CREATE TABLE `logindetails_all`(
  `sid` char(40),
  `channel_id` tinyint,
  `c_t` bigint,
  `l_t` bigint)
PARTITIONED BY (
  `login_date` char(10))
CLUSTERED BY (
  channel_id)
INTO 9 BUCKETS
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://hostname1:9000/usr/hive/warehouse/logindetails_all'
TBLPROPERTIES (
  'compactorthreshold.hive.compactor.delta.num.threshold'='6',
  'compactorthreshold.hive.compactor.delta.pct.threshold'='0.5',
  'transactional'='true',
  'transient_lastDdlTime'='1484313383');

*Drill Error :*

*Query* : select * from hive.logindetails_all limit 1;

*Error :*
2017-01-20 16:21:12,625 [277e145e-c6bc-3372-01d0-6c5b75b92d73:foreman]
INFO  o.a.drill.exec.work.foreman.Foreman - Query text for query id
277e145e-c6bc-3372-01d0-6c5b75b92d73: select * from hive.logindetails_all
limit 1
2017-01-20 16:21:12,831 [277e145e-c6bc-3372-01d0-6c5b75b92d73:foreman]
ERROR o.a.drill.exec.work.foreman.Foreman - SYSTEM ERROR:
NumberFormatException: For input string: "004_"


[Error Id: 53fa92e1-477e-45d2-b6f7-6eab9ef1da35 on
prod-hadoop-101.bom-prod.aws.games24x7.com:31010]
org.apache.drill.common.exceptions.UserException: SYSTEM ERROR:
NumberFormatException: For input string: "004_"


[Error Id: 53fa92e1-477e-45d2-b6f7-6eab9ef1da35 on
prod-hadoop-101.bom-prod.aws.games24x7.com:31010]
at

org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:543)
~[drill-common-1.9.0.jar:1.9.0]
at

org.apache.drill.exec.work.foreman.Foreman$ForemanResult.close(Foreman.java:825)
[drill-java-exec-1.9.0.jar:1.9.0]
at
org.apache.drill.exec.work.foreman.Foreman.moveToState(Foreman.java:935)
[drill-java-exec-1.9.0.jar:1.9.0]
at org.apache.drill.exec.work.foreman.Foreman.run(Foreman.java:281)
[drill-java-exec-1.9.0.jar:1.9.0]
at

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[na:1.8.0_72]
at

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[na:1.8.0_72]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_72]
Caused by: org.apache.drill.exec.work.foreman.ForemanException: Unexpected
exception during fragment initialization: Internal error: Error while
applying rule DrillPushProjIntoScan, args

[rel#4220197:LogicalProject.NONE.ANY([]).[](input=rel#4220196:Subset#0.ENUMERABLE.ANY([]).[],sid=$0,channel_id=$1,c_t=$2,l_t=$3,login_date=$4),
rel#4220181:EnumerableTableScan.ENUMERABLE.ANY([]).[](table=[hive,
logindetails_all])]
... 4 common frames omitted
Caused by: java.lang.AssertionError: Internal error: Error while applying
rule DrillPushProjIntoScan, args

[rel#4220197:LogicalProject.NONE.ANY([]).[](input=rel#4220196:Subset#0.ENUMERABLE.ANY([]).[],sid=$0,channel_id=$1,c_t=$2,l_t=$3,login_date=$4),
rel#4220181:EnumerableTableScan.ENUMERABLE.ANY([]).[](table=[hive,
logindetails_all])]
at org.apache.calcite.util.Util.newInternal(Util.java:792)
~[calcite-core-1.4.0-drill-r19.jar:1.4.0-drill-r19]
at

org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:251)
~[calcite-core-1.4.0-drill-r19.jar:1.4.0-drill-r19]
at

org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:808)
~[calcite-core-1.4.0-drill-r19.jar:1.4.0-drill-r19]
at
org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:303)
~[calcite-core-1.4.0-drill-r19.jar:1.4.0-drill-r19]
at

org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.transform(DefaultSqlHandler.java:404)
~[drill-java-exec-1.9.0.jar:1.9.0]
at


Re: Cartesian Product in Apache Drill

2016-12-31 Thread Zelaine Fong
I think you meant to ask how likely it is that a nested loop join plan will
be chosen if planner.enable_nljoin_for_*scalar_only* is set to FALSE, and
the query does not contain a cartesian join?

If so, I believe it's unlikely due to how the optimizer computes the cost
of nested loop join plans.  But as I noted in my original reply, I don't
know if this has been extensively tested by other Drill users.

Your suggestion of a new option *planner.enable_nljoin_for_crossjoin *should
effectively be the same as setting planner.enable_nljoin_for_*scalar_only*
to false because the way the optimizer estimates costs, it should never
consider cross join orderings unless that's the only available join
ordering option.  For those cartesian joins, the only option would be a
nested loop join.

-- Zelaine

On Tue, Dec 27, 2016 at 10:26 AM, clhub...@gmail.com <clhub...@gmail.com>
wrote:

> Zelaine,
>
> Thanks for the clarity.  I can understand the performance concerns behind
> the nested loop.
>
> Can a nested loop be selected for "other" query plans when "
> planner.enable_nljoin_for_*scalar_only" *is set to true?
>
> I am trying to determine if I should turn it system wide, or ONLY turn it
> on when I need a Cartesian product result.
>
> Any ideas?  Suggestions?
>
> Outside of the Performance Concern..   There are two valid nested loop
> usecase.
> 1) Scalar subqueries
> 2) ANSI SQL Cross Joins
>
> Do you know if there are plans to incorporate this as a default setting?
>
>
> *planner.enable_nljoin_for_scalar_only *should probably be
> planner.enable_nljoin_for_*scalar *
>
> Similarly, there should be *planner.enable_nljoin_for_crossjoin* and the
> default setting should be "true"
>
> Do you agree?   Am I off base someplace?
>
> If there are no plans, and you agree with my assessment.   I will probably
> file this as a feature request
>
>
>
> Regards,
> CLN
>
>
>
>
>
>
>
>
>
>
>
> On Mon, Dec 26, 2016 at 7:58 PM, Zelaine Fong <zf...@maprtech.com> wrote:
>
>> I'm not sure how widely nested loop joins outside of scalar subqueries
>> have been exercised by Drill users, since that setting is not the default.
>> Note that nested loop joins can only be processed using broadcast joins
>> [1].  So you will incur a lot of network transfer overhead unless the
>> smaller of the tables you're joining is kept to a minimum.
>>
>> [1] https://drill.apache.org/docs/join-planning-guidelines/
>>
>> -- Zelaine
>>
>> On Mon, Dec 26, 2016 at 7:05 AM, clhub...@gmail.com <clhub...@gmail.com>
>> wrote:
>>
>>> Zelaine,
>>>
>>> I appreciate it...   That worked.
>>>
>>> I am thinking of turning on this feature system wide.
>>>
>>> Is there any foreseeable issue with using nested joins outside of scalar
>>> subqueries?  Performance or otherwise?
>>>
>>> Regards,
>>> CLN
>>>
>>>
>>> On Sun, Dec 25, 2016 at 7:22 PM, Zelaine Fong <zf...@maprtech.com>
>>> wrote:
>>>
>>>> Alternatively, you can set the following configuration to false:
>>>>
>>>> alter session set `planner.enable_nljoin_for_scalar_only` = false;
>>>>
>>>> Cartesian joins need to be processed as a nested loop join, and by
>>>> default, Drill only considers nested joins in the case where at least one
>>>> side of the join is a scalar subquery.
>>>>
>>>> -- Zelaine
>>>>
>>>> On Sun, Dec 25, 2016 at 2:46 PM, Ted Dunning <ted.dunn...@gmail.com>
>>>> wrote:
>>>>
>>>>> You can fake the limitation by adding a constant column to both
>>>>> tables, I
>>>>> think, and then joining on the constant.
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Dec 25, 2016 at 2:04 PM, clhub...@gmail.com <
>>>>> clhub...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> >
>>>>> > I am trying to do a cross join to get a cartesian products.
>>>>> >
>>>>> > Per the error message (attached) and the JIRA ticket I see it isn't
>>>>> > supported.
>>>>> > https://issues.apache.org/jira/browse/DRILL-3807
>>>>> >
>>>>> > I wrote the query against using dfs on csv file types.
>>>>> >
>>>>> > Can I execute a cross join in Apache Drill just by moving my data to
>>>>> a
>>>>> > different file type or Storage Plugin. Such as Parquet,JSON, or RDBMS
>>>>> > Plugin.
>>>>> >
>>>>> > Regards,
>>>>> > CLN
>>>>> >
>>>>> >
>>>>>
>>>>
>>>>
>>>
>>
>


Re: Cartesian Product in Apache Drill

2016-12-26 Thread Zelaine Fong
I'm not sure how widely nested loop joins outside of scalar subqueries have
been exercised by Drill users, since that setting is not the default.  Note
that nested loop joins can only be processed using broadcast joins [1].  So
you will incur a lot of network transfer overhead unless the smaller of the
tables you're joining is kept to a minimum.

[1] https://drill.apache.org/docs/join-planning-guidelines/
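
If you do decide to turn it on system wide, it may help to inspect the
related planner options first, then flip the switch (sys.options is Drill's
built-in options table):

select * from sys.options where name like '%nljoin%' or name like '%broadcast%';
alter system set `planner.enable_nljoin_for_scalar_only` = false;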

-- Zelaine

On Mon, Dec 26, 2016 at 7:05 AM, clhub...@gmail.com <clhub...@gmail.com>
wrote:

> Zelaine,
>
> I appreciate it...   That worked.
>
> I am thinking of turning on this feature system wide.
>
> Is there any foreseeable issue with using nested joins outside of scalar
> subqueries?  Performance or otherwise?
>
> Regards,
> CLN
>
>
> On Sun, Dec 25, 2016 at 7:22 PM, Zelaine Fong <zf...@maprtech.com> wrote:
>
>> Alternatively, you can set the following configuration to false:
>>
>> alter session set `planner.enable_nljoin_for_scalar_only` = false;
>>
>> Cartesian joins need to be processed as a nested loop join, and by
>> default, Drill only considers nested joins in the case where at least one
>> side of the join is a scalar subquery.
>>
>> -- Zelaine
>>
>> On Sun, Dec 25, 2016 at 2:46 PM, Ted Dunning <ted.dunn...@gmail.com>
>> wrote:
>>
>>> You can fake the limitation by adding a constant column to both tables, I
>>> think, and then joining on the constant.
>>>
>>>
>>>
>>> On Sun, Dec 25, 2016 at 2:04 PM, clhub...@gmail.com <clhub...@gmail.com>
>>> wrote:
>>>
>>> >
>>> > I am trying to do a cross join to get a cartesian products.
>>> >
>>> > Per the error message (attached) and the JIRA ticket I see it isn't
>>> > supported.
>>> > https://issues.apache.org/jira/browse/DRILL-3807
>>> >
>>> > I wrote the query against using dfs on csv file types.
>>> >
>>> > Can I execute a cross join in Apache Drill just by moving my data to a
>>> > different file type or Storage Plugin. Such as Parquet,JSON, or RDBMS
>>> > Plugin.
>>> >
>>> > Regards,
>>> > CLN
>>> >
>>> >
>>>
>>
>>
>


Re: Apache Drill Hangout Minutes - 11/1/16

2016-11-01 Thread Zelaine Fong
Oops, mistake in my notes.  For the second item, I meant DRILL-4280, not
DRILL-1950.

On Tue, Nov 1, 2016 at 3:40 PM, Zelaine Fong <zf...@maprtech.com> wrote:

> Attendees: Paul, Padma, Sorabh, Boaz, Sudheesh, Vitalii, Roman, Dave O,
> Arina, Laurent, Kunal, Zelaine
>
> I had to leave the hangout at 10:30, so my notes only cover the discussion
> up till then.
>
> 1) Variable width decimal support - Dave O
>
> Currently Drill only supports fixed width byte array storage of decimals.
> Dave has submitted a pull request for DRILL-4834 to add support for storing
> decimals with variable width byte arrays.  Eventually, variable width can
> replace fixed width, but the pull request doesn't cover that.  Dave would
> like someone in the community to review his pull request.
>
> 2) 1.9 release - Sudheesh
>
> Sudheesh is collecting pull requests for the release.  Some have been
> reviewed and are waiting to be merged.  Sudheesh plans to commit a batch
> this Wed and another this Friday.  He's targeting having a release
> candidate build available next Monday.
>
> Laurent asked about Sudheesh's pull request for DRILL-1950.  He asked
> whether thought had been given to supporting newer Drill clients with older
> Drill servers.  Sudheesh indicated that doing this would entail a breaking
> change in the protocol, and the plan was to defer doing this for a later
> release where we may want to make other breaking changes like this.
>


Drill Hangout Minutes - 10/18/16

2016-10-18 Thread Zelaine Fong
Attendees - Jinfeng, Karthik, Khurram, Sorabh, Kunal, Padma, Zelaine

Questions on behavior of Drill when querying on empty files/directories -
Khurram

   - Empty directory case currently returns table not found.  Consensus was
   that this is correct.  Because without any files, Drill doesn't know which
   reader to use, and therefore, can't do any processing.  Khurram will try
   this out and compare against the case of a directory with empty JSON/csv
   files, which should return no rows found.
   - Questions on recent UNION ALL handling of empty files.  Khurram will
   discuss offline with Aman.
   - Directory with a metadata file but no data  (due to user error in
   deleting data) -- Consensus was this should return an error since the
   metadata doesn't match with the non-existent data.


Drill Hangout Minutes - 10/4/16

2016-10-04 Thread Zelaine Fong
Attendees - Roman, Vitalii, Sorabh, Sudheesh, Kunal, Anil Kumar, Arina,
Padma, Laurent, Paul, Khurram, Gautam, Zelaine

1) Laurent - Client side changes to support metadata queries

Laurent indicated that the server side changes corresponding to these
client side changes were already previously committed by Venki.  They've
worked with Simba on the ODBC driver changes to take advantage of the new
APIs, and have tested against Tableau.  Laurent said he didn't have
performance numbers for these improvements.  Paul asked if there's a
writeup corresponding to improvements.  Laurent indicated that he felt the
information documented in the Jira and pull request should cover this.

2) Anil Kumar - Kafka plugin

Anil was looking for guidance on how to build this plugin.  He will start
work on this and hopes to have something in about a month.  After that, he
wants to work on a Cassandra plugin.  He'll probably start with the work
previously done by Yash (DRILL-92).


Drill Hangout Meeting Minutes - 9/20/16

2016-09-20 Thread Zelaine Fong
Attendees - Aman, Padma, Sorabh, Karthik, Parth, Boaz, Arina, Gautam,
Serhiy, Vitalii, Charles Givre, Zelaine

1) Features in progress for 1.9 release

   - Parquet filter pushdown (Jinfeng)
   - Number of distinct values stats (Gautam)
   - Parquet file reader performance improvements (Parth)
   - Dynamic UDFs (Arina)

2) DRILL-3423 - HTTPD format plugin

Charles asked about this.  He would like to see this committed into Drill.
He's made some additional changes on top of Jim Scott's pull request.
However, he hasn't addressed the issues that Jacques raised last Nov in
Jim's original pull request.  Aman will follow up with Jim to see if he's
doing any work on his pull request.

3) New MapR folks working on Drill

   - Boaz Ben-Zvi
   - Karthik Manivannan
   - Sorabh Hamirwasia


Drill Hangout Minutes - 9/6/16

2016-09-06 Thread Zelaine Fong
Attendees - Paul, Parth, Boaz, Padma, Vitalii, Roman, Serhiy, Argenis Leon,
Arina, Khurram, Jinfeng, Zelaine

1) Design document template - Parth is proposing we start using this moving
forward.  He'll send out a separate note to the dev mailing list.  Assuming
there are no objections, he'll put the doc in a shareable location on
Google drive and then work with Bridget to update the dev guidelines and
include a link to the shared doc.

2) Release cadence - Moving forward, we'd like to switch to a release
cadence of 3 months.  This seems to be the right balance between not
waiting too long (e.g., in the case of the 1.7 release, there were a number
of questions on why that release lagged so long after the 1.6 release), and
providing enough time for QA to sufficiently test in between releases.
Parth will again email the dev list for additional feedback.

3) Code reviews - There are a number of pending pull requests needing
review.  We discussed using Jira to track these, by setting the status to
"Reviewable" and changing the owner to the reviewer.


Re: Query hangs on planning

2016-09-01 Thread Zelaine Fong
One other thing ... have you tried tuning the planner.memory_limit
parameter?  Based on the earlier stack trace, you're hitting a memory limit
during query planning.  So, tuning this parameter should help that.  The
default is 256 MB.
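
For example, to double it for the session before retrying (the value is in
bytes):

ALTER SESSION SET `planner.memory_limit` = 536870912;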

-- Zelaine

On Thu, Sep 1, 2016 at 11:21 AM, rahul challapalli <
challapallira...@gmail.com> wrote:

> While planning we use heap memory. 2GB of heap should be sufficient for
> what you mentioned. This looks like a bug to me. Can you raise a jira for
> the same? And it would be super helpful if you can also attach the data set
> used.
>
> Rahul
>
> On Wed, Aug 31, 2016 at 9:14 AM, Oscar Morante 
> wrote:
>
> > Sure,
> > This is what I remember:
> >
> > * Failure
> >- embedded mode on my laptop
> >- drill memory: 2Gb/4Gb (heap/direct)
> >- cpu: 4cores (+hyperthreading)
> >- `planner.width.max_per_node=6`
> >
> > * Success
> >- AWS Cluster 2x c3.8xlarge
> >- drill memory: 16Gb/32Gb
> >- cpu: limited by kubernetes to 24cores
> >- `planner.width.max_per_node=23`
> >
> > I'm very busy right now to test again, but I'll try to provide better
> info
> > as soon as I can.
> >
> >
> >
> > On Wed, Aug 31, 2016 at 05:38:53PM +0530, Khurram Faraaz wrote:
> >
> >> Can you please share the number of cores on the setup where the query
> hung
> >> as compared to the number of cores on the setup where the query went
> >> through successfully.
> >> And details of memory from the two scenarios.
> >>
> >> Thanks,
> >> Khurram
> >>
> >> On Wed, Aug 31, 2016 at 4:50 PM, Oscar Morante 
> >> wrote:
> >>
> >> For the record, I think this was just bad memory configuration after
> all.
> >>> I retested on bigger machines and everything seems to be working fine.
> >>>
> >>>
> >>> On Tue, Aug 09, 2016 at 10:46:33PM +0530, Khurram Faraaz wrote:
> >>>
> >>> Oscar, can you please report a JIRA with the required steps to
> reproduce
>  the OOM error. That way someone from the Drill team will take a look
> and
>  investigate.
> 
>  For others interested here is the stack trace.
> 
>  2016-08-09 16:51:14,280 [285642de-ab37-de6e-a54c-
> 378aaa4ce50e:foreman]
>  ERROR o.a.drill.common.CatastrophicFailure - Catastrophic Failure
>  Occurred,
>  exiting. Information message: Unable to handle out of memory condition
>  in
>  Foreman.
>  java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOfRange(Arrays.java:2694)
>  ~[na:1.7.0_111]
> at java.lang.String.(String.java:203) ~[na:1.7.0_111]
> at java.lang.StringBuilder.toString(StringBuilder.java:405)
>  ~[na:1.7.0_111]
> at org.apache.calcite.util.Util.newInternal(Util.java:785)
>  ~[calcite-core-1.4.0-drill-r16-PATCHED.jar:1.4.0-drill-r16-PATCHED]
> at
>  org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(
>  VolcanoRuleCall.java:251)
>  ~[calcite-core-1.4.0-drill-r16-PATCHED.jar:1.4.0-drill-r16-PATCHED]
> at
>  org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(
>  VolcanoPlanner.java:808)
>  ~[calcite-core-1.4.0-drill-r16-PATCHED.jar:1.4.0-drill-r16-PATCHED]
> at
>  org.apache.calcite.tools.Programs$RuleSetProgram.run(
> Programs.java:303)
>  ~[calcite-core-1.4.0-drill-r16-PATCHED.jar:1.4.0-drill-r16-PATCHED]
> at
>  org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler
>  .transform(DefaultSqlHandler.java:404)
>  ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
> at
>  org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler
>  .transform(DefaultSqlHandler.java:343)
>  ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
> at
>  org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler
>  .convertToDrel(DefaultSqlHandler.java:240)
>  ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
> at
>  org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler
>  .convertToDrel(DefaultSqlHandler.java:290)
>  ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
> at
>  org.apache.drill.exec.planner.sql.handlers.ExplainHandler.ge
>  tPlan(ExplainHandler.java:61)
>  ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
> at
>  org.apache.drill.exec.planner.sql.DrillSqlWorker.getPlan(Dri
>  llSqlWorker.java:94)
>  ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
> at
>  org.apache.drill.exec.work.foreman.Foreman.runSQL(Foreman.java:978)
>  ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
> at org.apache.drill.exec.work.foreman.Foreman.run(Foreman.
> java:
>  257)
>  ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
> at
>  java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPool
>  Executor.java:1145)
>  [na:1.7.0_111]
> at
>  

Re: show tables taking long time on hive plugin

2016-08-18 Thread Zelaine Fong
Have you tried tweaking Drill's Hive metastore cache parameters?  See
https://drill.apache.org/docs/hive-metadata-caching/.

-- Zelaine

On Wed, Aug 17, 2016 at 10:02 PM, Sungwook Yoon  wrote:

> Hi,
>
> Drill 1.6.0
> MapR 3.1.1
> Hive 0.13
> HS2 SQL Authorization
> Hive Meta storage authorization
>
> Hive Default database has 1400 tables.
>
> I see significant slow down in "show tables" when Hive database is bigger
> than 500 tables.
> It takes 3 minutes to get the resultset.
> There is no obvious logs on drillbit or zookeeper to help diagnose the
> situation.
>
> Beeline is snappy, returns 1400 tables in a second or so.
>
> Why Drill "show tables" on Hive plugin takes this long?
>
> Thanks,
>
> Sungwook
>


Re: will hbase storage engine scan twice?

2016-08-09 Thread Zelaine Fong
It looks like you're encountering DRILL-4771.  Note that the problem is not
specific to querying from HBase.
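
For reference, the rewrite Qiang describes below, applied to the original
query, would look roughly like this (a sketch, not a tested replacement):

SELECT sum(pv) AS pv, count(*) AS uv, sum(v) AS v
FROM (
  SELECT BYTE_SUBSTR(t.row_key, -8, 8) AS uid,
         sum(convert_from(t.`v`.`c`, 'INT_BE')) AS pv,
         sum(convert_from(t.`v`.`s`, 'INT_BE')) AS v
  FROM hbase.`action_combine` t
  WHERE t.row_key > '020160727pay.search'
    AND t.row_key < '020160727pay.searci'
  GROUP BY BYTE_SUBSTR(t.row_key, -8, 8)
) agg;

Grouping by the user id first needs only one scan; count(*) over the groups is
the distinct-user count, and summing the per-user sums preserves the totals.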

-- Zelaine

On Tue, Aug 9, 2016 at 3:09 AM, qiang li  wrote:

> I did some research and found that the logical plan is not good if the query
> contains a count(distinct) clause.
>
> I have to change my sql to select count(*) from (select aa, sum() from tt
> group by aa) t to avoid this issue.
>
> 2016-08-09 14:24 GMT+08:00 qiang li :
>
> > Hi
> >
> > We use drill to query hbase.
> >
> > Recently I check one of our query's plan:
> > 0: jdbc:drill:drillbit=rfdc2> explain plan for SELECT '2016-07-27 00:00'
> > as key, sum(convert_from(action.`v`.`c`,'INT_BE')) AS pv,COUNT(DISTINCT
> BYTE_SUBSTR(action.row_key,-8,8)) AS uv, SUM(convert_from(action.`v`.`s`,'INT_BE'))
> > AS v from hbase.`action_combine` as action  where  action.row_key
> > >'020160727pay.search' and action.row_key < '020160727pay.searci' ;
> >
> > the result is :
> > +--+--+
> > | text | json |
> > +--+--+
> > | 00-00Screen
> > 00-01  Project(key=[$0], pv=[$1], uv=[$2], v=[$3])
> > 00-02Project(key=['2016-07-27 00:00'], pv=[$0], uv=[$2], v=[$1])
> > 00-03  NestedLoopJoin(condition=[true], joinType=[inner])
> > 00-05StreamAgg(group=[{}], pv=[SUM($0)], v=[SUM($1)])
> > 00-07  StreamAgg(group=[{}], pv=[SUM($0)], v=[SUM($2)])
> > 00-08Project($f0=[CONVERT_FROMINT_BE(ITEM($1, 'c'))],
> > $f1=[BYTE_SUBSTR($0, -8, 8)], $f2=[CONVERT_FROMINT_BE(ITEM($1, 's'))])
> > 00-09  Scan(groupscan=[HBaseGroupScan
> > [HBaseScanSpec=HBaseScanSpec [tableName=action_combine,
> > startRow=020160727pay.search\x00, stopRow=020160727pay.searci,
> > filter=null], columns=[`*`]]])
> > 00-04StreamAgg(group=[{}], uv=[$SUM0($0)])
> > 00-06  UnionExchange
> > 01-01StreamAgg(group=[{}], uv=[COUNT($0)])
> > 01-02  HashAgg(group=[{0}])
> > 01-03Project($f1=[$0])
> > 01-04  HashToRandomExchange(dist0=[[$0]])
> > 02-01UnorderedMuxExchange
> > 03-01  Project($f1=[$0],
> > E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0)])
> > 03-02HashAgg(group=[{0}])
> > 03-03  Project($f1=[BYTE_SUBSTR($0, -8, 8)])
> > 03-04Scan(groupscan=[HBaseGroupScan
> > [HBaseScanSpec=HBaseScanSpec [tableName=action_combine,
> > startRow=020160727pay.search\x00, stopRow=020160727pay.searci,
> > filter=null], columns=[`*`]]])
> >
> > My question is: as there are two scans in the plan, does this mean the
> > query will execute twice?
> >
> >
>


Fwd: Suggestions for hangout topics for 08/09

2016-08-08 Thread Zelaine Fong
Jason -- will you be able to join tomorrow's hangout, since you had raised
questions about Subbu's pull request?

-- Zelaine

On Mon, Aug 8, 2016 at 11:33 AM, Gautam Parai  wrote:

> Tomorrow's hangout is scheduled for 10AM - 11AM PST
>
> On Mon, Aug 8, 2016 at 11:30 AM, Subbu Srinivasan  >
> wrote:
>
> > What time is tomorrow's mtg scheduled for?
> >
> >
> > On Mon, Aug 8, 2016 at 10:48 AM, Gautam Parai 
> wrote:
> >
> > > If you have any suggestions for Drill hangout topics for tomorrow,  you
> > can
> > > add it to this thread.  We will also ask around at the beginning of the
> > > hangout for any topics.  We will try to cover whatever possible during
> > the
> > > 1 hr.
> > >
> > > Topics:
> > >   1.  DRILL-4653:  Malformed JSON should not stop the entire query from
> > > progressing.
> > >Discussion about the PR.
> > >
> >
>


Minutes from 7/26/16 Drill Hangout

2016-07-26 Thread Zelaine Fong
Attendees: Arina, Roman, Parth, Padma, Khurram, Gautam, Vitalii

This was a short meeting.  The only topic we covered was Arina's latest
email on Dynamic UDF support.  Arina wanted to know if anyone had
additional feedback.  Parth indicated he was ok with her revised proposal.
Arina will proceed with updating her design doc.


Drill Hangout Starting Now ...

2016-07-26 Thread Zelaine Fong
https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc


Re: Pushdown Capabilities with RDBMS

2016-07-16 Thread Zelaine Fong
Note though that the decision on whether to push down the join is dependent
on the cost of doing so relative to other options.  I believe there are
currently limitations in the costing model of the JDBC storage plugin that
prevent the optimizer from choosing the "correct" plan in all cases.

But as Sudheesh noted, it would be good to capture your scenario in a Jira
ticket.

-- Zelaine
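
In the meantime, a quick way to see what is actually being pushed down is
to inspect the plan.  This is a hedged sketch with placeholder Oracle
schema/table names; if the join is fully pushed down, the plan should show
something like a single Jdbc(sql=[...]) operator containing the generated
join SQL, rather than two separate scans under a Drill join:

explain plan for
select e.*, d.*
from oracle.SCOTT.EMP e
inner join oracle.SCOTT.DEPT d on e.DEPTNO = d.DEPTNO;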

On Fri, Jul 15, 2016 at 5:26 PM, Sudheesh Katkam 
wrote:

> Hi Marcus,
>
> I am glad that you are exploring Drill! Per RDBMS storage plugin
> documentation [1], join pushdown is supported. So the scenario you
> described is likely a bug; can you open a ticket [2] with the details on
> how to reproduce the issue?
>
> Thank you,
> Sudheesh
>
> [1] https://drill.apache.org/docs/rdbms-storage-plugin/ <
> https://drill.apache.org/docs/rdbms-storage-plugin/>
> [2] https://issues.apache.org/jira/browse/DRILL <
> https://issues.apache.org/jira/browse/DRILL>
>
> > On Jul 15, 2016, at 1:48 PM, Marcus Rehm  wrote:
> >
> > Hi all,
> >
> > I started to test Drill and I'm very excited about the possibilities.
> >
> > For now I'm trying to map our databases running on Oracle 11g. After
> > trying some queries I realized that Drill takes longer to complete than
> > a general SQL client does. Looking at the execution plan I saw (or
> > understood) that Drill is doing the join of the tables itself and is not
> > pushing it down to the database.
> >
> > Is there any configuration required for this? How can I tell Drill to
> > send the task of doing the join to Oracle?
> >
> > Thanks in Advance.
> >
> > Best regards,
> > Marcus Rehm
>
>


Re: Looking for workaround to Schema detection problems

2016-07-08 Thread Zelaine Fong
The issues you're seeing once GROUP BY is added sound like DRILL-4503.

-- Zelaine
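
Putting together the two suggestions that come up in the thread below
(all-text mode plus explicit casts), a minimal sketch of the workaround
looks like this; the VARCHAR length is an arbitrary choice, so treat it as
untested:

ALTER SESSION SET `store.json.all_text_mode` = true;

SELECT CAST(b AS VARCHAR(100)) AS b
FROM dfs.`D:\MyData\test.json`
WHERE b IS NOT NULL;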

On Fri, Jul 8, 2016 at 9:11 AM, Holy Alexander <
alexander.h...@derstandard.at> wrote:

> Solved!
>
> Actually ALTER SESSION SET `store.json.all_text_mode` = true; improves
> things a bit, but only for a simple select.
> As soon as I added GROUP BY and a second column things went haywire again.
>
> CASTing everything everywhere did solve the problem.
>
> It seems to me that CASTing everything is a best practice as soon as you
> have optional/NULLable columns
>
> Best regards & thanks to everyone!
> Alexander
>
> -Original Message-
> From: Zelaine Fong [mailto:zf...@maprtech.com]
> Sent: 08 July 2016 17:42
> To: user@drill.apache.org
> Subject: Re: Looking for workaround to Schema detection problems
>
> Note that the 1.7 release now has the fix for DRILL-4479.  So, that may
> explain why in the past the setting didn't help.
>
> -- Zelaine
>
> On Fri, Jul 8, 2016 at 8:38 AM, rahul challapalli <
> challapallira...@gmail.com> wrote:
>
> > In the past setting the below parameter still did not fix the issue.
> > But still worth a try
> >
> > ALTER SESSION SET `store.json.all_text_mode` = true;
> >
> > You might also want to try explicit casting to varchar for this
> > specific column.
> >
> > On Fri, Jul 8, 2016 at 8:14 AM, Zelaine Fong <zf...@maprtech.com> wrote:
> >
> > > Have you tried using
> > >
> > > ALTER SESSION SET `store.json.all_text_mode` = true;
> > >
> > > -- Zelaine
> > >
> > > On Fri, Jul 8, 2016 at 6:37 AM, Holy Alexander <
> > > alexander.h...@derstandard.at> wrote:
> > >
> > > > Hi Vitalii!
> > > >
> > > >
> > > > This is what I tried:
> > > >
> > > > Altered the setting system-wide:
> > > >
> > > > ALTER SYSTEM SET `exec.enable_union_type` = true
> > > >
> > > > Verified that the setting is really altered
> > > >
> > > > SELECT *
> > > > FROM sys.options
> > > > WHERE type in ('SYSTEM','SESSION') order by name
> > > >
> > > > And re-run the query
> > > >
> > > > Unfortunately this does not solve the problem.
> > > > It just causes a different error:
> > > >
> > > > [30027]Query execution error. Details:[ SYSTEM ERROR:
> > > > NullPointerException Fragment 0:0 [Error Id:
> > > > 0f9cb7ae-d2d5-474c-ad57-2d558041e2c6 on
> > > >
> > > > (I tried this on Drill 1.7 and 1.6)
> > > >
> > > > Best regards,
> > > > Alexander
> > > >
> > > >
> > > > -Original Message-
> > > > From: Vitalii Diravka [mailto:vitalii.dira...@gmail.com]
> > > > Sent: 08 July 2016 13:30
> > > > To: user@drill.apache.org
> > > > Subject: Re: Looking for workaround to Schema detection problems
> > > >
> > > > Hi Alexander,
> > > >
> > > > Please try with turning on the union type:
> > > >
> > > > ALTER SESSION SET `exec.enable_union_type` = true;
> > > >
> > > > Kind regards
> > > > Vitalii
> > > >
> > > > 2016-07-08 10:50 GMT+00:00 Holy Alexander <
> > alexander.h...@derstandard.at
> > > >:
> > > >
> > > > > My JSON data looks - simplified - like this
> > > > >
> > > > > {"ID":1,"a":"some text"}
> > > > > {"ID":2,"a":"some text","b":"some other text"} {"ID":3,"a":"some
> > > > > text"}
> > > > >
> > > > > Column b is only physically serialized when it is not null.
> > > > > It is the equivalent of a NULLable VARCHAR() column in SQL.
> > > > >
> > > > > I run queries like these:
> > > > >
> > > > > SELECT b
> > > > > FROM dfs.`D:\MyData\test.json`
> > > > > WHERE b IS NOT NULL
> > > > >
> > > > > And normally all is fine.
> > > > > However, among my thousands of data files, I have two files
> > > > > where the first occurrence of b happens a few thousand records
> down the file.
> > > > > These two data files would look like this:
> > > > >
> > > > > {"ID":1,"a&qu

Re: Looking for workaround to Schema detection problems

2016-07-08 Thread Zelaine Fong
Note that the 1.7 release now has the fix for DRILL-4479.  So, that may
explain why in the past the setting didn't help.

-- Zelaine

On Fri, Jul 8, 2016 at 8:38 AM, rahul challapalli <
challapallira...@gmail.com> wrote:

> In the past setting the below parameter still did not fix the issue. But
> still worth a try
>
> ALTER SESSION SET `store.json.all_text_mode` = true;
>
> You might also want to try explicit casting to varchar for this specific
> column.
>
> On Fri, Jul 8, 2016 at 8:14 AM, Zelaine Fong <zf...@maprtech.com> wrote:
>
> > Have you tried using
> >
> > ALTER SESSION SET `store.json.all_text_mode` = true;
> >
> > -- Zelaine
> >
> > On Fri, Jul 8, 2016 at 6:37 AM, Holy Alexander <
> > alexander.h...@derstandard.at> wrote:
> >
> > > Hi Vitalii!
> > >
> > >
> > > This is what I tried:
> > >
> > > Altered the setting system-wide:
> > >
> > > ALTER SYSTEM SET `exec.enable_union_type` = true
> > >
> > > Verified that the setting is really altered
> > >
> > > SELECT *
> > > FROM sys.options
> > > WHERE type in ('SYSTEM','SESSION') order by name
> > >
> > > And re-run the query
> > >
> > > Unfortunately this does not solve the problem.
> > > It just causes a different error:
> > >
> > > [30027]Query execution error. Details:[
> > > SYSTEM ERROR: NullPointerException
> > > Fragment 0:0
> > > [Error Id: 0f9cb7ae-d2d5-474c-ad57-2d558041e2c6 on
> > >
> > > (I tried this on Drill 1.7 and 1.6)
> > >
> > > Best regards,
> > > Alexander
> > >
> > >
> > > -Original Message-
> > > From: Vitalii Diravka [mailto:vitalii.dira...@gmail.com]
> > > Sent: 08 July 2016 13:30
> > > To: user@drill.apache.org
> > > Subject: Re: Looking for workaround to Schema detection problems
> > >
> > > Hi Alexander,
> > >
> > > Please try with turning on the union type:
> > >
> > > ALTER SESSION SET `exec.enable_union_type` = true;
> > >
> > > Kind regards
> > > Vitalii
> > >
> > > 2016-07-08 10:50 GMT+00:00 Holy Alexander <
> alexander.h...@derstandard.at
> > >:
> > >
> > > > My JSON data looks - simplified - like this
> > > >
> > > > {"ID":1,"a":"some text"}
> > > > {"ID":2,"a":"some text","b":"some other text"} {"ID":3,"a":"some
> > > > text"}
> > > >
> > > > Column b is only physically serialized when it is not null.
> > > > It is the equivalent of a NULLable VARCHAR() column in SQL.
> > > >
> > > > I run queries like these:
> > > >
> > > > SELECT b
> > > > FROM dfs.`D:\MyData\test.json`
> > > > WHERE b IS NOT NULL
> > > >
> > > > And normally all is fine.
> > > > However, among my thousands of data files, I have two files where the
> > > > first occurrence of b happens a few thousand records down the file.
> > > > These two data files would look like this:
> > > >
> > > > {"ID":1,"a":"some text"}
> > > > {"ID":2,"a":"some text"}
> > > > ... 5000 more records without column b ...
> > > > {"ID":5002,"a":"some text","b":"some other text"}
> {"ID":5003,"a":"some
> > > > text"}
> > > >
> > > > In this case, my simple SQL query above fails:
> > > >
> > > > [30027]Query execution error. Details:[ DATA_READ ERROR: Error
> parsing
> > > > JSON - You tried to write a VarChar type when you are using a
> > > > ValueWriter of type NullableIntWriterImpl.
> > > > File  /D:/MyData/test.json
> > > > Record 5002 Fragment ...
> > > >
> > > > It seems that the Schema inference mechanism of Drill only samples a
> > > > certain amount of bytes (or records) to determine the schema.
> > > > If the first occurrence of a schema detail happens to far down things
> > > > go boom.
> > > >
> > > > I am now looking for a sane way to work around this.
> > > > Preferred by extending the query and not by altering my massive
> > > > amounts of data.
> > > >
> > > > BTW, I tried altering the data by chaning the first line:
> > > > {"ID":1,"a":"some text","b":null}
> > > > does not help.
> > > >
> > > > Of course, changing the first line to
> > > > {"ID":1,"a":"some text","b":""}
> > > > solves the problem, but this is not a practical solution.
> > > >
> > > > Any help appreciated.
> > > > Alexander
> > > >
> > >
> >
>


Re: [ANNOUNCE] Apache Drill 1.7.0 released

2016-06-29 Thread Zelaine Fong
Yup, two releases in a row that passed on the first vote!

-- Zelaine

On Wed, Jun 29, 2016 at 9:20 AM, Parth Chandra  wrote:

> Nice work team!
> And thanks Aman for managing the release so smoothly .
>
>
>
> On Tue, Jun 28, 2016 at 9:14 PM, Aman Sinha  wrote:
>
> > On behalf of the Apache Drill community, I am happy to announce the
> > release of Apache Drill 1.7.0.
> >
> > The source and binary artifacts are available at [1]
> > Review a complete list of fixes and enhancements at [2]
> >
> > This release of Drill fixes many issues and introduces a number of
> > enhancements, including JMX enablement for monitoring, support for Hive
> > CHAR type and HBase 1.x support.
> >
> > Thanks to everyone in the community who contributed to this release.
> >
> > [1] https://drill.apache.org/download/
> > [2] https://drill.apache.org/docs/apache-drill-1-7-0-release-notes/
> >
> >
> > -Aman
> >
>


Drill Hangout Meeting Minutes - 6/28/16

2016-06-28 Thread Zelaine Fong
Attendees: Aman, Jinfeng, Sudheesh, Parth, Arina, Paul, John O, Padma,
Gautam, Khurram, Zelaine

1) JDBC Storage Plugin Issues

We discussed DRILL-4696.  Parth and Sudheesh suggested that even with the
change suggested in DRILL-4177, this particular problem may still result in
out of memory because the MySQL fetch size needs to be set at the statement
level instead of in the connect string.  Parth added a comment with this
suggestion in DRILL-4177.

Aside from the memory problem, the fact that the 4-way join isn't being
pushed completely to MySQL may be a limitation in the JDBC storage plugin.
The plugin doesn't seem to be taking into consideration the actual
underlying row count of the table.  Aman suggested possibly tweaking one of
the optimizer parameters as a short-term workaround to force all plans to
be fully pushed down.

2) 1.7 has been released in open source Apache!


Re: Issue with join query having same storage plugin name and same datatype

2016-06-21 Thread Zelaine Fong
Sanjiv,

This is very likely a bug in the JDBC storage plugin.  I would suggest you
open a Jira with the information necessary to reproduce the problem.

-- Zelaine
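
Until the root cause is found, one common workaround for same-named
columns across JDBC tables is to rename the conflicting column inside a
derived table before the join.  This is an untested sketch based on the
query below:

SELECT T2.ID AS T_ID, T1.T1_ID, T1.ProviderID, T1.Name, T2.Code
FROM (SELECT ID AS T1_ID, ProviderID, Name
      FROM SqlServer.StuentDetails.P21.Class1) AS T1
RIGHT OUTER JOIN SqlServer.StuentDetails.P21.Class2 AS T2
  ON T1.ProviderID = T2.ID;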

On Tue, Jun 21, 2016 at 12:56 AM, Sanjiv Kumar  wrote:

> I want to join two tables using the same storage plugin, but one of the
> columns is showing a null value. I am using Drill on a Windows 8 system
> with the latest version of Drill (i.e., 1.6).
>
> I am using this query:-
>
> SELECT T2.ID AS T_ID,T1.ID AS T1_ID ,T1.ProviderID AS ProviderID,
>
>  T1.Name AS Name, T2.Code AS Code
>  FROM SqlServer.StuentDetails.P21.Class1 AS T1
>  right outer JOIN SqlServer.StuentDetails.P21.Class2
>  AS T2 ON T1.ProviderID = T2.ID
>
> Here SqlServer is Storage Plugin Name,
>
>  StuentDetails is Database Name,
>
> P21 is a Schema Name, Class1,
>
>  Class2 are Table Names.
>
> While executing this query, T_ID shows null.
>
> But if I use two different storage plugin names with the same
> credentials, it works properly.
>
> Note: this is not only in SQL Server; I tried MySQL and Oracle DB as
> well, and the same problem occurred. I think the problem is with using
> the same storage plugin name with the same datatype, because when I gave
> the two tables different datatypes, the output was fine.
>
> *Is it possible to join two tables using the same storage plugin name?
> If yes, then what am I doing wrong in this query?*
>
>  ..
>   Thanks & Regards
>   *Sanjiv Kumar*
>


Re: Partition reading problem (like operator) while using hive partition table in drill

2016-06-01 Thread Zelaine Fong
Shankar,

Work on this issue has not yet started.  Hopefully, the engineer assigned
to the issue will be able to take a look in a week or so.

-- Zelaine
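
Until DRILL-4665 is fixed, one untested idea is to separate the partition
predicate from the LIKE predicate with a derived table, so the pruning
filter stands on its own.  The planner may well merge the filters back
together, so verify the partition list with EXPLAIN PLAN FOR before
relying on this sketch:

select sessionid, servertime, ajaxUrl
from (select sessionid, servertime, ajaxUrl
      from hive.hive_kafkalogs_daily
      where log_date = '2016-05-09') t
where ajaxUrl like '%utm_source%'
limit 1;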

On Tue, May 31, 2016 at 10:33 PM, Shankar Mane 
wrote:

> I didn't get any response or updates on this JIRA ticket (DRILL-4665).
>
> Is anyone looking into this?
> On 11 May 2016 03:31, "Aman Sinha"  wrote:
>
> > The Drill test team was able to repro this and is now filed as:
> > https://issues.apache.org/jira/browse/DRILL-4665
> >
> > On Tue, May 10, 2016 at 8:16 AM, Aman Sinha 
> wrote:
> >
> > > This is supposed to work, especially since LIKE predicate is not even
> on
> > > the partitioning column (it should work either way).  I did a quick
> test
> > > with file system tables and it works for LIKE conditions.  Not sure yet
> > > about Hive tables.  Could you pls file a JIRA and we'll follow up.
> > > Thanks.
> > >
> > > -Aman
> > >
> > > On Tue, May 10, 2016 at 1:09 AM, Shankar Mane <
> > shankar.m...@games24x7.com>
> > > wrote:
> > >
> > >> Problem:
> > >>
> > >> 1. In Drill, we are using a Hive partitioned table, but the explain
> > >> plan (for the same query) differs between the like and = operators,
> > >> and all partitions are used in the case of the like operator.
> > >> 2. As the Drill explain plans below show: the like operator uses
> > >> *all* partitions, whereas the = operator uses *only* the partition
> > >> filtered by the log_date condition.
> > >>
> > >> FYI - we are storing our logs in a Hive partitioned table (parquet,
> > >> gz-compressed). Each partition has ~15 GB of data. Below is the
> > >> describe statement output from Hive:
> > >>
> > >>
> > >> /***** Hive *****/
> > >> hive> desc hive_kafkalogs_daily ;
> > >> OK
> > >> col_name data_type comment
> > >> sessionid   string
> > >> ajaxurl string
> > >>
> > >> log_date string
> > >>
> > >> # Partition Information
> > >> # col_name data_type   comment
> > >>
> > >> log_date string
> > >>
> > >>
> > >>
> > >>
> > >> /***** Drill Plan (query with LIKE) *****/
> > >>
> > >> explain plan for select sessionid, servertime, ajaxUrl from
> > >> hive.hive_kafkalogs_daily where log_date = '2016-05-09' and ajaxUrl
> like
> > >> '%utm_source%' limit 1 ;
> > >>
> > >> +--+--+
> > >> | text | json |
> > >> +--+--+
> > >> | 00-00Screen
> > >> 00-01  Project(sessionid=[$0], servertime=[$1], ajaxUrl=[$2])
> > >> 00-02SelectionVectorRemover
> > >> 00-03  Limit(fetch=[1])
> > >> 00-04UnionExchange
> > >> 01-01  SelectionVectorRemover
> > >> 01-02Limit(fetch=[1])
> > >> 01-03  Project(sessionid=[$0], servertime=[$1],
> > >> ajaxUrl=[$2])
> > >> 01-04SelectionVectorRemover
> > >> 01-05  Filter(condition=[AND(=($3, '2016-05-09'),
> > >> LIKE($2, '%utm_source%'))])
> > >> 01-06Scan(groupscan=[HiveScan
> > >> [table=Table(dbName:default, tableName:hive_kafkalogs_daily),
> > >> columns=[`sessionid`, `servertime`, `ajaxurl`, `log_date`],
> > >> numPartitions=29, partitions= [Partition(values:[2016-04-11]),
> > >> Partition(values:[2016-04-12]), Partition(values:[2016-04-13]),
> > >> Partition(values:[2016-04-14]), Partition(values:[2016-04-15]),
> > >> Partition(values:[2016-04-16]), Partition(values:[2016-04-17]),
> > >> Partition(values:[2016-04-18]), Partition(values:[2016-04-19]),
> > >> Partition(values:[2016-04-20]), Partition(values:[2016-04-21]),
> > >> Partition(values:[2016-04-22]), Partition(values:[2016-04-23]),
> > >> Partition(values:[2016-04-24]), Partition(values:[2016-04-25]),
> > >> Partition(values:[2016-04-26]), Partition(values:[2016-04-27]),
> > >> Partition(values:[2016-04-28]), Partition(values:[2016-04-29]),
> > >> Partition(values:[2016-04-30]), Partition(values:[2016-05-01]),
> > >> Partition(values:[2016-05-02]), Partition(values:[2016-05-03]),
> > >> Partition(values:[2016-05-04]), Partition(values:[2016-05-05]),
> > >> Partition(values:[2016-05-06]), Partition(values:[2016-05-07]),
> > >> Partition(values:[2016-05-08]), Partition(values:[2016-05-09])],
> > >>
> > >>
> >
> inputDirectories=[hdfs://namenode:9000/usr/hive/warehouse/hive_kafkalogs_daily_20160411,
> > >> hdfs://namenode:9000/usr/hive/warehouse/hive_kafkalogs_daily_20160412,
> > >> hdfs://namenode:9000/usr/hive/warehouse/hive_kafkalogs_daily_20160413,
> > >> hdfs://namenode:9000/usr/hive/warehouse/hive_kafkalogs_daily_20160414,
> > >> hdfs://namenode:9000/usr/hive/warehouse/hive_kafkalogs_daily_20160415,
> > >> 

Minutes from 5/31/16 Drill Hangout

2016-05-31 Thread Zelaine Fong
Attendees: Arina, John O, Subbu, Vitalii, Hakim, Parth, Aman, Paul,
Jinfeng, Aditya, Zelaine

1) John noted that he's hitting a lot of problems with Drill scripts.  Paul
indicated that he's fixing a lot of these issues as part of his work to
integrate Drill with YARN.  John said he's writing up a document outlining
his findings that he will share with the community in a few days.

2) John suggested adding a new description field to sys.options as a way of
documenting the different system configuration options, as he's been
struggling to make sense of some of the options.  He's logged DRILL-4699
for this enhancement.  During the discussion, it was also noted that Drill
isn't being consistent in the naming convention of some of the current
options.  From a backward compatibility standpoint, it may not be possible
to rename existing options, but for the future, we should publish some
guidelines to ensure future consistency.

3) Aditya has a pull request for DRILL-4199 to upgrade to HBase 1.1.3.
Jacques has reviewed the change, but Aditya would like another pair of eyes
on the change.  Aditya will reach out to Steven to see if he can take a
look at the change.  He'll also reach out to QA to ensure this is on the QA
radar.

4) Arina is making updates to DRILL-4571, based on feedback from Krystal.
Since DRILL-4571 is already marked resolved, she asked whether she should
open a new Jira or reuse the existing one.  The suggestion was to open a
new one and link the new one back to DRILL-4571.

5) Vitalii is working on DRILL-3510 to make double quotes an alternative to
backtick.  Due to audio difficulties, we couldn't discern his specific
question.  So, it was suggested that he post his question on the dev list.

6) John is encountering a problem where garbage collection is putting his
cluster into a bad state.  He asked whether he should open a ticket with
MapR support or continue to seek out help from the community.  It was
suggested that he do both.


Re: [ANNOUNCE] New PMC Chair of Apache Drill

2016-05-25 Thread Zelaine Fong
Congratulations, Parth.  Looking forward to working with you in your new
role :).

-- Zelaine

On Wed, May 25, 2016 at 9:02 AM, Jinfeng Ni  wrote:

> Big congratulations, Parth!
>
> Thank you, Jacques, for your contribution and leadership over the last
> few years!
>
>
> On Wed, May 25, 2016 at 8:35 AM, Jacques Nadeau 
> wrote:
> > I'm pleased to announce that the Drill PMC has voted to elect Parth
> Chandra
> > as the new PMC chair of Apache Drill. Please join me in congratulating
> > Parth!
> >
> > thanks,
> > Jacques
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
>


Re: Regarding Apache Drill 1.7 Version Release Date

2016-05-25 Thread Zelaine Fong
Same response as the one I provided about a week ago :).  But it may have
gotten lost as that email thread was asking other questions as well.

We're in the process of re-evaluating the release cadence of Drill, and
will no longer be doing releases on a monthly cadence.  At this time, we do
not have a firm date for the next release, but it will likely be in the
next couple of months.

-- Zelaine

On Wed, May 25, 2016 at 4:48 AM, John Omernik  wrote:

> I am also interested in this. I know there was some talk about this and
> then some of that work was focused on 2.0. Anyone care to talk about 1.7
> vs 2.0, and what these various releases could mean to the community?
>
> Thanks!
>
> John
>
> On Wed, May 25, 2016 at 12:44 AM, Sanjiv Kumar 
> wrote:
>
> > Hello
> >    Would you please tell me when Apache Drill version 1.7 is going to
> > be released?
> >
> >
> >  ..
> >   Thanks & Regards
> >Sanjiv Kumar
> >
>


Re: "user" as a reserved word

2016-05-24 Thread Zelaine Fong
I agree that this is a bug, as you've already noted via DRILL-4692.
Putting backticks around USER should treat it as an identifier, not a
reserved word or special function.  Based on Jinfeng's findings, it looks
like Drill puts special function names in the same namespace as
identifiers, resulting in the current behavior.

-- Zelaine
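
Until this is resolved, the table-alias workaround that Jinfeng
demonstrates below is the practical fix.  In Drill syntax it looks roughly
like this (the path is a placeholder):

SELECT t.`user` FROM dfs.`/logs/weblogs.json` t LIMIT 10;

With the alias, t.`user` is parsed as a plain column reference rather than
the implicit CURRENT_USER function call.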

On Tue, May 24, 2016 at 5:14 AM, John Omernik <j...@omernik.com> wrote:

> I think if there is any chance that one of the interpretations could be to
> allow `user` (backtick user backtick), we should do that... Drill uses
> backticks as its quoted identifier (I gather Postgres uses double quotes as
> its identifier). Having user be the column name will save many
> organizations from errorless "wrong" results... I just worry about the
> downstream there... I'd probably feel less strongly about this if A. the
> word was less likely to be a column name (like current_drill_user vs user)
> and B. I wasn't a user with a background in security who sees all the data
> sources that use "user" as a column.   I just have this sinking feeling it
> will lead to bad things in data analysis for many users.
>
> I guess it comes down to the principle of least surprise. I believe that in
> this case, `user` providing the column name user is less surprising
> than `user` providing the current drill user, and if we can make it align
> with other SQL systems (like how Postgres behaves with ITS quoted
> identifier) then I think we have a compelling case for changing how Drill
> reacts here.
>
> Thoughts?
>
> (Once again, thanks for continued follow-up here, I love talking about
> issues like these)
>
> John
>
>
>
> On Mon, May 23, 2016 at 10:39 PM, Jinfeng Ni <jinfengn...@gmail.com>
> wrote:
>
> > mydb=# select  "user" from t1;
> >  user
> > --
> >  ABC
> >
> > I should take back what I said. With quoted identifier, Postgres
> > behaved different from Drill. Both of the interpretations seem to be
> > reasonable, since the identifier could represent two different things.
> >
> >
> > On Mon, May 23, 2016 at 7:41 PM, Zelaine Fong <zf...@maprtech.com>
> wrote:
> > > Jinfeng,
> > >
> > > What does postgres return for the following query in your example?
> > >
> > > select "user" from t1;
> > >
> > > -- Zelaine
> > >
> > > On Mon, May 23, 2016 at 7:39 PM, John Omernik <j...@omernik.com>
> wrote:
> > >
> > >> Hmm, you are correct, I don't have to like it :) but there is both
> logic
> > >> and precedence here.  Thanks for following up
> > >>
> > >> John
> > >>
> > >> On Monday, May 23, 2016, Jinfeng Ni <jinfengn...@gmail.com> wrote:
> > >>
> > >> > A quoted identifier is still an identifier (Drill uses the back tick as
> > >> > quote). Per SQL standard,  identifier CURRENT_USER / USER/
> > >> > CURRENT_SESSION/etc are implicit function calls; no () is required.
> > >> >
> > >> > I checked Postgres, and it seems to have the same behavior.
> > >> >
> > >> > mydb=# create table t1 (id int, "user" varchar(10));
> > >> >
> > >> > mydb=# insert into t1 values(100, 'ABC');
> > >> > INSERT 0 1
> > >> >
> > >> > mydb=# select * from t1;
> > >> >  id  | user
> > >> > -+--
> > >> >  100 | ABC
> > >> > (1 row)
> > >> >
> > >> > mydb=# select user from t1;
> > >> >  current_user
> > >> > --
> > >> >  postgres
> > >> > (1 row)
> > >> >
> > >> > mydb=# select t1.user from t1;
> > >> >  user
> > >> > --
> > >> >  ABC
> > >> > (1 row)
> > >> >
> > >> >
> > >> >
> > >> > On Mon, May 23, 2016 at 5:12 PM, John Omernik <j...@omernik.com>
> > >> > wrote:
> > >> > > Can (should) things inside back ticks be callable? I guess this
> > makes a
> > >> > > very difficult situation from a usability standpoint because user
> > is a
> > >> > not
> > >> > > uncommon column name (think security logs, web logs, etc) yet in
> the
> > >> > > current setup there is lots of possibility for assumptions on
> > calling
> > >> > back
> > >> >

Re: "user" as a reserved word

2016-05-23 Thread Zelaine Fong
Jinfeng,

What does postgres return for the following query in your example?

select "user" from t1;

-- Zelaine

On Mon, May 23, 2016 at 7:39 PM, John Omernik  wrote:

> Hmm, you are correct, I don't have to like it :) but there is both logic
> and precedence here.  Thanks for following up
>
> John
>
> On Monday, May 23, 2016, Jinfeng Ni  wrote:
>
> > A quoted identifier is still an identifier (Drill uses the back tick as
> > quote). Per SQL standard,  identifier CURRENT_USER / USER/
> > CURRENT_SESSION/etc are implicit function calls; no () is required.
> >
> > I checked Postgre, and seems it has the same behavior.
> >
> > mydb=# create table t1 (id int, "user" varchar(10));
> >
> > mydb=# insert into t1 values(100, 'ABC');
> > INSERT 0 1
> >
> > mydb=# select * from t1;
> >  id  | user
> > -+--
> >  100 | ABC
> > (1 row)
> >
> > mydb=# select user from t1;
> >  current_user
> > --
> >  postgres
> > (1 row)
> >
> > mydb=# select t1.user from t1;
> >  user
> > --
> >  ABC
> > (1 row)
> >
> >
> >
> > On Mon, May 23, 2016 at 5:12 PM, John Omernik wrote:
> > > Can (should) things inside back ticks be callable? I guess this makes a
> > > very difficult situation from a usability standpoint because user is a
> > not
> > > uncommon column name (think security logs, web logs, etc) yet in the
> > > current setup there is lots of possibility for assumptions on calling
> > back
> > > tick user back tick and without an error users may have wrong, but
> > "error"
> > > free results.
> > > On May 23, 2016 4:54 PM, "Jinfeng Ni" wrote:
> > >
> > >> The problem here is that identifier 'user' is not only a reserved
> > >> word, but also represents a special function ==  current_user() call.
> > >> The identifier 'user', whether it's quoted or not, could mean either
> > >> column name or the function call.  Without the table alias, it could
> > >> be ambiguous to sql parser. The table alias informs the parser that
> > >> this identifier is not a function call, but a regular identifier, thus
> > >> removes the ambiguity.
> > >>
> > >> This is different from other cases you use quoted reserved word to
> > >> represent a column name, since those reserved words do not represent a
> > >> special function, thus no ambiguity.
> > >>
> > >> select `update`, `insert` from dfs.tmp.`1.json`;
> > >> +-+-+
> > >> | update  | insert  |
> > >> +-+-+
> > >> | abc | 100 |
> > >> +-+-+
> > >>
> > >>
> > >>
> > >> On Mon, May 23, 2016 at 10:44 AM, John Omernik wrote:
> > >> > Ya, as I am testing, this works, however, the users of the system
> > expect
> > >> to
> > >> > be able to use `user` and while I can provide them instructions to
> > use a
> > >> > table alias, I am very worried that they will forget and since it
> > doesn't
> > >> > error, but instead puts in a different string, this could lead to
> bad
> > >> > downstream results...
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > On Mon, May 23, 2016 at 12:41 PM, John Omernik wrote:
> > >> >
> > >> >> I filed https://issues.apache.org/jira/browse/DRILL-4692
> > >> >>
> > >> >> I see an alias would work as a tmp fix, but this should be address
> (I
> > >> >> wonder if other words may have a problem too?)
> > >> >>
> > >> >>
> > >> >>
> > >> >> On Mon, May 23, 2016 at 12:38 PM, Andries Engelbrecht <
> > >> >> aengelbre...@maprtech.com > wrote:
> > >> >>
> > >> >>> Hmm interesting.
> > >> >>>
> > >> >>> As a workaround just use a table alias when referencing the
> column.
> > >> >>>
> > >> >>>
> > >> >>> Might be good to se if there is a JIRA for this, or file one if
> not.
> > >> >>>
> > >> >>> --Andries
> > >> >>>
> > >> >>> > On May 23, 2016, at 10:28 AM, John Omernik wrote:
> > >> >>> >
> > >> >>> > I have data with a field name user.
> > >> >>> >
> > >> >>> > When I select, with backticks, it doesn't show the field, but
> > >> instead my
> > >> >>> > current logged in user...
> > >> >>> >
> > >> >>> >
> > >> >>> > select CONVERT_FROM(`user`, 'UTF8') as `user` from table limit
> 10;
> > >> >>> >
> > >> >>> >
> > >> >>> > Shouldn't the backticks allow me to reference the field
> properly?
> > >> >>> >
> > >> >>> > John
> > >> >>>
> > >> >>>
> > >> >>
> > >>
> >
>
>
> --
> Sent from my iThing
>


Re: Regarding Excel Files and Ms Access File.

2016-05-18 Thread Zelaine Fong
Sanjiv,

As noted in the reply to your first email on this question, Drill does not
currently have such a plugin.  We're in the process of re-evaluating the
release cadence of Drill, and will no longer be doing releases on a monthly
cadence.  At this time, we do not have a firm date for the next release,
but it will likely be in the next couple of months.

-- Zelaine

On Tue, May 17, 2016 at 10:57 PM, Sanjiv Kumar  wrote:

> This is my second mail regarding Excel files (.xls files) and MS Access
> files.
>
> When will Drill provide a storage plugin for Excel files and MS Access
> files?
>
> And when is the latest Drill version, 1.7, going to be released? Can I
> expect that a storage plugin for Excel files will be there in the latest
> version (i.e., 1.7)?
>
>   Thanks & Regards
>   *Sanjiv Kumar*
>


Minutes from 5/17/16 Drill Hangout

2016-05-17 Thread Zelaine Fong
Attendees: Hakim, Parth, Paul, Jinfeng, Aman, Neeraja, Andries, Vitalii,
John O, Wojtek, Khurram, Gautam, Jeff Downton

1) Andries noted that  BI tool vendors are hitting DRILL-3510 and
DRILL-4682 in the SQL dialect they're generating.  I'll have folks on the
MapR end look into fixing these.

2) Vitalii is working on DRILL-4673.  He asked if we should add support for
the equivalent functionality with views.  John indicated yes.  Vitalii will
add it as part of fixing this for tables.

3) Aman would like feedback on DRILL-4679.  He'll initiate a discussion on
the Jira.

4) John talked about his work getting Caravel to work with Drill.  He has a
docker container that allows Caravel to talk to Drill over ODBC.  More work
is needed to get the SQL dialect working with Drill, and he'd love it if
others in the community can help.  Neeraja indicated that she will give it
a try and knows a couple of folks on the MapR end who are interested in
helping.


Re: query from hbase issue

2016-05-17 Thread Zelaine Fong
Can you provide the CREATE TABLE statement you used so we can try to
reproduce this problem on our end?

Thanks.

-- Zelaine

On Tue, May 17, 2016 at 4:50 AM, qiang li  wrote:

> Hi,
>
> I recently ran into an issue where I cannot query the correct data from
> HBase via SQL in Drill. Can anybody help me?
>
> I tested with Drill 1.6.
> My HBase schema:
> rowkey: salt+day+event+uid + ts , eg: 120160411visituidts
> cf : v
> qualifier: v, e0, e1
>
> The wrong result only happens when I use a group by clause.
>
> This SQL does not return the correct result:
> select CONVERT_FROM(a.`v`.`e0`, 'UTF8') as k, count(a.`v`.`e0`) p from
> hbase.browser_action2 a where a.row_key > '0'  group by a.`v`.`e0`;
> Part of the explain plan for this SQL is:
>
> 0: jdbc:drill:zk=rfdc5> explain plan for select CONVERT_FROM(a.`v`.`e0`,
> 'UTF8') as k, count(a.`v`.`e0`) p from hbase.browser_action2 a where
> a.row_key > '0'  group by a.`v`.`e0`;
> +--+--+
> | text | json |
> +--+--+
> | 00-00Screen
> 00-01  Project(k=[$0], p=[$1])
> 00-02UnionExchange
> 01-01  Project(k=[CONVERT_FROMUTF8($0)], p=[$1])
> 01-02HashAgg(group=[{0}], p=[$SUM0($1)])
> 01-03  Project($f0=[$0], p=[$1])
> 01-04HashToRandomExchange(dist0=[[$0]])
> 02-01  UnorderedMuxExchange
> 03-01Project($f0=[$0], p=[$1],
> E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0)])
> 03-02  HashAgg(group=[{0}], p=[COUNT($0)])
> 03-03Project($f0=[ITEM($1, 'e0')])
> 03-04  Scan(groupscan=[HBaseGroupScan
> [HBaseScanSpec=HBaseScanSpec [tableName=browser_action2, startRow=0\x00,
> stopRow=, filter=null], columns=[`*`]]])
>
> The data returns very quickly; the result of this SQL is:
> +------+--------+
> |  k   |   p    |
> +------+--------+
> | pay  | 12180  |
> +------+--------+
>
> But I have millions of rows in the table.
>
> I tried to change the physical plan. If I change the JSON plan from
> *"columns" : [ "`*`" ]* to *"columns" : [ "`v`.`e0`" ]*, it returns the
> correct result.
>
> It seems the physical plan is not correct.
> I also tried to debug the SQL parser to find out the reason, but it's too
> complicated. Can anyone help me?
>
> Also, this SQL has the same issue:
> select CONVERT_FROM(BYTE_SUBSTR(a.row_key, 1 , 9), 'UTF8')  as k,
> count(a.row_key) p from hbase.browser_action2 a group by
> BYTE_SUBSTR(a.row_key, 1 , 9);
> If I change the JSON plan from *"columns" : [ "`*`" ]* to *"columns" : [
> "`row_key`" ]*, it returns the correct result.
>
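
Until the column pruning is fixed, one untested sketch is to force the
explicit projection from SQL instead of hand-editing the plan, by
selecting the column in a derived table first.  Whether this actually
changes the scan's column list from `*` to `v`.`e0` needs to be verified
with EXPLAIN:

select convert_from(t.e0, 'UTF8') as k, count(t.e0) as p
from (select a.`v`.`e0` as e0
      from hbase.browser_action2 a
      where a.row_key > '0') t
group by t.e0;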


Re: CTAS Out of Memory

2016-05-13 Thread Zelaine Fong
Stefan,

Does your source data contain varchar columns?  We've seen instances where
Drill isn't as efficient as it can be when Parquet is dealing with variable
length columns.

-- Zelaine

On Fri, May 13, 2016 at 9:26 AM, Stefan Sedich 
wrote:

> Thanks for getting back to me so fast!
>
> I was just playing with that now, went up to 8GB and still ran into it,
> trying to go higher to see if I can find the sweet spot, only got 16GB
> total RAM on this laptop :)
>
> Is this an expected amount of memory for a not overly huge table (16
> million rows, 6 columns of integers)? Even now, a 12GB heap seems to have
> filled up again.
>
>
>
> Thanks
>
> On Fri, May 13, 2016 at 9:20 AM Jason Altekruse  wrote:
>
> > I could not find anywhere this is mentioned in the docs, but it has come
> up
> > a few times on the list. While we made a number of efforts to move our
> > interactions with the Parquet library to the off-heap memory (which we
> use
> > everywhere else in the engine during processing) the version of the
> writer
> > we are using still buffers a non-trivial amount of data into heap memory
> > when writing parquet files. Try raising your JVM heap memory in
> > drill-env.sh on startup and see if that prevents the out of memory issue.
> >
> > Jason Altekruse
> > Software Engineer at Dremio
> > Apache Drill Committer
> >
> > On Fri, May 13, 2016 at 9:07 AM, Stefan Sedich 
> > wrote:
> >
> > > Just trying to do a CTAS on a Postgres table; it is not huge and only
> > > has 16-odd million rows, but I end up with an out-of-memory error
> > > after a while.
> > >
> > > Unable to handle out of memory condition in FragmentExecutor.
> > >
> > > java.lang.OutOfMemoryError: GC overhead limit exceeded
> > >
> > >
> > > Is there a way to avoid this without needing to do the CTAS on a subset
> > of
> > > my table?
> > >
> >
>


Re: View causing poor performance on row queries on MapR-DB (and perhaps HBASE)

2016-02-07 Thread Zelaine Fong
Oops, missed the "not" in one of my sentences.  The corrected sentence
should be:

You can tell that's the case because in the case of the view query, the
explain plan has a Filter operation, whereas in the case of the non-view
query you do NOT.

-- Zelaine

On Sun, Feb 7, 2016 at 11:09 AM, Zelaine Fong <zf...@maprtech.com> wrote:

> It looks like in the case of the non-view query, the WHERE clause is being
> pushed down to MapR-DB.  The pushdown isn't happening in the case of the
> view.  You can tell that's the case because in the case of the view query,
> the explain plan has a Filter operation, whereas in the case of the
> non-view query you do.  As you noted, in the case of the non-view, the Scan
> has a startRow/stopRow/filter, which corresponds to the pushdown of the
> filter.
>
> I'm wondering if the problem is related to the CONVERT_FROM() in the WHERE
> clause, or if all filters on views aren't being pushed down?
>
> -- Zelaine
>
> On Sat, Feb 6, 2016 at 2:02 PM, John Omernik <j...@omernik.com> wrote:
>
>> Hey all, I was running some queries on a MapR-DB Table I have.  I created
>> a
>> view to avoid forcing users to write queries that always included the
>> CONVERT_FROM statements. (I am a huge advocate of making things easy for
>> the users, and writing queries with CONVERT_FROM statements isn't
>> easy).
>>
>> I ran a query the other day on one of these views and noticed that a query
>> that took 30 seconds really shouldn't take 30 seconds.  What do I mean?
>> well I wanted to get part of a record by looking up the MapR-DB Row key
>> (equiv. to HBASE row key)  That should be an instant lookup.  Sure enough,
>> when I tried it in the hbase shell that returns instantly.  So why did
>> Drill take 30 seconds?  I shot an email to Ted and Jim at MapR to ask this
>> very question. Ted suggested that I try the query without a view.  Sure
>> enough, If I use the convert_from in a direct query, it's an instant (sub
>> second) return.  Thus it appears something in the view is not allowing the
>> query to short circuit the read.
>>
>> Ted suggests I post here  (I am curious if anyone who has HBASE setup is
>> seeing this same issue with views) but also include the EXPLAIN plan.
>> Basically, using my very limited ability to read EXPLAIN plans (If someone
>> has a pointer to a blog post or docs on how to read EXPLAIN I would love
>> that!) it looks like in the view the startRow and stopRow in the
>> hbaseScanSpec are not set, seeming to cause a scan.  Is there any way to
>> assist the planner when running this through a view so that we can get the
>> performance of the query without the view but with the ease of
>> use/readability of using the view?
>>
>> Thanks!!!
>>
>> John
>>
>> View Creation
>>
>> CREATE VIEW view_testpaste as
>> SELECT
>> CONVERT_FROM(row_key, 'UTF8') AS pasteid,
>> CONVERT_FROM(pastes.pdata.lang, 'UTF8') AS lang,
>> CONVERT_FROM(pastes.raw.paste, 'UTF8') AS paste
>> FROM dfs.`pastes`.`/pastes` pastes;
>>
>>
>> Select from view takes 32 seconds (seems to be a scan)
>>
>> > select paste from view_testpaste where pasteid = 'djHEHcPM'
>>
>> 1 row selected (32.302 seconds)
>>
>>
>> Just a direct select returns very fast (0.486 seconds)
>>
>> > select CONVERT_FROM(pastes.raw.paste, 'UTF8') AS paste
>> FROM dfs.`pastes`.`/pastes` pastes where
>> CONVERT_FROM(row_key, 'UTF8') = 'djHEHcPM';
>>
>> 1 row selected (0.486 seconds)
>>
>>
>>
>>
>> EXPLAIN PLAN FOR select paste from view_testpaste where pasteid =
>> 'djHEHcPM'
>>
>> +--+--+
>> | text | json |
>> +--+--+
>> | 00-00Screen
>> 00-01  UnionExchange
>> 01-01Project(paste=[CONVERT_FROMUTF8($1)])
>> 01-02  SelectionVectorRemover
>> 01-03Filter(condition=[=(CONVERT_FROMUTF8($0), 'djHEHcPM')])
>> 01-04  Project(row_key=[$1], ITEM=[ITEM($0, 'paste')])
>> 01-05Scan(groupscan=[MapRDBGroupScan
>> [HBaseScanSpec=HBaseScanSpec [tableName=maprfs:///data/pastebiner/pastes,
>> startRow=null, stopRow=null, filter=null], columns=[`row_key`,
>> `raw`.`paste`]]])
>>  | {
>>   "head" : {
>> "version" : 1,
>> "generator" : {
>>   "type" : "ExplainHandler",
>>   "info" : ""
>> },
>> "type" : "APACHE_DRILL_PHYSICAL",
>> "options" : [ ],
>> "queue" : 0,
>

Re: View causing poor performance on row queries on MapR-DB (and perhaps HBASE)

2016-02-07 Thread Zelaine Fong
It looks like in the case of the non-view query, the WHERE clause is being
pushed down to MapR-DB.  The pushdown isn't happening in the case of the
view.  You can tell that's the case because in the case of the view query,
the explain plan has a Filter operation, whereas in the case of the
non-view query you do.  As you noted, in the case of the non-view, the Scan
has a startRow/stopRow/filter, which corresponds to the pushdown of the
filter.

I'm wondering if the problem is related to the CONVERT_FROM() in the WHERE
clause, or if all filters on views aren't being pushed down?

-- Zelaine
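
If the CONVERT_FROM() around row_key turns out to be the blocker, one
untested sketch of a workaround is to leave the row key unconverted in the
view and apply CONVERT_FROM() in the WHERE clause, mirroring the direct
query below that did push down:

CREATE VIEW view_pastes_raw AS
SELECT row_key,
       CONVERT_FROM(pastes.raw.paste, 'UTF8') AS paste
FROM dfs.`pastes`.`/pastes` pastes;

SELECT paste FROM view_pastes_raw
WHERE CONVERT_FROM(row_key, 'UTF8') = 'djHEHcPM';

Verify with EXPLAIN that the Scan then shows startRow/stopRow set.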

On Sat, Feb 6, 2016 at 2:02 PM, John Omernik  wrote:

> Hey all, I was running some queries on a MapR-DB Table I have.  I created a
> view to avoid forcing users to write queries that always included the
> CONVERT_FROM statements. (I am a huge advocate of making things easy for
> the users, and writing queries with CONVERT_FROM statements isn't easy).
>
> I ran a query the other day on one of these views and noticed that a query
> that took 30 seconds really shouldn't take 30 seconds.  What do I mean?
> well I wanted to get part of a record by looking up the MapR-DB Row key
> (equiv. to HBASE row key)  That should be an instant lookup.  Sure enough,
> when I tried it in the hbase shell that returns instantly.  So why did
> Drill take 30 seconds?  I shot an email to Ted and Jim at MapR to ask this
> very question. Ted suggested that I try the query without a view.  Sure
> enough, If I use the convert_from in a direct query, it's an instant (sub
> second) return.  Thus it appears something in the view is not allowing the
> query to short circuit the read.
>
> Ted suggests I post here  (I am curious if anyone who has HBASE setup is
> seeing this same issue with views) but also include the EXPLAIN plan.
> Basically, using my very limited ability to read EXPLAIN plans (If someone
> has a pointer to a blog post or docs on how to read EXPLAIN I would love
> that!) it looks like in the view the startRow and stopRow in the
> hbaseScanSpec are not set, seeming to cause a scan.  Is there any way to
> assist the planner when running this through a view so that we can get the
> performance of the query without the view but with the ease of
> use/readability of using the view?
>
> Thanks!!!
>
> John
>
> View Creation
>
> CREATE VIEW view_testpaste as
> SELECT
> CONVERT_FROM(row_key, 'UTF8') AS pasteid,
> CONVERT_FROM(pastes.pdata.lang, 'UTF8') AS lang,
> CONVERT_FROM(pastes.raw.paste, 'UTF8') AS paste
> FROM dfs.`pastes`.`/pastes` pastes;
>
>
> Select from view takes 32 seconds (seems to be a scan)
>
> > select paste from view_testpaste where pasteid = 'djHEHcPM'
>
> 1 row selected (32.302 seconds)
>
>
> Just a direct select returns very fast (0.486 seconds)
>
> > select CONVERT_FROM(pastes.raw.paste, 'UTF8') AS paste
> FROM dfs.`pastes`.`/pastes` pastes where
> CONVERT_FROM(row_key, 'UTF8') = 'djHEHcPM';
>
> 1 row selected (0.486 seconds)
>
>
>
>
> EXPLAIN PLAN FOR select paste from view_testpaste where pasteid =
> 'djHEHcPM'
>
> +--+--+
> | text | json |
> +--+--+
> | 00-00Screen
> 00-01  UnionExchange
> 01-01Project(paste=[CONVERT_FROMUTF8($1)])
> 01-02  SelectionVectorRemover
> 01-03Filter(condition=[=(CONVERT_FROMUTF8($0), 'djHEHcPM')])
> 01-04  Project(row_key=[$1], ITEM=[ITEM($0, 'paste')])
> 01-05Scan(groupscan=[MapRDBGroupScan
> [HBaseScanSpec=HBaseScanSpec [tableName=maprfs:///data/pastebiner/pastes,
> startRow=null, stopRow=null, filter=null], columns=[`row_key`,
> `raw`.`paste`]]])
>  | {
>   "head" : {
> "version" : 1,
> "generator" : {
>   "type" : "ExplainHandler",
>   "info" : ""
> },
> "type" : "APACHE_DRILL_PHYSICAL",
> "options" : [ ],
> "queue" : 0,
> "resultMode" : "EXEC"
>   },
>   "graph" : [ {
> "pop" : "maprdb-scan",
> "@id" : 65541,
> "userName" : "darkness",
> "hbaseScanSpec" : {
>   "tableName" : "maprfs:///data/pastebiner/pastes",
>   "startRow" : "",
>   "stopRow" : "",
>   "serializedFilter" : null
> },
> "storage" : {
>   "type" : "file",
>   "enabled" : true,
>   "connection" : "maprfs:///",
>   "workspaces" : {
> "root" : {
>   "location" : "/",
>   "writable" : false,
>   "defaultInputFormat" : null
> },
>  "pastes" : {
>   "location" : "/data/pastebiner",
>   "writable" : true,
>   "defaultInputFormat" : null
> },
> "dev" : {
>   "location" : "/data/dev",
>   "writable" : true,
>   "defaultInputFormat" : null
> },
> "hive" : {
>   "location" : "/user/hive",
>   "writable" : true,
>   "defaultInputFormat" : null
> },
> "tmp" : {
>   "location" : "/tmp",
>   "writable" : true,
>   "defaultInputFormat" : null
> }
>