Re: [ANNOUNCE] Apache Drill 1.17.0 Released

2019-12-30 Thread Aman Sinha
Congratulations on a great release ! Seems quite feature rich.
Also thanks to Volodymyr for shepherding the release even during the
holiday season.

On Fri, Dec 27, 2019 at 2:34 AM Arina Yelchiyeva 
wrote:

> Congrats everyone, great job!
>
> Kind regards,
> Arina
>
> > On 26 Dec 2019, at 20:32, Igor Guzenko 
> wrote:
> >
> > Many thanks to everyone who has contributed to the release! Great work!
> >
> >> On Thu, Dec 26, 2019 at 8:27 PM Abhishek Girish 
> wrote:
> >>
> >> Congratulations, everyone!
> >>
> >> On Thu, Dec 26, 2019 at 10:20 AM Volodymyr Vysotskyi <
> volody...@apache.org
> >>>
> >> wrote:
> >>
> >>> On behalf of the Apache Drill community, I am happy to announce the
> >> release
> >>> of Apache Drill 1.17.0.
> >>>
> >>> Drill is an Apache open-source SQL query engine for Big Data
> exploration.
> >>> Drill is designed from the ground up to support high-performance
> analysis
> >>> on the semi-structured and rapidly evolving data coming from modern Big
> >>> Data applications, while still providing the familiarity and ecosystem
> of
> >>> ANSI SQL, the industry-standard query language. Drill provides
> >>> plug-and-play integration with existing Apache Hive and Apache HBase
> >>> deployments.
> >>>
> >>> For information about Apache Drill, and to get involved, visit the
> >> project
> >>> website [1].
> >>>
> >>> A total of 200 JIRAs were resolved in this release of Drill, with the
> >>> following new features and improvements [2]:
> >>>
> >>> - Hive complex types support (DRILL-7251,
> >>> DRILL-7252, DRILL-7253, DRILL-7254)
> >>> - ESRI Shapefile (shp) (DRILL-4303) and Excel (DRILL-7177) format
> >>> plugins support
> >>> - Drill Metastore support (DRILL-7272, DRILL-7273, DRILL-7357)
> >>> - Upgrade to HADOOP-3.2 (DRILL-6540)
> >>> - Schema Provision using File / Table Function (DRILL-6835)
> >>> - Parquet runtime row group pruning (DRILL-7062)
> >>> - User-Agent UDFs (DRILL-7343)
> >>> - Canonical Map support (DRILL-7096)
> >>> - Kafka storage plugin improvements
> >>> (DRILL-6739, DRILL-6723, DRILL-7164, DRILL-7290, DRILL-7388)
> >>>
> >>> For the full list please see release notes [3].
> >>>
> >>> The binary and source artifacts are available here [4].
> >>>
> >>> Thanks to everyone in the community who contributed to this release!
> >>>
> >>> 1. https://drill.apache.org/
> >>> 2. https://drill.apache.org/blog/2019/12/26/drill-1.17-released/
> >>> 3. https://drill.apache.org/docs/apache-drill-1-17-0-release-notes/
> >>> 4. https://drill.apache.org/download/
> >>>
> >>> Kind regards,
> >>> Volodymyr Vysotskyi
> >>>
> >>
>


Re: Drill Resources Information

2019-10-14 Thread Aman Sinha
Hi Charles,
Resource provisioning is a broad area and workload-specific, but perhaps the
following presentations and doc links might help:

[1]
https://www.slideshare.net/MapRTechnologies/putting-apache-drill-into-production
[2] MapR specific, but the concepts should be generally applicable:
https://mapr.com/docs/61/Drill/configuring_apache_drill.html

On Sat, Oct 12, 2019 at 9:16 PM Charles Givre  wrote:

> Hello all,
> Is there any good information out there about Drill's resource
> recommendations?  IE:  How many servers, cores, memory etc. would be
> recommended for a given data set?
> Thanks,
> -- C


Re: [ANNOUNCE] New PMC Chair of Apache Drill

2019-08-23 Thread Aman Sinha
Congratulations Charles !   And thank you Arina !

-Aman

On Thu, Aug 22, 2019 at 9:11 PM Divya Gehlot  wrote:

> Congratulations Charles !
> Looking forward to a much better Drill and more additions to your book as
> well :)
>
> Thanks ,
> Divya
>
> On Fri, 23 Aug 2019 at 12:07 PM, Bhargava Vadlamani <
> vadlamani1...@gmail.com>
> wrote:
>
> > Congratulations Charles !!
> >
> > On Thu, Aug 22, 2019, 8:56 PM Abhishek Girish 
> wrote:
> >
> > > Congratulations, Charles!! Looking forward to what's next.
> > >
> > > Thanks a lot Arina, for your leadership in the last year. I think we
> may
> > > have added more committers and PMC members in your tenure, than ever
> > > before. The community is growing well, and I'm so glad to be a part of
> > it.
> > >
> > > On Thu, Aug 22, 2019 at 8:30 PM Kunal Khatua 
> > > wrote:
> > >
> > > > Congratulations, Charles!
> > > >
> > >
> >
>


Re: Column alias in group by behavior in 1.16

2019-08-20 Thread Aman Sinha
The change in behavior occurred in Drill 1.15 when the group-by alias
support was added [1].  Before this, we could not even group by
an alias in the SELECT list.  However, as Arina mentioned, the behavior is
dependent on Calcite which is used by Drill.
Do MySQL or other systems behave the same as Postgres for this scenario ?
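
In the meantime, one way to sidestep the ambiguity (a sketch based on the test
queries below; the x_mod alias is just illustrative) is to give the SELECT
expression an alias that does not collide with the underlying column, so the
GROUP BY reference means the same thing in any version:

   select MOD(x, 10) as x_mod
   from (select 14 as x UNION ALL select 4 as x) TableA
   group by x;   -- x can now only refer to TableA.x; returns 2 rows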

[1] https://issues.apache.org/jira/browse/DRILL-1248

On Tue, Aug 20, 2019 at 6:43 AM Arina Yelchiyeva 
wrote:

> Resolution of aliases and column names is not done in Drill; rather, it is
> done by Calcite.
> There were some discussions around the inconsistencies you mention, but it's
> up to the Calcite community to fix them.
>
> [1] https://issues.apache.org/jira/browse/CALCITE-2799
>
> Kind regards,
> Arina
>
> > On Aug 20, 2019, at 2:54 AM, Jiang Wu 
> wrote:
> >
> > Hi folks, we are testing an upgrade to 1.16 from 1.14 and noticed a
> > behavior change in 1.16 related to the support for allowing select column
> > alias in group by clause.  When there is an ambiguity to the alias name,
> > Drill 1.16 is exhibiting a different behavior comparing to expectation.
> >
> > Here is an example to illustrate:
> >
> >
> > -- Test query 1:
> >
> > select MOD(x,10) as x
> > from (select 14 as x UNION ALL select 4 as x) TableA
> > group by x;  --- does x refer to the select alias or the from-clause column?
> >
> > --- returns 2 rows if x refers to the from clause table column x
> > --- returns 1 row if x refers to the select column alias x
> >
> > -- Test query 2:
> >
> > select MOD(x,10) as x
> > from (select 14 as x UNION ALL select 4 as x) TableA
> > group by MOD(x,10);  --- does x refer to the select alias or the from-clause
> > column?
> >
> > --- returns 1 row as x here should refer to the from clause table column x
> >
> > 
> >
> > PostgreSQL 9.6
> > Test Query 1 --> 2 rows (==> x refers to from clause table column)
> > Test Query 2 --> 1 row (==> x refers to from clause table column)
> >
> > Drill 1.14 -- same as PostgreSQL 9.6
> > Test Query 1 --> 2 rows (==> x refers to from clause table column)
> > Test Query 2 --> 1 row (==> x refers to from clause table column)
> >
> > Drill 1.16 -- different results:
> > Test Query 1 --> 1 row (==> x refers to select column alias)
> > Test Query 2 --> SQL Error: VALIDATION ERROR: At line 1, column 12:
> > Expression 'TableA.x' is not being grouped (==> also implies that x refers
> > to column alias)
> >
> > Given the logical order of processing for a SQL statement (from --> where
> > --> group by --> ... --> select), it would seem that precedence
> > should be given to the from clause in case of ambiguity.  Also, to be
> > consistent with test query 2, the alias in group by should also refer to the
> > from clause before the select clause.  We'd love to hear more expert
> > opinions on this issue.
> >
> > -- Jiang
>
>


Re: Percentile window function in apache drill

2019-08-12 Thread Aman Sinha
Drill  currently supports the ranking window functions [1] but not
specifically the percentile function.

[1] https://drill.apache.org/docs/ranking-window-functions/
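
That said, the ranking functions can be used to approximate percentiles along
the lines Ted describes below.  A rough sketch (reusing the poster's table and
column names, which are assumptions here): rank the rows within each time
group with PERCENT_RANK() and take the smallest value at or above the target
rank.

select `time`, min(`sensor1`) as p90_sensor1
from (select `time`, `sensor1`,
             percent_rank() over (partition by `time` order by `sensor1`) as pr
      from dfs.`maprdb`) t
where pr >= 0.9
group by `time`;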

On Mon, Aug 12, 2019 at 7:34 AM Ted Dunning  wrote:

> Currently, the way to do this is with window queries where you sort each
> sub-group and grab the pertinent rows as an approximation of the quantiles
> you want.
>
> Another way would be to use an approximate data structure like a t-digest
> via an aggregating user-defined function (UDF).
>
> Last I checked, there were some limitations in the ability to do this, but
> things may have changed.
>
>
> On Sun, Aug 11, 2019 at 9:15 PM Manoj srivatsav  >
> wrote:
>
> > Hi,
> >
> > I have time series data stored in MapR DB JSON tables, and I am using Drill
> > to query data out of it.
> > I need to find 90th and 95th percentile values on the data that is there.
> > The time series data is from multiple sensors, with readings per time.
> >
> > I was able to get min, max and avg values by doing
> > select `time`, min(`sensor1`), max(`sensor`), avg(`sensor1`) from
> > dfs.`maprdb` group by `timestamp` order by `timestamp`.
> >
> > I need to know if there is a way to specify a percentile value to calculate
> > in the query.
> >
> > --
> > Thanks and Regards,
> > Manoj Srivatsav
> >
>


Re: Apache Drill Hangout July 23rd

2019-07-23 Thread Aman Sinha
Perhaps next time Weijie :)
In today's hangout we discussed the progress for Drill's Metastore and Hive
complex type support.
We also heard from 2 contributors who have gotten Drill to run on ARM
processors and are going to present their findings at ApacheCon 2019.
There was a question about where to find Drill's docker images.  Here are
the links:
Doc: http://drill.apache.org/docs/running-drill-on-docker/
Images: https://hub.docker.com/r/drill/apache-drill/tags


On Tue, Jul 23, 2019 at 10:50 AM weijie tong 
wrote:

> Well, sorry about missing the meeting time. I forgot to set the alarm and
> overslept. Now I can't join the meeting; maybe it has finished. I will
> issue the ParallelHashJoin PR soon.
>
> On Tue, Jul 23, 2019 at 10:14 AM Aman Sinha  wrote:
>
> > Hi Drillers,
> >
> > We will have our bi-weekly hangout tomorrow, July 23rd, at 10 AM PST
> > (link: https://meet.google.com/yki-iqdf-tai ).
> >
> > If there are any topics you would like to discuss during the hangout
> please
> > respond to this email.
> >
> > I believe last time Weijie mentioned he could talk about the hash join
> > enhancements he is working on.   Would be very interesting !
> >
> >
> > -Aman
> >
>


Re: [ANNOUNCE] New Committer: Bohdan Kazydub

2019-07-23 Thread Aman Sinha
Congratulations Bohdan and thanks much for your contributions !

On Tue, Jul 23, 2019 at 3:13 AM Igor Guzenko 
wrote:

> Congratulations, Bohdan! Great job !!!
>
> On Tue, Jul 23, 2019 at 12:33 PM Volodymyr Vysotskyi  >
> wrote:
>
> > Congratulations, Bohdan! Thanks for your contributions!
> >
> > Kind regards,
> > Volodymyr Vysotskyi
> >
> >
> > On Thu, Jul 18, 2019 at 4:55 PM Kunal Khatua  wrote:
> >
> > > Congratulations, Bohdan!
> > > On 7/16/2019 2:21:14 PM, Robert Hou  wrote:
> > > Congratulations, Bohdan. Thanks for contributing to Drill!
> > >
> > > --Robert
> > >
> > > On Tue, Jul 16, 2019 at 11:50 AM hanu mapr wrote:
> > >
> > > > Congratulations Bohdan!
> > > >
> > > > On Tue, Jul 16, 2019 at 9:30 AM Gautam Parai wrote:
> > > >
> > > > > Congratulations Bohdan!
> > > > >
> > > > > Gautam
> > > > >
> > > > > On Mon, Jul 15, 2019 at 11:53 PM Bohdan Kazydub
> > > > bohdan.kazy...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Thank you all for your support!
> > > > > >
> > > > > > On Tue, Jul 16, 2019 at 4:16 AM weijie tong
> > > > > > wrote:
> > > > > >
> > > > > > > Congrats Bohdan!
> > > > > > >
> > > > > > > On Tue, Jul 16, 2019 at 12:54 AM Vitalii Diravka
> > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Congrats Bohdan! Well deserved!
> > > > > > > >
> > > > > > > > Kind regards
> > > > > > > > Vitalii
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, Jul 15, 2019 at 6:48 PM Paul Rogers
> > > > >
> > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Congrats Bohdan!
> > > > > > > > > - Paul
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Monday, July 15, 2019, 01:08:04 AM PDT, Arina Ielchiieva
> > > > > > > > > ar...@apache.org> wrote:
> > > > > > > > >
> > > > > > > > > The Project Management Committee (PMC) for Apache Drill has
> > > > > invited
> > > > > > > > Bohdan
> > > > > > > > > Kazydub to become a committer, and we are pleased to
> announce
> > > > that
> > > > > he
> > > > > > > has
> > > > > > > > > accepted.
> > > > > > > > >
> > > > > > > > > Bohdan has been contributing to Drill for more than a year.
> > > > > > > > > His contributions include logging and various
> > > > > > > > > function-handling improvements, planning optimizations,
> > > > > > > > > and S3 improvements / fixes. His recent work includes the
> > > > > > > > > Calcite 1.19 / 1.20 upgrades
> > > > > > > > > [DRILL-7200] and implementation of canonical Map [DRILL-7096].
> > > > > > > > >
> > > > > > > > > Welcome Bohdan, and thank you for your contributions!
> > > > > > > > >
> > > > > > > > > - Arina
> > > > > > > > > (on behalf of the Apache Drill PMC)
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: [ANNOUNCE] New Committer: Igor Guzenko

2019-07-23 Thread Aman Sinha
Congratulations Igor and thanks for your contributions to Drill !

On Tue, Jul 23, 2019 at 3:33 AM Anton Gozhiy  wrote:

> Congratulations Igor, well deserved!
>
> On Tue, Jul 23, 2019, 12:31 Volodymyr Vysotskyi 
> wrote:
>
> > Congratulations, Ihor! Thanks for your contributions!
> >
> > Kind regards,
> > Volodymyr Vysotskyi
> >
> >
> > On Mon, Jul 22, 2019 at 5:02 PM Arina Ielchiieva 
> wrote:
> >
> > > The Project Management Committee (PMC) for Apache Drill has invited
> Igor
> > > Guzenko to become a committer, and we are pleased to announce that he
> has
> > > accepted.
> > >
> > > Igor has been contributing to Drill for 9 months and has made a number of
> > > significant contributions, including cross join syntax support, Hive views
> > > support, as well as improving performance for Hive show schema and unit
> > > tests. Currently he is working on supporting Hive complex types
> > > [DRILL-3290]. He has already added support for the list type and is
> > > working on struct and canonical map.
> > >
> > > Welcome Igor, and thank you for your contributions!
> > >
> > > - Arina
> > > (on behalf of the Apache Drill PMC)
> > >
> >
>


Apache Drill Hangout July 23rd

2019-07-22 Thread Aman Sinha
Hi Drillers,

We will have our bi-weekly hangout tomorrow, July 23rd, at 10 AM PST
(link: https://meet.google.com/yki-iqdf-tai ).

If there are any topics you would like to discuss during the hangout please
respond to this email.

I believe last time Weijie mentioned he could talk about the hash join
enhancements he is working on.   Would be very interesting !


-Aman


Re: strange planning error

2019-05-24 Thread Aman Sinha
The changing of the column order in the SELECT should not materially affect
the planning, so this does look like a bug.   I was able to repro it with a
simpler example (without the group-by):
   select row_number() over (order by department_id desc) r, department_id
   from (select department_id
         from cp.`employee.json`
         order by department_id desc);

Error: SYSTEM ERROR: CannotPlanException: Node
[rel#3170:Subset#4.LOGICAL.ANY([]).[1 DESC]] could not be implemented;
planner state:

I believe the 2 ORDER BYs are causing an issue with the plan generation, but
more analysis is needed by looking at the Calcite trace.  However, that
brings up the question of why you need 2 ORDER BYs ?   If the ROW_NUMBER()
is already doing an ORDER BY DESC on the exact same column, the subquery
ORDER BY is redundant.  If you remove it, the query runs OK...

apache drill> select row_number() over (order by cnt desc) r, cnt
. .semicolon> from (select count(1) as cnt
. . . . . .)>   from  cp.`employee.json`
. . . . . .)>   group by department_id);

+----+-----+
| r  | cnt |
+----+-----+
| 1  | 268 |
| 2  | 264 |
| 3  | 226 |
| 4  | 222 |
| 5  | 100 |
| 6  | 32  |
| 7  | 16  |
| 8  | 9   |
| 9  | 7   |
| 10 | 5   |
| 11 | 4   |
| 12 | 2   |
+----+-----+

12 rows selected

In any case if you can file a JIRA for the original problem that would be
good.

Thanks.

On Fri, May 24, 2019 at 2:57 PM Ted Dunning  wrote:

> I have a bunch of data that I want to group up and look at the counts. In
> order to get a row number for plotting, I tried a window function.
>
> The data consists of about 7.2 million rows accessed via a view[1]. The
> columns are pretty much all untyped[2].
>
> This query works great:
>
>
> with
> t0 as (
>   select count(1) cnt
>   from  dfs.flt.`flights-2018-01.csv`
>   group by columns[5]
>   order by cnt desc)
> select cnt, row_number() over (order by cnt desc) r
> from t0
>
>
> But if I change the order of the columns like this:
>
> with
> t0 as (
>   select count(1) cnt
>   from  dfs.flt.`flights-2018-01.csv`
>   group by columns[5]
>   order by cnt desc)
> select row_number() over (order by cnt desc) r, cnt
> from t0
>
>
> I get the error below in query planning. That seems so very wrong.
>
> Any ideas? I know I can just avoid the issue, but I was hoping for some
> insight.
>
> For extra oddness points, if I use a common table expression and invert the
> field order, I get what I want with no error:
>
>
> with
> t0 as (
>   select count(1) cnt
>   from  dfs.flt.`flights-2018-01.csv`
>   group by columns[5]
>   order by cnt desc),
> t1 as (
>   select cnt, row_number() over (order by cnt desc) r
>   from t0)
> select r, cnt
> from t1
>
>
> This means that the planner is not rewriting this to eliminate the common
> table before planning.
>
>
> java.sql.SQLException: [MapR][DrillJDBCDriver](500165) Query execution
> error. Details: SYSTEM ERROR: CannotPlanException: Node
> [rel#18615:Subset#8.LOGICAL.ANY([]).[1 DESC]] could not be implemented;
> planner state: Root: rel#18615:Subset#8.LOGICAL.ANY([]).[1 DESC] Original
> rel: LogicalProject(subset=[rel#18615:Subset#8.LOGICAL.ANY([]).[1 DESC]],
> r=[$1], cnt=[$0]): rowcount = 10.0, cumulative cost = {10.0 rows, 20.0 cpu,
> 0.0 io, 0.0 network, 0.0 memory}, id = 18613
> LogicalWindow(subset=[rel#18612:Subset#7.NONE.ANY([]).[1 DESC]],
> window#0=[window(partition {} order by [0 DESC] rows between UNBOUNDED
> PRECEDING and CURRENT ROW aggs [ROW_NUMBER()])]): rowcount = 10.0,
> cumulative cost = {10.0 rows, 20.0 cpu, 0.0 io, 0.0 network, 0.0 memory},
> id = 18611 LogicalSort(subset=[rel#18610:Subset#6.NONE.ANY([]).[0 DESC]],
> sort0=[$0], dir0=[DESC]): rowcount = 10.0, cumulative cost = {10.0 rows,
> 92.10340371976184 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 18609
> LogicalProject(subset=[rel#18608:Subset#5.NONE.ANY([]).[]], cnt=[$1]):
> rowcount = 10.0, cumulative cost = {10.0 rows, 10.0 cpu, 0.0 io, 0.0
> network, 0.0 memory}, id = 18607
> LogicalAggregate(subset=[rel#18606:Subset#4.NONE.ANY([]).[]], group=[{0}],
> cnt=[COUNT($1)]): rowcount = 10.0, cumulative cost = {11.25 rows, 0.0 cpu,
> 0.0 io, 0.0 network, 0.0 memory}, id = 18605
> LogicalProject(subset=[rel#18604:Subset#3.NONE.ANY([]).[]],
> dest_airport_id=[$5], $f1=[1]): rowcount = 100.0, cumulative cost = {100.0
> rows, 200.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 18603
> LogicalProject(subset=[rel#18602:Subset#2.NONE.ANY([]).[]], fl_date=[$0],
> op_unique_carrier=[$1], tail_num=[$2], op_carrier_fl_num=[$3],
> origin_airport_id=[$4], dest_airport_id=[$5], crs_dep_time=[$6],
> dep_time=[$7], dep_delay=[$8], taxi_out=[$9], wheels_off=[$10],
> wheels_on=[$11], taxi_in=[$12], arr_time=[$13], arr_delay=[$14],
> air_time=[$15], distance=[$16]): rowcount 

Re: May Apache Drill board report

2019-05-03 Thread Aman Sinha
+1

On Fri, May 3, 2019 at 1:40 PM Volodymyr Vysotskyi 
wrote:

> Looks good, +1
>
>
> Пт, 3 трав. 2019 23:32 користувач Arina Ielchiieva 
> пише:
>
> > Hi all,
> >
> > please take a look at the draft board report for the last quarter and let
> > me know if you have any comments.
> >
> > Thanks,
> > Arina
> >
> > =
> >
> > ## Description:
> > - Drill is a Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud
> >   Storage.
> >
> > ## Issues:
> >  - There are no issues requiring board attention at this time.
> >
> > ## Activity:
> > - Since the last board report, Drill has released version 1.16.0,
> including
> >   the following enhancements:
> >   - CREATE OR REPLACE SCHEMA command to define a schema for text files
> >   - REFRESH TABLE METADATA command can generate metadata cache files for
> >   specific columns
> >   - ANALYZE TABLE statement to compute statistics on Parquet data
> >   - SYSLOG (RFC-5424) Format Plugin
> >   - NEAREST DATE function to facilitate time series analysis
> >   - Format plugin for LTSV files
> >   - Ability to query Hive views
> >   - Upgrade to SQLLine 1.7
> >   - Apache Calcite upgrade to 1.18.0
> >   - Several Drill Web UI improvements, including:
> >  - Storage plugin management improvements
> >  - Query progress indicators and warnings
> >  - Ability to limit the result size for better UI response
> >  - Ability to sort the list of profiles in the Drill Web UI
> >  - Display query state in query result page
> >  - Button to reset the options filter
> >
> > - Drill User Meetup will be held on May 22, 2019. Two talks are planned:
> >   - Alibaba's Usage of Apache Drill for querying a Time Series Database
> >   - What’s new with Apache Drill 1.16 & a demo of Schema Provisioning
> >
> > ## Health report:
> > - The project is healthy. Development activity as reflected in the pull
> >   requests and JIRAs is good.
> > - Activity on the dev and user mailing lists are stable.
> > - One PMC member was added in the last period.
> >
> > ## PMC changes:
> >
> > - Currently 24 PMC members.
> > - Sorabh Hamirwasia was added to the PMC on Fri Apr 05 2019
> >
> > ## Committer base changes:
> >
> > - Currently 51 committers.
> > - No new committers added in the last 3 months
> > - Last committer addition was Salim Achouche at Mon Dec 17 2018
> >
> > ## Releases:
> >
> > - 1.16.0 was released on Thu May 02 2019
> >
> > ## Mailing list activity:
> >
> > - d...@drill.apache.org:
> >- 406 subscribers (down -10 in the last 3 months):
> >- 2299 emails sent to list (1903 in previous quarter)
> >
> > - iss...@drill.apache.org:
> >- 17 subscribers (down -1 in the last 3 months):
> >- 2373 emails sent to list (2233 in previous quarter)
> >
> > - user@drill.apache.org:
> >- 582 subscribers (down -15 in the last 3 months):
> >- 235 emails sent to list (227 in previous quarter)
> >
> > ## JIRA activity:
> >
> > - 214 JIRA tickets created in the last 3 months
> > - 212 JIRA tickets closed/resolved in the last 3 months
> >
>


Re: [RESULT] [VOTE] Apache Drill Release 1.16.0 - RC2

2019-05-01 Thread Aman Sinha
Great !  Thanks for managing this release Sorabh !

On Wed, May 1, 2019 at 9:22 AM SorabhApache  wrote:

> Hi All,
> RC2 candidate for 1.16.0 passes the voting criteria. Thanks to everyone who
> has tested and voted for release candidate. The summary of voting is:
>
> Total Votes: 8
> 5x +1 (binding): Aman, Arina, Boaz, Sorabh, Vova
> 3x +1 (non-binding): Anton, Denys, Kunal
> No 0s or -1s.
>
> I'll start the process of publishing the release artifacts and send an
> announcement once it's propagated.
>
> Thanks,
> Sorabh
>


Re: [VOTE] Apache Drill Release 1.16.0 - RC2

2019-04-29 Thread Aman Sinha
Downloaded binary tarball on my Mac  and ran in embedded mode.
Verified Sorabh's release signature and the tar file's checksum
Did a quick glance through maven artifacts
Did some manual tests with TPC-DS  Web_Sales table and ran REFRESH METADATA
command against the same table
Checked runtime query profiles of above queries and verified COUNT(*),
COUNT(column) optimization is getting applied.
Also did a build from source on my linux VM.

RC2 looks good !   +1

On Fri, Apr 26, 2019 at 8:28 AM SorabhApache  wrote:

> Hi Drillers,
> I'd like to propose the third release candidate (RC2) for the Apache Drill,
> version 1.16.0.
>
> Changes since the previous release candidate:
> DRILL-7201: Strange symbols in error window (Windows)
> DRILL-7202: Failed query shows warning that fragments has made no progress
> DRILL-7207: Update the copyright year in NOTICE.txt file
> DRILL-7212: Add gpg key with apache.org email for sorabh
> DRILL-7213: drill-format-mapr.jar contains stale git.properties file
>
> The RC2 includes total of 220 resolved JIRAs [1].
> Thanks to everyone for their hard work to contribute to this release.
>
> The tarball artifacts are hosted at [2] and the maven artifacts are hosted
> at [3].
>
> This release candidate is based on commit
> 751e87736c2ddbc184b52cfa56f4e29c68417cfe located at [4].
>
> Please download and try out the release candidate.
>
> The vote ends at 04:00 PM UTC (09:00 AM PDT, 07:00 PM EET, 09:30 PM IST),
> May 1st, 2019
>
> [ ] +1
> [ ] +0
> [ ] -1
>
> Here is my vote: +1
>   [1]
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12313820=12344284
>   [2] http://home.apache.org/~sorabh/drill/releases/1.16.0/rc2/
>   [3]
> https://repository.apache.org/content/repositories/orgapachedrill-1073/
>   [4] https://github.com/sohami/drill/commits/drill-1.16.0
>
> Thanks,
> Sorabh
>


Re: [VOTE] Apache Drill Release 1.16.0 - RC1

2019-04-26 Thread Aman Sinha
>>> 1 row selected (0.274 seconds)
> >>>>> apache drill>
> >>>>>
> >>>>> Thanks,
> >>>>> Khurram
> >>>>>
> >>>>> On Wed, Apr 24, 2019 at 9:52 AM SorabhApache wrote:
> >>>>>
> >>>>>> Hi Volodymyr/Anton,
> >>>>>> I can verify that I am seeing both the below issues as reported by
> >>> Anton
> >>>>>> and Volodymyr. I will investigate further why we are seeing these
> >>>> issues.
> >>>>>> Thanks for catching this. Can you please open JIRA's for these
> >> issues
> >>> ?
> >>>>>>
> >>>>>> 1) Wrong result for sys.version query when built from source
> >> tarball.
> >>>>>> 2) git.properties file in drill-format-mapr-1.16.0.jar has wrong
> >>> commit
> >>>> id
> >>>>>> but as Volodymyr mentioned because of order in which jars are picked
> >>> up
> >>>>>> it's not showing the issue when tried from prebuilt tarball.
> >>>>>>
> >>>>>> @Volodymyr Vysotskyi
> >>>>>> Regarding the GPG key I am not sure if we mandate it to use apache.org;
> >>>>>> there are other keys in the file which are using gmail addresses as
> >>>>>> well. As far as the signing person is authenticating the key and details
> >>>>>> associated with it, I think it should be fine. But since it's
> >>>>>> recommended, I will use the apache.org email address instead.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Sorabh
> >>>>>>
> >>>>>> On Wed, Apr 24, 2019 at 8:53 AM Volodymyr Vysotskyi
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi Aman,
> >>>>>>>
> >>>>>>> There are two different issues connected with the git.properties file.
> >>>>>>> Regarding the problem I have mentioned, the prebuilt tar
> >>>>>>> (apache-drill-1.16.0.tar.gz) contains the drill-format-mapr-1.16.0.jar
> >>>>>>> jar, which contains a git.properties file with the incorrect version.
> >>>>>>> When a select * from sys.version query is submitted, the class loader
> >>>>>>> finds the first file named git.properties on the classpath (each Drill
> >>>>>>> jar contains its own git.properties file), and in my case the file from
> >>>>>>> drill-format-mapr-1.16.0.jar is picked up, so the incorrect result is
> >>>>>>> returned. But it may not be reproducible on other machines since it
> >>>>>>> depends on the order of files for the class loader.
> >>>>>>>
> >>>>>>> Regarding the problem Anton has mentioned, Drill should be built from
> >>>>>>> the sources (apache-drill-1.16.0-src.tar.gz), and for that version,
> >>>>>>> select * from sys.version returns the result without information about
> >>>>>>> the commit.
> >>>>>>>
> >>>>>>> Kind regards,
> >>>>>>> Volodymyr Vysotskyi
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, Apr 24, 2019 at 6:33 PM Aman Sinha wrote:
> >>>>>>>
> >>>>>>>> This works fine for me with the binary tarball that I installed on
> >>> my
> >>>>>>> Mac.
> >>>>>>>> ..it shows the correct commit message.
> >>>>>>>>
> >>>>>>>> Apache Drill 1.16.0
> >>>>>>>>
> >>>>>>>> "This isn't your grandfather's SQL."
> >>>>>>>>
> >>>>>>>> apache drill> *select* * *from* sys.version;
> >>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>
> >>>
> >>
> +-+--+---

Re: [VOTE] Apache Drill Release 1.16.0 - RC1

2019-04-24 Thread Aman Sinha
; directory:
> > > > #Generated by Git-Commit-Id-Plugin
> > > > #Mon Apr 22 09:52:07 PDT 2019
> > > > git.branch=cf5b758e0a4c22b75bfb02ac2653ff09415ddf53
> > > > git.build.host=SHamirw-E755.local
> > > > git.build.time=22.04.2019 @ 09\:52\:07 PDT
> > > > git.build.user.email=sor...@apache.org
> > > > git.build.user.name=Sorabh Hamirwasia
> > > > git.build.version=1.16.0
> > > > git.closest.tag.commit.count=0
> > > > git.closest.tag.name=drill-1.16.0
> > > > git.commit.id=cf5b758e0a4c22b75bfb02ac2653ff09415ddf53
> > > > git.commit.id.abbrev=cf5b758
> > > > git.commit.id.describe=drill-1.16.0-0-gcf5b758
> > > > git.commit.id.describe-short=drill-1.16.0-0
> > > > git.commit.message.full=[maven-release-plugin] prepare release
> > > drill-1.16.0
> > > > git.commit.message.short=[maven-release-plugin] prepare release
> > > > drill-1.16.0
> > > > git.commit.time=22.04.2019 @ 09\:08\:36 PDT
> > > > git.commit.user.email=sor...@apache.org
> > > > git.commit.user.name=Sorabh Hamirwasia
> > > > git.dirty=false
> > > > git.remote.origin.url=https\://github.com/apache/drill.git
> > > > git.tags=drill-1.16.0
> > > > git.total.commit.count=3568
> > > >
> > > > But looks like it doesn't get into the classpath.
> > > > Could someone take a look into this?
> > > >
> > > > Thanks!
> > > >
> > > > On Wed, Apr 24, 2019 at 11:50 AM Volodymyr Vysotskyi <
> > > volody...@apache.org>
> > > > wrote:
> > > >
> > > >> Hi Sorabh,
> > > >>
> > > >> Sorry for being picky, but looks like the key you have published was
> > > >> generated for non-apache email: sohami.apa...@gmail.com. According
> to
> > > the
> > > >> [1], it is highly recommended to use Apache email address as the
> > primary
> > > >> User-ID.
> > > >>
> > > >> [1] https://www.apache.org/dev/release-signing#user-id
> > > >>
> > > >> Kind regards,
> > > >> Volodymyr Vysotskyi
> > > >>
> > > >>
> > > >> On Wed, Apr 24, 2019 at 10:10 AM Jyothsna Reddy <
> > jyothsna@gmail.com
> > > >
> > > >> wrote:
> > > >>
> > > >> > Built it from cloning the git branch and unit tests on my Linux VM
> > > (time
> > > >> > taken - 43 min).
> > > >> > Tested new features of metadata caching by creating v4 cache files
> > > using
> > > >> > new Refresh Metadata commands and manually verified the cache
> files.
> > > >> Tried
> > > >> > a few queries that use metadata cache and verified results.
> > > >> >
> > > >> > The release looks good to me +1.
> > > >> >
> > > >> > Thank you,
> > > >> > Jyothsna
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> > On Wed, Apr 24, 2019 at 12:09 AM Jyothsna Reddy <
> > > jyothsna@gmail.com
> > > >> >
> > > >> > wrote:
> > > >> >
> > > >> > > Built it from cloning the git branch and unit tests on my Linux
> VM
> > > >> (time
> > > >> > > taken - 43 min).
> > > >> > > Tested new features of metadata caching by creating v4 cache
> files
> > > >> using
> > > >> > > new Refresh Metadata commands and manually verified the cache
> > files.
> > > >> > Tried
> > > >> > > a few queries that use metadata cache and verified results.
> > > >

Re: [VOTE] Apache Drill Release 1.16.0 - RC1

2019-04-23 Thread Aman Sinha
Hi Vova,
I added some thoughts in the DRILL-7195 JIRA.

Aman

On Tue, Apr 23, 2019 at 6:06 AM Volodymyr Vysotskyi 
wrote:

> Hi all,
>
> I did some checks and found the following issues:
> - DRILL-7195 
> - DRILL-7194 
> - DRILL-7192 
>
> One of them (DRILL-7194) is also reproduced on the previous version,
> another is connected with the new feature (DRILL-7192), so I don't think
> that we should treat them as blockers.
> The third one (DRILL-7195) is a regression and in some cases may cause the
> wrong results, so I think that it should be fixed before the release.
> Any thoughts?
>
> Kind regards,
> Volodymyr Vysotskyi
>
>
> On Mon, Apr 22, 2019 at 8:58 PM SorabhApache  wrote:
>
> > < Please disregard the previous email, one of the links in it is not
> > correct. Use the information in this email instead >
> >
> > Hi Drillers,
> > I'd like to propose the second release candidate (RC1) for the Apache
> > Drill,
> > version 1.16.0.
> >
> > Changes since the previous release candidate:
> > DRILL-7185: Drill Fails to Read Large Packets
> > DRILL-7186: Missing storage.json REST endpoint
> > DRILL-7190: Missing backward compatibility for REST API with DRILL-6562
> >
> > Also, the below 2 JIRAs were created to separately track the revert of
> > protobuf changes in 1.16.0:
> > DRILL-7188: Revert DRILL-6642: Update protocol-buffers version
> > DRILL-7189: Revert DRILL-7105 Error while building the Drill native
> client
> >
> > The RC1 includes total of 215 resolved JIRAs [1].
> > Thanks to everyone for their hard work to contribute to this release.
> >
> > The tarball artifacts are hosted at [2] and the maven artifacts are
> hosted
> > at [3].
> >
> > This release candidate is based on commit
> > cf5b758e0a4c22b75bfb02ac2653ff09415ddf53 located at [4].
> >
> > Please download and try out the release candidate.
> >
> > The vote ends at 06:00 PM UTC (11:00 AM PDT, 09:00 PM EET, 11:30 PM IST),
> > Apr 25th, 2019
> >
> > [ ] +1
> > [ ] +0
> > [ ] -1
> >
> > Here is my vote: +1
> >   [1]
> >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12313820=12344284
> >   [2] http://home.apache.org/~sorabh/drill/releases/1.16.0/rc1/
> >   [3]
> > https://repository.apache.org/content/repositories/orgapachedrill-1067/
> >   [4] https://github.com/sohami/drill/commits/drill-1.16.0
> >
> > Thanks,
> > Sorabh
> >
> > >
> >
>


Re: Blocker on drill upgrade path

2019-04-19 Thread Aman Sinha
Interesting that it ran on 1.13, but I still think the new behavior is the
right one.  Several changes went into Calcite between Drill's 1.13 and 1.15
releases, so I cannot identify when this behavior changed.   Can you use a
slightly different alias name ?  The following should work:
select max(last_name) *max_last_name* from cp.`employee.json` group by
last_name limit 5;

On Fri, Apr 19, 2019 at 2:24 PM Nitin Pawar  wrote:

> Sorry, my bad. I meant that the query which was failing was the one with the
> alias. Following is the output on Drill 1.13.0:
>
> bash-3.2$ bin/drill-embedded
> Apr 20, 2019 2:46:45 AM org.glassfish.jersey.server.ApplicationHandler
> initialize
> INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29
> 01:25:26...
> apache drill 1.13.0-SNAPSHOT
> "a drill in the hand is better than two in the bush"
> 0: jdbc:drill:zk=local> select max(last_name) last_name from
> cp.`employee.json` group by
> . . . . . . . . . . . > last_name limit 5;
> ++
> | last_name  |
> ++
> | Nowmer |
> | Whelply|
> | Spence |
> | Gutierrez  |
> | Damstra|
> ++
>
>
> On Sat, Apr 20, 2019 at 1:40 AM Aman Sinha  wrote:
>
> > This is legal:
> >   select max(last_name)  from cp.`employee.json` group by last_name limit
> > 5;
> > But this is not:
> >   select max(last_name) last_name from cp.`employee.json` group by
> > last_name limit 5;
> >
> > The reason is the second query is aliasing the max() output to
> 'last_name'
> > which is being referenced in the group-by clause.  Referencing an
> aggregate
> > expr in the group-by is not allowed by SQL standards, hence Calcite
> (which
> > does the parsing and validation, not Drill) throws this error during
> > validation phase.  Detailed error stack is below.  I don't think this
> would
> > have worked in 1.13 either.  My guess is you may have run the first query
> > in 1.13 and that should still continue to work.
> >
> > Validation error thrown by Calcite:
> >
> > Caused By (org.apache.calcite.sql.validate.SqlValidatorException)
> Aggregate
> > expression is illegal in GROUP BY clause
> >
> > sun.reflect.NativeConstructorAccessorImpl.newInstance0():-2
> >
> > sun.reflect.NativeConstructorAccessorImpl.newInstance():62
> >
> > sun.reflect.DelegatingConstructorAccessorImpl.newInstance():45
> >
> > java.lang.reflect.Constructor.newInstance():423
> >
> > org.apache.calcite.runtime.Resources$ExInstWithCause.ex():463
> >
> > org.apache.calcite.runtime.Resources$ExInst.ex():572
> >
> > org.apache.calcite.sql.SqlUtil.newContextException():787
> >
> > org.apache.calcite.sql.SqlUtil.newContextException():772
> >
> >
> >
> org.apache.calcite.sql.validate.SqlValidatorImpl.newValidationError():4788
> >
> >
> >
> org.apache.calcite.sql.validate.SqlValidatorImpl.validateGroupClause():3941
> >
> >
>  org.apache.calcite.sql.validate.SqlValidatorImpl.validateSelect():3306
> >
> > org.apache.calcite.sql.validate.SelectNamespace.validateImpl():60
> >
> > org.apache.calcite.sql.validate.AbstractNamespace.validate():84
> >
> >
> > org.apache.calcite.sql.validate.SqlValidatorImpl.validateNamespace():977
> >
> > org.apache.calcite.sql.validate.SqlValidatorImpl.validateQuery():953
> >
> > org.apache.calcite.sql.SqlSelect.validate():216
> >
> >
> >
> >
> org.apache.calcite.sql.validate.SqlValidatorImpl.validateScopedExpression():928
> >
> > org.apache.calcite.sql.validate.SqlValidatorImpl.validate():632
> >
> > org.apache.drill.exec.planner.sql.SqlConverter.validate():207
> >
> > On Fri, Apr 19, 2019 at 12:39 PM Nitin Pawar 
> > wrote:
> >
> > > I think the error is not with storage plugin but with query parsing
> > >
> > > here is the exception
> > > 0: jdbc:drill:zk=local> select max(last_name) last_name from
> > > cp.`employee.json` group by last_name limit 5;
> > > Error: VALIDATION ERROR: From line 1, column 8 to line 1, column 21:
> > > Aggregate expression is illegal in GROUP BY clause
> > >
> > > On Fri, Apr 19, 2019 at 11:57 PM Kunal Khatua 
> wrote:
> > >
> > > > (Replying on the Drill user list)
> > > >
> > > > This is odd. The CP storage plugin is inbuilt with Drill and that
> > hasn't
> > > > changed. 1.15 by itself works fine.
> > > >
> > > > What is the error you are seeing, Nitin?
> > > >
> > > >
> > > > On 4/18/2019 10:58:48 PM, Nitin Pawar 
> wrote:
> > > > Hi,
> > > >
> > > > We are trying to upgrade drill from 1.13 to 1.15
> > > > following query works in drill 1.13 but not in 1.15
> > > >
> > > > select max(last_name) from cp.`employee.json` group by last_name
> limit
> > 5
> > > >
> > > > can you let us know if this backward compatibility issue will be
> fixed
> > ??
> > > >
> > > > --
> > > > Nitin Pawar
> > > >
> > >
> > >
> > > --
> > > Nitin Pawar
> > >
> >
>
>
> --
> Nitin Pawar
>


Re: Blocker on drill upgrade path

2019-04-19 Thread Aman Sinha
This is legal:
  select max(last_name)  from cp.`employee.json` group by last_name limit 5;
But this is not:
  select max(last_name) last_name from cp.`employee.json` group by
last_name limit 5;

The reason is the second query is aliasing the max() output to 'last_name'
which is being referenced in the group-by clause.  Referencing an aggregate
expr in the group-by is not allowed by SQL standards, hence Calcite (which
does the parsing and validation, not Drill) throws this error during
validation phase.  Detailed error stack is below.  I don't think this would
have worked in 1.13 either.  My guess is you may have run the first query
in 1.13 and that should still continue to work.

Validation error thrown by Calcite:

Caused By (org.apache.calcite.sql.validate.SqlValidatorException) Aggregate
expression is illegal in GROUP BY clause

sun.reflect.NativeConstructorAccessorImpl.newInstance0():-2

sun.reflect.NativeConstructorAccessorImpl.newInstance():62

sun.reflect.DelegatingConstructorAccessorImpl.newInstance():45

java.lang.reflect.Constructor.newInstance():423

org.apache.calcite.runtime.Resources$ExInstWithCause.ex():463

org.apache.calcite.runtime.Resources$ExInst.ex():572

org.apache.calcite.sql.SqlUtil.newContextException():787

org.apache.calcite.sql.SqlUtil.newContextException():772


org.apache.calcite.sql.validate.SqlValidatorImpl.newValidationError():4788


org.apache.calcite.sql.validate.SqlValidatorImpl.validateGroupClause():3941

org.apache.calcite.sql.validate.SqlValidatorImpl.validateSelect():3306

org.apache.calcite.sql.validate.SelectNamespace.validateImpl():60

org.apache.calcite.sql.validate.AbstractNamespace.validate():84

org.apache.calcite.sql.validate.SqlValidatorImpl.validateNamespace():977

org.apache.calcite.sql.validate.SqlValidatorImpl.validateQuery():953

org.apache.calcite.sql.SqlSelect.validate():216


org.apache.calcite.sql.validate.SqlValidatorImpl.validateScopedExpression():928

org.apache.calcite.sql.validate.SqlValidatorImpl.validate():632

org.apache.drill.exec.planner.sql.SqlConverter.validate():207

On Fri, Apr 19, 2019 at 12:39 PM Nitin Pawar 
wrote:

> I think the error is not with storage plugin but with query parsing
>
> here is the exception
> 0: jdbc:drill:zk=local> select max(last_name) last_name from
> cp.`employee.json` group by last_name limit 5;
> Error: VALIDATION ERROR: From line 1, column 8 to line 1, column 21:
> Aggregate expression is illegal in GROUP BY clause
>
> On Fri, Apr 19, 2019 at 11:57 PM Kunal Khatua  wrote:
>
> > (Replying on the Drill user list)
> >
> > This is odd. The CP storage plugin is inbuilt with Drill and that hasn't
> > changed. 1.15 by itself works fine.
> >
> > What is the error you are seeing, Nitin?
> >
> >
> > On 4/18/2019 10:58:48 PM, Nitin Pawar  wrote:
> > Hi,
> >
> > We are trying to upgrade drill from 1.13 to 1.15
> > following query works in drill 1.13 but not in 1.15
> >
> > select max(last_name) from cp.`employee.json` group by last_name limit 5
> >
> > can you let us know if this backward compatibility issue will be fixed ??
> >
> > --
> > Nitin Pawar
> >
>
>
> --
> Nitin Pawar
>


Re: Query Question

2019-04-11 Thread Aman Sinha
> I thought flatten() would be the answer, however, if I flatten the
columns, I get the following result:

Regarding the flatten() output, this is expected because doing a 'SELECT
flatten(a),  flatten(b) FROM T'  is equivalent to doing a cross-product of
the 2 arrays.

In your example, both arrays are the same length, but what would you expect
the output to be if they were different ?   I don't see a direct SQL way of
doing it but
even with UDFs the semantics should be defined.
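
That said, when the arrays are small and of known, equal length, a crude
workaround sketch is possible with literal subscripts (the path below is a
placeholder):

select t.compound_field.field_1[0] as field1,
       t.compound_field.field_2[0] as field2
from dfs.`/path/to/data.json` t
union all
select t.compound_field.field_1[1], t.compound_field.field_2[1]
from dfs.`/path/to/data.json` t
union all
select t.compound_field.field_1[2], t.compound_field.field_2[2]
from dfs.`/path/to/data.json` t;

A ZIP()-style UDF, as discussed below, would generalize this to arbitrary
lengths.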

Aman

On Thu, Apr 11, 2019 at 6:37 AM Charles Givre  wrote:

> That’s a good idea.  I’ll work on an equivalent ZIP() function and submit
> it as a separate PR.
> — C
>
> > On Apr 10, 2019, at 20:44, Paul Rogers 
> wrote:
> >
> > Hi Charles,
> >
> > In Python [1], the "zip" function does this task:
> >
> >
> > zip([1, 2, 3], [4, 5, 6]) --> [(1, 4), (2, 5), (3, 6)]
> >
> >
> > When you gathered the list of functions for the Drill book, did you come
> across anything like this in Drill? I presume you didn't, hence the
> question. I did a quick (incomplete) check and didn't see any likely
> candidates.
> >
> > Perhaps you could create such a function.
> >
> > Once you have the zipped result, you could flatten to get the pairs as
> rows.
> >
> >
> > Thanks,
> > - Paul
> >
> >
> >
> >On Wednesday, April 10, 2019, 5:26:10 PM PDT, Charles Givre <
> cgi...@gmail.com> wrote:
> >
> > Hello Drillers,
> > I have a query question for you.  I have some really ugly data that has
> a field like this:
> >
> > compound_field : { “field_1”: [1,2,3],
> > “field_2”: [4,5,6]
> > }
> >
> > I would like to map fields 1 and 2 to columns so that the end result is:
> >
> > field1 | field2
> > 1      | 4
> > 2      | 5
> > 3      | 6
> >
> > I thought flatten() would be the answer, however, if I flatten the
> columns, I get the following result:
> >
> > field1 | field2
> > 1      | 4
> > 1      | 5
> > 1      | 6
> >
> > Does anyone have any suggestions?
> > Thanks,
> > —C
>
>


Re: Issue faced in Apache drill

2019-04-09 Thread Aman Sinha
The last suggestion from Paul about CASTing to the desired type should work:
   SELECT a, SUM(CAST(b AS INT)) FROM
   dfs.`C:\\Users\\user\\Desktop\\sample.json` group by a;

I suggest filing a JIRA for the original query because, for some reason, if
all values are NULLs (and this is with group-by),
Drill's aggregate function code generator is defaulting to a varchar-specific
function.
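
If a non-NULL total is preferred for the all-NULL groups, a small variation on
the CAST workaround (just a sketch) is to wrap the aggregate itself:

   SELECT a, COALESCE(SUM(CAST(b AS INT)), 0) FROM
   dfs.`C:\\Users\\user\\Desktop\\sample.json` group by a;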

Aman

On Tue, Apr 9, 2019 at 4:58 PM Paul Rogers 
wrote:

> Hi Gayathri,
>
> If you only have the content shown, then note that your "b" columns are
> always null. Drill has no way to know what type b is supposed to be.
> Normally, Drill guesses Nullable Int.
>
> I wonder, have you turned on the "store.json.all_text_mode" session option
> to tell JSON to read all columns as VarChar? If so, that would explain why
> the error message says that b is VarChar.
>
> The team is in the process of adding a schema system to resolve this kind
> of ambiguity: you'll be able to say, "b is a DOUBLE. Even if it is all
> nulls, go ahead and treat it as double."
>
> In the mean time, you can try Rahul's suggestion. Or, you can ensure that
> the b column has actual data. Or, you can try casting the (NULL VarChar) b
> columns to the desired type.
>
> Thanks,
> - Paul
>
>
>
> On Tuesday, April 9, 2019, 4:32:13 PM PDT, Gayathri Selvaraj <
> gayathri.selvar...@gmail.com> wrote:
>
>  Hi Team,
>
>
> Facing some issues with the following case:
>
> Json file (*sample.json*) is having the following content:
> {"a":2,"b":null} {"a":2,"b":null} {"a":3,"b":null} {"a":4,"b":null}
>
> *Query:*
>
> SELECT a, sum(b) FROM dfs.`C:\\Users\\user\\Desktop\\sample.json`
> group by a;
>
> *Error:*
>
> UNSUPPORTED_OPERATION ERROR: Only COUNT, MIN and MAX aggregate functions
> supported for VarChar type
>
> *Observation:*
>
> If we query without using group by, then it is working fine without any
> error. If group by is used, then sum of null values is throwing the above
> error.
>
>
>
> Can anyone please let us know the solution for this or if there are any
> alternative. I have raised a JIRA ticket for the same -
> https://issues.apache.org/jira/browse/DRILL-7161
>
>
> Regards,
>
> Gayathri
>


Re: January Apache Drill board report

2019-01-31 Thread Aman Sinha
Thanks for putting this together, Arina.
The Drill Developer Day and Meetup were separate events, so you can split
them up.
  - A half-day Drill Developer Day was held on Nov 14.  A variety of
technical design issues were discussed.
  - A Drill user meetup was held on the same evening.  Two presentations were
given: one on a use case for Drill and one about indexing support in Drill.

Rest of the report LGTM.

-Aman


On Thu, Jan 31, 2019 at 7:58 AM Arina Ielchiieva  wrote:

> Hi all,
>
> please take a look at the draft board report for the last quarter and let
> me know if you have any comments.
>
> Thanks,
> Arina
>
> =
>
> ## Description:
>  - Drill is a Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud
> Storage.
>
> ## Issues:
>  - There are no issues requiring board attention at this time.
>
> ## Activity:
>  - Since the last board report, Drill has released version 1.15.0,
>including the following enhancements:
>- Add capability to do index based planning and execution
>- CROSS join support
>- INFORMATION_SCHEMA FILES and FUNCTIONS were added
>- Support for TIMESTAMPADD and TIMESTAMPDIFF functions
>- Ability to secure znodes with custom ACLs
>- Upgrade to SQLLine 1.6
>- Parquet filter pushdown for VARCHAR and DECIMAL data types
>- Support JPPD (Join Predicate Push Down)
>- Lateral join functionality was enabled by default
>- Multiple Web UI improvements to simplify the use of options and submit
> queries
>- Query performance with the semi-join functionality was improved
>- Support for aliases in the GROUP BY clause
>- Options to return null for empty string and prevents Drill from
> returning
>  a result set for DDL statements
>- Storage plugin names became case-insensitive
>
> - Drill developer meet up was held on November 14, 2018.
>
> ## Health report:
>  - The project is healthy. Development activity
>as reflected in the pull requests and JIRAs is good.
>  - Activity on the dev and user mailing lists are stable.
>  - Three committers were added in the last period.
>
> ## PMC changes:
>
>  - Currently 23 PMC members.
>  - No new PMC members added in the last 3 months
>  - Last PMC addition was Charles Givre on Mon Sep 03 2018
>
> ## Committer base changes:
>
>  - Currently 51 committers.
>  - New commmitters:
> - Hanumath Rao Maduri was added as a committer on Thu Nov 01 2018
> - Karthikeyan Manivannan was added as a committer on Fri Dec 07 2018
> - Salim Achouche was added as a committer on Mon Dec 17 2018
>
> ## Releases:
>
>  - 1.15.0 was released on Mon Dec 31 2018
>
> ## Mailing list activity:
>
>  - d...@drill.apache.org:
> - 415 subscribers (down -12 in the last 3 months):
> - 2066 emails sent to list (2653 in previous quarter)
>
>  - iss...@drill.apache.org:
> - 18 subscribers (up 0 in the last 3 months):
> - 2480 emails sent to list (3228 in previous quarter)
>
>  - user@drill.apache.org:
> - 592 subscribers (down -5 in the last 3 months):
> - 249 emails sent to list (310 in previous quarter)
>
>
> ## JIRA activity:
>
>  - 196 JIRA tickets created in the last 3 months
>  - 171 JIRA tickets closed/resolved in the last 3 months
>


Re: Nested Window Queries

2019-01-03 Thread Aman Sinha
John,   what's the full SQL query that you submitted ?
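
While waiting for the full query: a rewrite using common table expressions
sometimes plans where the directly nested form does not.  A rough sketch with
placeholder table/column names matching your description:

with changes as (
  select event_ts, state,
         lag(state, 1) over (order by event_ts) as prev_state
  from dfs.`/path/to/events`),
transitions as (
  select event_ts, state
  from changes
  where (state = 'charging' and prev_state = 'disconnected')
     or (state = 'complete' and prev_state = 'charging'))
select event_ts, state,
       lag(event_ts, 1) over (order by event_ts) as prev_event_ts
from transitions;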

On Thu, Jan 3, 2019 at 6:45 AM John Omernik  wrote:

> Is there a limitation on nesting of Window Queries?  I have a query
> where I am using an event stream, and the changing of a value indicates
> an event. (The state goes from disconnected, to charging, to complete, and it
> reports many times in each of those states; I am using lag(state, 1)
> over (order by event_ts) to find those records where the current record is,
> say, complete and the previous record was charging, to indicate that the
> state has changed.)
>
> This works well.
>
> Now I am trying to take that result set and do another window to find the
> delta time between when the charging started and when the charging
> completed. You would think that a nested query showing the state change
> times, and then a select lag(event_ts, 1) over (order by event_ts), would be
> able to get me both the current time (when the charging was complete) and
> the previous record's event_ts (when the charging started, as output by the
> subquery). However, I am getting a verbose, confusing error that perhaps we
> can help figure out...
>
> The first part is:
>
> SYSTEM ERROR: CannotPlanException: Node
> [rel#92358:Subset#11.LOGICAL.ANY([]).[3]] could not be implemented;
> planner state:
>
> Root: rel#92358:Subset#11.LOGICAL.ANY([]).[3]
> Original rel:
>
>
> The rest is pages of information related to the query, the data etc. I
> can send that separately if you are interested.
>
>
> I guess, before sending all of that, are nested window queries just
> not allowable?
>
>
> If that is the case, what are some alternative approaches?
>
>
> Thanks,
>
> John
>


Re: [VOTE] Apache Drill release 1.15.0 - RC0

2018-12-18 Thread Aman Sinha
@vita...@apache.org   any idea why there's an
extraneous directory in the source ?
drwxrwxr-x vitalii/vitalii   0 2018-12-18 03:48 apache-drill-1.15.0-src/${project.basedir}/
drwxrwxr-x vitalii/vitalii   0 2018-12-18 03:48 apache-drill-1.15.0-src/${project.basedir}/src/
drwxrwxr-x vitalii/vitalii   0 2018-12-18 03:48 apache-drill-1.15.0-src/${project.basedir}/src/site/
drwxrwxr-x vitalii/vitalii   0 2018-12-18 03:48 apache-drill-1.15.0-src/${project.basedir}/src/site/resources/
drwxrwxr-x vitalii/vitalii   0 2018-12-18 03:48 apache-drill-1.15.0-src/${project.basedir}/src/site/resources/repo/

On Tue, Dec 18, 2018 at 10:58 AM Vitalii Diravka  wrote:

> Hi all,
>
> I'd like to propose the first release candidate (RC0) of Apache Drill,
> version 1.15.0.
>
> The release candidate covers a total of 185 resolved JIRAs [1].
> Thanks to everyone who contributed to this release.
>
> The tarball artifacts are hosted at [2] and the maven artifacts are hosted
> at [3].
>
> This release candidate is based on commit
> ff797695de0e27a732c22e2410cbef58abbfcef3 located at [4].
>
> Please download and try out the release.
>
> The vote ends at 7:00 PM UTC (11:00 AM PDT, 9:00 PM EET, 0:30 AM IST), Dec
> 21th, 2018
>
> [ ] +1
> [ ] +0
> [ ] -1
>
> I have found two issues which were not observed earlier [5], [6]. Should we
> consider them as blockers?
>
>   [1]
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12343317==12313820
>
>   [2] http://home.apache.org/~vitalii/drill/releases/1.15.0/rc0/
>
>   [3]
> https://repository.apache.org/content/repositories/orgapachedrill-1058/
>
>   [4] https://github.com/vdiravka/drill/commits/drill-1.15.0
>
>
>   [5] https://issues.apache.org/jira/browse/DRILL-6912
>
>   [6] https://issues.apache.org/jira/browse/DRILL-6913
>
> Kind regards
> Vitalii
>


Re: [ANNOUNCE] New Committer: Salim Achouche

2018-12-17 Thread Aman Sinha
Congratulations Salim !  Thanks for your contributions !

Aman

On Mon, Dec 17, 2018 at 3:20 AM Vitalii Diravka  wrote:

> Congratulations Salim!
> Well deserved!
>
> Kind regards
> Vitalii
>
>
> On Mon, Dec 17, 2018 at 12:40 PM Arina Ielchiieva 
> wrote:
>
> > The Project Management Committee (PMC) for Apache Drill has invited Salim
> > Achouche to become a committer, and we are pleased to announce that he
> has
> > accepted.
> >
> > Salim Achouche [1] started contributing to the Drill project in 2017. He
> > has made many improvements for the parquet reader, including performance
> > for flat data types, columnar parquet batch sizing functionality, fixed
> > various bugs and memory leaks. He also optimized implicit columns
> handling
> > with scanner and improved sql pattern contains performance.
> >
> > Welcome Salim, and thank you for your contributions!
> >
> > - Arina
> > (on behalf of Drill PMC)
> >
>


Re: Is there any plan for drill to support Parquet Format version 2.5 added column indexes?

2018-12-12 Thread Aman Sinha
This seems quite interesting.  Drill does row group pruning, but doing
page-level pruning based on indexes would be a big win.
Also, as you may know, Drill recently added a feature to leverage secondary
indexes in NoSQL databases [1].  However, we have to see whether
that capability applies to the Parquet index since the Parquet index is
local to each file.

Please create a JIRA and add your input into it.  Thanks.

[1] https://issues.apache.org/jira/browse/DRILL-6381

On Wed, Dec 12, 2018 at 10:30 AM Lou kevin  wrote:

> Hi, I am a Drill user and use Parquet as the storage format.
> I know some new features have been added to the latest Parquet format.
> The new Parquet feature of column indexes seems very attractive; is
> there any plan for it to be supported in Drill?
>
> thanks very much!
>
> the feature detail:
> https://github.com/apache/parquet-format/blob/master/CHANGES.md#version-250
> See https://issues.apache.org/jira/browse/PARQUET-1201
>
> And the goals: make both range scans and point lookups I/O efficient by
> allowing direct access to pages based on their min and max values. In
> particular:
> 1.A single-row lookup in a rowgroup based on the sort column of that
> rowgroup will only read one data page per retrieved column. Range scans on
> the sort column will only need to read the exact data pages that contain
> relevant data.
> 2.Make other selective scans I/O efficient: if we have a very selective
> predicate on a non-sorting column, for the other retrieved columns we
> should only need to access data pages that contain matching rows.
> 3.No additional decoding effort for scans without selective predicates,
> e.g., full-row group scans. If a reader determines that it does not need to
> read the index data, it does not incur any overhead.
> 4.Index pages for sorted columns use minimal storage by storing only the
> boundary elements between pages.
>


Re: Apache Drill Meetup on Nov 14th!

2018-11-20 Thread Aman Sinha
Sharing the link to my Meetup slides:

https://www.slideshare.net/AmanSinha6/accelerating-sql-queries-in-nosql-databases-using-apache-drill-and-secondary-indexes

thanks,
Aman

On Sun, Nov 4, 2018 at 2:46 PM Pritesh Maker  wrote:

> Hello, Drillers!
>
> We are restarting meetups for Apache Drill! The next meet up will be on Nov
> 14th at 6:30 PM at the MapR Headquarters.
>
> We will have two speakers for the meetup
> - Nitin Sharma @ Netflix who will talk about Netflix's Personalization
> Infrastructure
> - Aman Sinha @ MapR who will talk about a brand new feature in Apache Drill
> to leverage Secondary Indexes
>
> You can find more details of their proposed talks at the meetup site.
> Please register soon since we have limited seating!
> https://www.meetup.com/Bay-Area-Apache-Drill-User-Group/events/255727785/
>
> We look forward to seeing you there!
>
> Thank you,
> Pritesh
>


Re: Hangout Discussion Topics

2018-11-12 Thread Aman Sinha
Since we are having the Drill Developer day on Wednesday, perhaps we can
skip the hangout tomorrow ?

Aman

On Mon, Nov 12, 2018 at 10:13 AM Timothy Farkas  wrote:

> Hi All,
>
> Does anyone have any topics to discuss during the hangout tomorrow?
>
> Thanks,
> Tim
>


Re: [ANNOUNCE] New Committer: Hanumath Rao Maduri

2018-11-01 Thread Aman Sinha
Congratulations Hanumath !

Aman

On Thu, Nov 1, 2018 at 11:39 AM Paul Rogers 
wrote:

> Congratulations Hanu!
>
> - Paul
>
> Sent from my iPhone
>
> > On Nov 1, 2018, at 11:09 AM, Kunal Khatua  wrote:
> >
> > Congratulations, Hanu!
> > On 11/1/2018 11:04:58 AM, Abhishek Girish  wrote:
> > Congratulations, Hanu!
> >
> >> On Thu, Nov 1, 2018 at 10:56 AM Khurram Faraaz wrote:
> >>
> >> Congratulations Hanu!
> >>
> >>> On Thu, Nov 1, 2018 at 10:14 AM Gautam Parai wrote:
> >>>
> >>> Congratulations Hanumath! Well deserved :)
> >>>
> >>> Gautam
> >>>
> >>> On Thu, Nov 1, 2018 at 9:44 AM AnilKumar B
> >> wrote:
> >>>
>  Congratulations Hanumath.
> 
>  Thanks & Regards,
>  B Anil Kumar.
> 
> 
>  On Thu, Nov 1, 2018 at 9:39 AM Vitalii Diravka
> >>> wrote:
> 
> > Congratulations!
> >
> > Kind regards
> > Vitalii
> >
> >
> > On Thu, Nov 1, 2018 at 5:43 PM salim achouche
> > wrote:
> >
> >> Congrats Hanu!
> >>
> >> On Thu, Nov 1, 2018 at 6:05 AM Arina Ielchiieva
> > wrote:
> >>
> >>> The Project Management Committee (PMC) for Apache Drill has
> >> invited
> >>> Hanumath
> >>> Rao Maduri to become a committer, and we are pleased to announce
> >>> that
> > he
> >>> has accepted.
> >>>
> >>> Hanumath became a contributor in 2017, making changes mostly in
> >> the
> > Drill
> >>> planning side, including lateral / unnest support. He is also one
> >>> of
> > the
> >>> contributors of index based planning and execution support.
> >>>
> >>> Welcome Hanumath, and thank you for your contributions!
> >>>
> >>> - Arina
> >>> (on behalf of Drill PMC)
> >>>
> >>
> >>
> >> --
> >> Regards,
> >> Salim
> >>
> >
> 
> >>>
> >>
>


Re: November Apache Drill board report

2018-11-01 Thread Aman Sinha
   Docket container  ==> 'Docker'
   November 14, 2019  ==>  2018  :)   (this is wrong in the email that was
sent out)

Rest LGTM.

On Thu, Nov 1, 2018 at 6:42 AM Arina Ielchiieva  wrote:

> Hi all,
>
> please take a look at the draft board report for the last quarter and let
> me know if you have any comments.
>
> Thanks,
> Arina
>
> =
>
>  ## Description:
>  - Drill is a Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud
> Storage.
>
> ## Issues:
>  - There are no issues requiring board attention at this time.
>
> ## Activity:
>  - Since the last board report, Drill has released version 1.14.0,
>including the following enhancements:
> - Drill in a Docket container
> - Image metadata format plugin
> - Upgrade to Calcite 1.16.0
> - Kafka plugin push down support
> - Phonetic and String functions
> - Enhanced decimal data support
> - Spill to disk for the Hash Join support
> - CGROUPs resource management support
>  - There were active discussions about schema provision in Drill.
>Based on these discussions two projects are currently evolving:
>Drill metastore and schema provision in the file and in a query.
>  - Apache Drill book has been written by two PMC members (Charles and
> Paul).
>  - Drill developer meet up will be held on November 14, 2019.
>The following areas are going to be discussed:
> - Storage plugins
> - Schema discovery & Evolution
> - Metadata Management
> - Resource management
> - Integration with Apache Arrow
>
> ## Health report:
>  - The project is healthy. Development activity
>as reflected in the pull requests and JIRAs is good.
>  - Activity on the dev and user mailing lists are stable.
>  - Three committers and three new PMC members were added in the last period.
>
> ## PMC changes:
>
>  - Currently 23 PMC members.
>  - New PMC members:
> - Boaz Ben-Zvi was added to the PMC on Fri Aug 17 2018
> - Charles Givre was added to the PMC on Mon Sep 03 2018
> - Vova Vysotskyi was added to the PMC on Fri Aug 24 2018
>
> ## Committer base changes:
>
>  - Currently 48 committers.
>  - New committers:
> - Chunhui Shi was added as a committer on Thu Sep 27 2018
> - Gautam Parai was added as a committer on Mon Oct 22 2018
> - Weijie Tong was added as a committer on Fri Aug 31 2018
>
> ## Releases:
>
>  - 1.14.0 was released on Sat Aug 04 2018
>
> ## Mailing list activity:
>
>  - d...@drill.apache.org:
> - 427 subscribers (down -6 in the last 3 months):
> - 2827 emails sent to list (2126 in previous quarter)
>
>  - iss...@drill.apache.org:
> - 18 subscribers (down -1 in the last 3 months):
> - 3487 emails sent to list (4769 in previous quarter)
>
>  - user@drill.apache.org:
> - 597 subscribers (down -6 in the last 3 months):
> - 332 emails sent to list (346 in previous quarter)
>
>
> ## JIRA activity:
>
>  - 164 JIRA tickets created in the last 3 months
>  - 128 JIRA tickets closed/resolved in the last 3 months
>


Re: [HANGOUT] [new link] Topics for October 02 2018

2018-10-13 Thread Aman Sinha
Please use this link:  https://www.slideshare.net/secret/zMZIrpM5qKV5pI
I forgot the apache mailing lists block the attachments.

Aman


On Sat, Oct 13, 2018 at 5:20 PM Aman Sinha  wrote:

> On my gmail account it shows the attachment was sent.  I am re-attaching
> and sending.
>
> Aman
>
> On Sat, Oct 13, 2018 at 3:38 PM Chunhui Shi 
> wrote:
>
>> Hi Aman, are you going to send out the slides in another email?
>>
>> Regards,
>> Chunhui
>> --
>> From:Aman Sinha 
>> Send Time:2018 Oct 12 (Fri) 10:59
>> To:user ; dev 
>> Subject:Re: [HANGOUT] [new link] Topics for October 02 2018
>>
>> Attached is a PDF version of the slides.  Unfortunately, I don't have a
>> recording.
>>
>> thanks,
>> Aman
>>
>>
>> On Thu, Oct 11, 2018 at 9:39 AM Pritesh Maker  wrote:
>> Divya -  anyone is welcome to join the hangout! Aman will be sharing the
>>  slides shortly. We use Google Hangouts which doesn't have the option to
>>  record the session.
>>
>>  On Thu, Oct 11, 2018 at 1:06 AM Divya Gehlot 
>>  wrote:
>>
>>  > Can we have the recordings of the talk for the benefit of the other
>> drill
>>  > users in the community or it is a closed affair ?
>>  >
>>  >
>>  > Thanks,
>>  > Divya
>>  >
>>  > On Sat, 29 Sep 2018 at 05:13, Karthikeyan Manivannan <
>> kmanivan...@mapr.com
>>  > >
>>  > wrote:
>>  >
>>  > > Hi,
>>  > >
>>  > > We will have a Drill Hangout on October 2 2018 at 10 AM Pacific Time.
>>  > > Please suggest topics by replying to this thread.
>>  > >
>>  > > We now have a ==new Hangout link== that supports 25 participants
>>  > >
>>  >
>> http://meet.google.com/yki-iqdf-tai
>>  > >
>>  > > Please note that in this Hangout, non-MapR participants will have to
>> wait
>>  > > to be let into the call by a MapR participant. Sorry for the
>>  > inconvenience.
>>  > >
>>  > > Thanks.
>>  > >
>>  > > Karthik
>>  > >
>>  >
>>
>


Re: [HANGOUT] [new link] Topics for October 02 2018

2018-10-13 Thread Aman Sinha
On my gmail account it shows the attachment was sent.  I am re-attaching
and sending.

Aman

On Sat, Oct 13, 2018 at 3:38 PM Chunhui Shi 
wrote:

> Hi Aman, are you going to send out the slides in another email?
>
> Regards,
> Chunhui
> --
> From:Aman Sinha 
> Send Time:2018 Oct 12 (Fri) 10:59
> To:user ; dev 
> Subject:Re: [HANGOUT] [new link] Topics for October 02 2018
>
> Attached is a PDF version of the slides.  Unfortunately, I don't have a
> recording.
>
> thanks,
> Aman
>
>
> On Thu, Oct 11, 2018 at 9:39 AM Pritesh Maker  wrote:
> Divya -  anyone is welcome to join the hangout! Aman will be sharing the
>  slides shortly. We use Google Hangouts which doesn't have the option to
>  record the session.
>
>  On Thu, Oct 11, 2018 at 1:06 AM Divya Gehlot 
>  wrote:
>
>  > Can we have the recordings of the talk for the benefit of the other
> drill
>  > users in the community or it is a closed affair ?
>  >
>  >
>  > Thanks,
>  > Divya
>  >
>  > On Sat, 29 Sep 2018 at 05:13, Karthikeyan Manivannan <
> kmanivan...@mapr.com
>  > >
>  > wrote:
>  >
>  > > Hi,
>  > >
>  > > We will have a Drill Hangout on October 2 2018 at 10 AM Pacific Time.
>  > > Please suggest topics by replying to this thread.
>  > >
>  > > We now have a ==new Hangout link== that supports 25 participants
>  > >
>  >
> http://meet.google.com/yki-iqdf-tai
>  > >
>  > > Please note that in this Hangout, non-MapR participants will have to
> wait
>  > > to be let into the call by a MapR participant. Sorry for the
>  > inconvenience.
>  > >
>  > > Thanks.
>  > >
>  > > Karthik
>  > >
>  >
>


Re: [HANGOUT] [new link] Topics for October 02 2018

2018-10-12 Thread Aman Sinha
Attached is a PDF version of the slides.  Unfortunately, I don't have a
recording.

thanks,
Aman


On Thu, Oct 11, 2018 at 9:39 AM Pritesh Maker  wrote:

> Divya -  anyone is welcome to join the hangout! Aman will be sharing the
> slides shortly. We use Google Hangouts which doesn't have the option to
> record the session.
>
> On Thu, Oct 11, 2018 at 1:06 AM Divya Gehlot 
> wrote:
>
> > Can we have the recordings of the talk for the benefit of the other drill
> > users in the community or it is a closed affair ?
> >
> >
> > Thanks,
> > Divya
> >
> > On Sat, 29 Sep 2018 at 05:13, Karthikeyan Manivannan <
> kmanivan...@mapr.com
> > >
> > wrote:
> >
> > > Hi,
> > >
> > > We will have a Drill Hangout on October 2 2018 at 10 AM Pacific Time.
> > > Please suggest topics by replying to this thread.
> > >
> > > We now have a ==new Hangout link== that supports 25 participants
> > >
> >
> http://meet.google.com/yki-iqdf-tai
> > >
> > > Please note that in this Hangout, non-MapR participants will have to
> wait
> > > to be let into the call by a MapR participant. Sorry for the
> > inconvenience.
> > >
> > > Thanks.
> > >
> > > Karthik
> > >
> >
>


Re: [HANGOUT] [new link] Topics for October 02 2018

2018-09-30 Thread Aman Sinha
I can talk about the index planning and execution feature [1] that is
currently in review [2].

[1[ https://issues.apache.org/jira/browse/DRILL-6381
[2] https://github.com/apache/drill/pull/1466

On Fri, Sep 28, 2018 at 2:13 PM Karthikeyan Manivannan 
wrote:

> Hi,
>
> We will have a Drill Hangout on October 2 2018 at 10 AM Pacific Time.
> Please suggest topics by replying to this thread.
>
> We now have a ==new Hangout link== that supports 25 participants
> http://meet.google.com/yki-iqdf-tai
>
> Please note that in this Hangout, non-MapR participants will have to wait
> to be let into the call by a MapR participant. Sorry for the inconvenience.
>
> Thanks.
>
> Karthik
>


Re: Drill Hangout tomorrow 06/26

2018-06-26 Thread Aman Sinha
Hangout attendees on 06/26:
Padma, Hanumath, Boaz, Aman, Jyothsna, Sorabh, Arina, Bohdan, Vitalii,
Volodymyr, Abhishek, Robert

2 topics were discussed:
1.  Vitalii brought up the Travis timeout issue, for which he has sent out
an email in this thread.  Actually, Vitalii, can you send it in a separate
email with an explicit subject?  Otherwise people may miss it.
2. Padma went over the batch sizing work and its current status.  Padma,
please add a link to your document.  Summarizing some of the discussion:

   - Does batch sizing affect output batches only or internal batches also
   ?  For certain operators such as HashAgg it does affect the internal
   batches held in the hash table since these batches are transferred as-is to
   the output container.
   - The 16 MB limit on the batch size is a best effort but in some cases it
   could be slightly exceeded.  The number of rows per output batch is
   estimated as the nearest lower power of 2.  For example, if based on the
   input batch size the number of output rows is 600, it will be rounded down
   to 512 (see the sketch after this list).
   - An optimization could be done in future to have upstream operator
   provide the batch size information in metadata instead of downstream
   operator computing it for each incoming.
   - There was discussion on estimating the size of complex type columns
   especially ones with nesting levels.  It would be good to add details in
   the document.
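
A minimal sketch of how the batch size target can be inspected and tuned per
session.  The option name below comes from the batch sizing work and may
differ across versions, so treat it as an assumption:

-- check the current setting (option name assumed)
SELECT * FROM sys.options WHERE name LIKE '%output_batch_size%';
-- lower the target from the 16 MB default to 8 MB (value in bytes)
ALTER SESSION SET `drill.exec.memory.operator.output_batch_size` = 8388608;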


-Aman

On Tue, Jun 26, 2018 at 10:48 AM Vitalii Diravka 
wrote:

> Lately the Drill Travis build fails more often because the Travis job time
> limit expires.
> The right way is to accelerate Drill execution :)
>
> Nevertheless I believe we should consider excluding some more tests from
> Travis Build.
> We can add all TPCH tests (
> TestTpchLimit0, TestTpchExplain, TestTpchPlanning, TestTpchExplain) to the
> SlowTest category.
>
> Is there another solution for this issue? What other tests are executed
> very slowly?
>
> Kind regards
> Vitalii
>
>
> On Tue, Jun 26, 2018 at 3:34 AM Aman Sinha  wrote:
>
> > We'll have the Drill hangout tomorrow Jun26th, 2018 at 10:00 PDT.
> >
> > If you have any topics to discuss, send a reply to this post or just join
> > the hangout.
> >
> > ( Drill hangout link
> > <https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc> )
> >
>


Drill Hangout tomorrow 06/26

2018-06-25 Thread Aman Sinha
We'll have the Drill hangout tomorrow Jun26th, 2018 at 10:00 PDT.

If you have any topics to discuss, send a reply to this post or just join
the hangout.

( Drill hangout link
<https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc> )


Re: [DISCUSS] case insensitive storage plugin and workspaces names

2018-06-12 Thread Aman Sinha
Yes, that seems ok to me...since the plugin name and workspace are logical
entities and don't correspond to a path.
There could be compatibility issues if certain users have relied on the
case-sensitive names, but those would be temporary.

Aman

On Tue, Jun 12, 2018 at 8:35 AM, Arina Yelchiyeva <
arina.yelchiy...@gmail.com> wrote:

> To make it clear we have three notions here: storage plugin name, workspace
> (schema) and table name (dfs.root.`/tmp/t`).
> My suggestion is the following:
> Storage plugin names to be case insensitive (DFS vs dfs, INFORMATION_SCHEMA
> vs information_schema).
> Workspace (schema) names to be case insensitive (ROOT vs root, TMP vs
> tmp). Even if a user has two directories /TMP and /tmp, they can create two
> workspaces, but not both named tmp. For example, tmp vs tmp_u.
> Table name case sensitivity is treated per plugin. For example, system
> plugins (information_schema, sys) table names (views, tables) should be
> case insensitive. Actually, currently sys plugin table names are case
> insensitive while information_schema table names are case sensitive. That
> needs to be synchronized. For file system plugins table names must be case
> sensitive, since a table name maps to a directory / file name, and their
> case sensitivity depends on the file system.
>
> Kind regards,
> Arina
>
> On Tue, Jun 12, 2018 at 6:13 PM Aman Sinha  wrote:
>
> > Drill is dependent on the underlying file system's case sensitivity.  On
> > HDFS one can create  'hadoop fs -mkdir /tmp/TPCH'  and /tmp/tpch which
> are
> > separate directories.
> > These could be set as workspace in Drill's storage plugin configuration
> and
> > we would want the ability to query both.   If we change the current
> > behavior, we would want
> > some way, either using back-quotes `  or other way to support that.
> >
> > RDBMSs seem to have vendor-specific behavior...
> > In MySQL [1] the database name and schema name are case-sensitive on
> Linux
> > and case-insensitive on Windows.   Whereas in Postgres it converts the
> > database name and schema name to lower-case by default but one can put
> > double-quotes to make it case-sensitive [2].
> >
> > [1]
> > https://dev.mysql.com/doc/refman/8.0/en/identifier-case-sensitivity.html
> > [2]
> > http://www.postgresqlforbeginners.com/2010/11/gotcha-case-
> sensitivity.html
> >
> >
> >
> > On Tue, Jun 12, 2018 at 5:01 AM, Arina Yelchiyeva <
> > arina.yelchiy...@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > Currently Drill treats storage plugin names and workspaces as
> > > case-sensitive [1].
> > > Names for storage plugins and workspaces are defined by the user. So we
> > > allow to create plugin -> DFS and dfs, workspace -> tmp and TMP.
> > > I have a suggestion to move to a case-insensitive approach and not allow
> > > creating two plugins / workspaces with the same name in different case
> at
> > > least for the following reasons:
> > > 1. usually rdbms schema and table names are case insensitive and many
> > users
> > > are used to this approach;
> > > 2. in Drill we have INFORMATION_SCHEMA schema which is in upper case,
> sys
> > > in lower case.
> > > personally I find it's extremely inconvenient.
> > >
> > > Also we should consider making table names case insensitive for system
> > > schemas (info, sys).
> > >
> > > Any thoughts?
> > >
> > > [1] https://drill.apache.org/docs/lexical-structure/
> > >
> > >
> > > Kind regards,
> > > Arina
> > >
> >
>


Re: [DISCUSS] case insensitive storage plugin and workspaces names

2018-06-12 Thread Aman Sinha
Drill is dependent on the underlying file system's case sensitivity.  On
HDFS one can create  'hadoop fs -mkdir /tmp/TPCH'  and /tmp/tpch which are
separate directories.
These could be set as workspace in Drill's storage plugin configuration and
we would want the ability to query both.   If we change the current
behavior, we would want
some way, either using back-quotes `  or other way to support that.

RDBMSs seem to have vendor-specific behavior...
In MySQL [1] the database name and schema name are case-sensitive on Linux
and case-insensitive on Windows.   Whereas in Postgres it converts the
database name and schema name to lower-case by default but one can put
double-quotes to make it case-sensitive [2].

[1] https://dev.mysql.com/doc/refman/8.0/en/identifier-case-sensitivity.html
[2]
http://www.postgresqlforbeginners.com/2010/11/gotcha-case-sensitivity.html
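
A small illustration of the Postgres behavior for comparison (the schema
names are made up for the example):

-- unquoted identifiers fold to lower case, so these resolve to one schema:
CREATE SCHEMA Tpch;            -- stored as tpch
SELECT * FROM TPCH.lineitem;   -- resolves to tpch.lineitem
-- double quotes preserve case and name a distinct, case-sensitive schema:
CREATE SCHEMA "TPCH";
SELECT * FROM "TPCH".lineitem;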



On Tue, Jun 12, 2018 at 5:01 AM, Arina Yelchiyeva <
arina.yelchiy...@gmail.com> wrote:

> Hi all,
>
> Currently Drill treats storage plugin names and workspaces as
> case-sensitive [1].
> Names for storage plugins and workspaces are defined by the user. So we
> allow to create plugin -> DFS and dfs, workspace -> tmp and TMP.
> I have a suggestion to move to a case-insensitive approach and not allow
> creating two plugins / workspaces with the same name in different case at
> least for the following reasons:
> 1. usually rdbms schema and table names are case insensitive and many users
> are used to this approach;
> 2. in Drill we have INFORMATION_SCHEMA schema which is in upper case, sys
> in lower case.
> personally I find it's extremely inconvenient.
>
> Also we should consider making table names case insensitive for system
> schemas (info, sys).
>
> Any thoughts?
>
> [1] https://drill.apache.org/docs/lexical-structure/
>
>
> Kind regards,
> Arina
>


Re: question about views

2018-03-19 Thread Aman Sinha
Due to an infinite loop occurring in Calcite planning, we had to disable
the filter pushdown past the union (SetOps).  See
https://issues.apache.org/jira/browse/DRILL-3855.
Now that we have rebased on Calcite 1.15.0, we should re-enable this and
test it; if the pushdown works, then the partition pruning on both sides of
the union should work automatically after that.

Will follow-up on this..

-Aman
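
Once re-enabled, the behavior can be verified with EXPLAIN.  A sketch with
hypothetical view and table names; with the pushdown working, the Filter
should appear below the UnionAll (or be absorbed into each Scan) rather than
sit above it:

CREATE VIEW dfs.tmp.lineorder_v AS
  SELECT lo_orderdate, lo_revenue FROM dfs.tmp.archive_parquet
  UNION ALL
  SELECT lo_orderdate, lo_revenue FROM mongo.test.recent;

EXPLAIN PLAN FOR
SELECT * FROM dfs.tmp.lineorder_v WHERE lo_orderdate = 19930101;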

On Mon, Mar 19, 2018 at 3:02 PM, Kunal Khatua  wrote:

> I think Ted's question is 2 fold, with the former being more important.
> 1. Can we push filters past a union.
> 2. Will Drill push filters down to the source.
>
> For the latter, it depends on the source.
> For the former, it depends primarily on whether Calcite supports this. I
> haven't tried it, so I can't say.
>
> On 3/19/2018 2:22:54 PM, rahul challapalli 
> wrote:
> First I would suggest ignoring the view and trying out a query which has the
> required filters as part of the subqueries on both sides of the union (for
> both the database and partitioned parquet data). The plan for such a query
> should have the answers to your question. If both the subqueries
> independently prune out un-necessary data, using partitions or indexes, I
> don't think adding a union between them would alter that behavior.
>
> -Rahul
>
> On Mon, Mar 19, 2018 at 1:44 PM, Ted Dunning wrote:
>
> > IF I create a view that is a union of partitioned parquet files and a
> > database that has secondary indexes, will Drill be able to properly push
> > down query limits into both parts of the union?
> >
> > In particular, if I have lots of archival data and parquet partitioned by
> > time but my query only asks for recent data that is in the database, will
> > the query avoid the parquet files entirely (as you would wish)?
> >
> > Conversely, if the data I am asking for is entirely in the archive, will
> > the query make use of the partitioning on my parquet files correctly?
> >
>


Re: Accessing underlying scheme of input

2018-03-02 Thread Aman Sinha
Looks like the formatting got messed up for the query output.  Here it is
in simple text format:

// My original JSON data consisting of maps and arrays:

0: jdbc:drill:drillbit=10.10.101.41> select * from t2.json;

+-+---+---+
| a1  |  b1   |c1 |
+-+---+---+
| 10  | {"id":[1,2]}  | {"d1":5,"e1":{"f1":["CA","TX"]}}  |
+-+---+---+



0: jdbc:drill:drillbit=10.10.101.41> create table dfs.tmp.tt3 as select *
from t2.json;

+---++
| Fragment  | Number of records written  |
+---++
| 0_0   | 1  |
+---++


// Analyze using parquet-tools

[root@aman1 ~]# java -jar  parquet-tools/target/parquet-tools-1.9.0.jar
schema tt3/0_0_0.parquet

message root {
  optional int64 a1;
  optional group b1 {
repeated int64 id;
  }
  optional group c1 {
optional int64 d1;
optional group e1 {
  repeated binary f1 (UTF8);
}
  }
}


On Fri, Mar 2, 2018 at 11:48 AM, Aman Sinha <amansi...@apache.org> wrote:

> Erol,
> yes indeed Drill is internally creating the schema for Json data.  The top
> level field's data type can be found using the TYPEOF(column) function that
> Gautam mentioned earlier.
> However, I understand you are looking for the nested schema as well, so I
> would recommend the following approach:
>
> - Run a CTAS and write a small sample of the table out in Parquet format.
> - Run parquet-tools [1] with the 'schema' option to explore the schema.
> - NOTE: Although this will show the Parquet schema, it at least will give
> an idea of the original JSON schema.
>
> Here's an example:
>
> // My original JSON data consisting of maps and arrays:
>
> 0: jdbc:drill:drillbit=10.10.101.41> select * from t2.json;
>
> +-+---+---+
> | a1  |  b1   |c1 |
> +-+---+---+
> | 10  | {"id":[1,2]}  | {"d1":5,"e1":{"f1":["CA","TX"]}}  |
> +-+---+---+
>
> 0: jdbc:drill:drillbit=10.10.101.41> create table dfs.tmp.tt3 as select *
> from t2.json;
>
> +---++
> | Fragment  | Number of records written  |
> +---++
> | 0_0   | 1  |
> +---++
>
> // Analyze using parquet-tools
>
> [root@aman1 ~]# java -jar  parquet-tools/target/parquet-tools-1.9.0.jar
> schema tt3/0_0_0.parquet
>
> message root {
>   optional int64 a1;
>   optional group b1 {
>     repeated int64 id;
>   }
>   optional group c1 {
>     optional int64 d1;
>     optional group e1 {
>       repeated binary f1 (UTF8);
>     }
>   }
> }
>
> The meaning of these types is defined here [2].
>
>
> [1] http://parquet.apache.org/downloads/
> [2] https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
>
> On Thu, Mar 1, 2018 at 6:46 PM, Erol Akarsu <eaka...@gmail.com> wrote:
>
>> Boaz,
>>
>> Thanks for the full explanation.
>> I got the impression that Apache Drill is internally creating a JSON
>> schema for the input type. I thought we would be able to look deep into
>> that schema. Maybe it is not at the "describe" level, but at the API level
>> we can achieve it. What do you think?
>>
>> Erol Akarsu
>>
>> On Thu, Mar 1, 2018 at 7:04 PM, Boaz Ben-Zvi <bben-...@mapr.com> wrote:
>>
>> > From the docs (https://drill.apache.org/docs/describe/): “Currently,
>> > DESCRIBE does not support tables created in a file system.”
>> > Seems that it only works well for Hive and HBase tables.
>> >
>> > The create view statement does not explore the actual schema of the
>> > query’s table(s); it only parses and keeps the query.
>> > Hence it can not tell what ‘*’ (or any list of column names) means:
>> >
>> >  0: jdbc:drill:zk=local> create view pview as select * from
>> > dfs.`/data/PARQUET-1M.parquet`;
>> > +---+---+
>> > |  ok   |summary|
>> > +---+--

Re: Accessing underlying scheme of input

2018-03-02 Thread Aman Sinha
Erol,
yes indeed Drill is internally creating the schema for Json data.  The top
level field's data type can be found using the TYPEOF(column) function that
Gautam mentioned earlier.
However, I understand you are looking for the nested schema as well, so I
would recommend the following approach:

- Run a CTAS and write a small sample of the table out in Parquet format.
- Run parquet-tools [1] with the 'schema' option to explore the schema.
- NOTE: Although this will show the Parquet schema, it at least will give
an idea of the original JSON schema.

Here's an example:

// My original JSON data consisting of maps and arrays:

0: jdbc:drill:drillbit=10.10.101.41> select * from t2.json;

+-+---+---+
| a1  |  b1   |c1 |
+-+---+---+
| 10  | {"id":[1,2]}  | {"d1":5,"e1":{"f1":["CA","TX"]}}  |
+-+---+---+


0: jdbc:drill:drillbit=10.10.101.41> create table dfs.tmp.tt3 as select *
from t2.json;

+---++
| Fragment  | Number of records written  |
+---++
| 0_0   | 1  |
+---++


// Analyze using parquet-tools

[root@aman1 ~]# java -jar  parquet-tools/target/parquet-tools-1.9.0.jar
schema tt3/0_0_0.parquet

message root {
  optional int64 a1;
  optional group b1 {
    repeated int64 id;
  }
  optional group c1 {
    optional int64 d1;
    optional group e1 {
      repeated binary f1 (UTF8);
    }
  }
}

The meaning of these types is defined here [2].


[1] http://parquet.apache.org/downloads/
[2] https://github.com/apache/parquet-format/blob/master/LogicalTypes.md

On Thu, Mar 1, 2018 at 6:46 PM, Erol Akarsu  wrote:

> Boaz,
>
> Thanks for the full explanation.
> I got the impression that Apache Drill is internally creating a JSON schema
> for the input type. I thought we would be able to look deep into that
> schema. Maybe it is not at the "describe" level, but at the API level we
> can achieve it. What do you think?
>
> Erol Akarsu
>
> On Thu, Mar 1, 2018 at 7:04 PM, Boaz Ben-Zvi  wrote:
>
> > From the docs (https://drill.apache.org/docs/describe/): “Currently,
> > DESCRIBE does not support tables created in a file system.”
> > Seems that it only works well for Hive and HBase tables.
> >
> > The create view statement does not explore the actual schema of the
> > query’s table(s); it only parses and keeps the query.
> > Hence it can not tell what ‘*’ (or any list of column names) means:
> >
> >  0: jdbc:drill:zk=local> create view pview as select * from
> > dfs.`/data/PARQUET-1M.parquet`;
> > +---+---+
> > |  ok   |summary|
> > +---+---+
> > | true  | View 'pview' created successfully in 'dfs.tmp' schema  |
> > +---+---+
> > 1 row selected (0.274 seconds)
> > 0: jdbc:drill:zk=local> describe pview;
> > +--+---+--+
> > | COLUMN_NAME  |   DATA_TYPE   | IS_NULLABLE  |
> > +--+---+--+
> > | **   | DYNAMIC_STAR  | YES  |
> > +--+---+--+
> > 1 row selected (0.419 seconds)
> > 0: jdbc:drill:zk=local> create view pview2 as select
> > row_count,just_garbage from  dfs.`/data/PARQUET-1M.parquet`;
> > +---+-+
> > |  ok   | summary |
> > +---+-+
> > | true  | View 'pview2' created successfully in 'dfs.tmp' schema  |
> > +---+-+
> > 1 row selected (0.3 seconds)
> > 0: jdbc:drill:zk=local> describe pview2;
> > +---++--+
> > |  COLUMN_NAME  | DATA_TYPE  | IS_NULLABLE  |
> > +---++--+
> > | row_count | ANY| YES  |
> > | just_garbage  | ANY| YES  |
> > +---++--+
> > 2 rows selected (0.533 seconds)
> > 0: jdbc:drill:zk=local>
> >
> > Thanks,
> >
> >   Boaz
> >
> > On 3/1/18, 3:05 PM, "Padma Penumarthy"  wrote:
> >
> > Not sure why it is not showing the fields. It does not work for me
> > either.
> > Does anyone know more ? Is this broken ?
> >
> > Thanks
> > Padma
> >
> > > On Mar 1, 2018, at 2:54 PM, Erol Akarsu  wrote:
> > >
> > > Somehow, after "user dfs.tmp", I was able to create view. But
> > "describe"
> > > for view does not give much information. I 

Re: Which Hadoop File Format Should I Use?

2018-02-07 Thread Aman Sinha
The multi-level indexing feature in Carbondata seems very interesting...it
will allow persisting OLAP cubes and provide efficient access; virtually
providing the capability that specialized OLAP engines provide.   The ORC
format also provides indexing, but apparently not multi-level indexing.

Another promising use is for secondary indexing.  Basically, making the
file format competitive with NoSQL systems that support secondary indexes.

On Wed, Feb 7, 2018 at 2:08 PM, Ted Dunning  wrote:

> Carbondata does look very cool, but I haven't seen any significant user
> adoption which means that I haven't heard very many war stories.
>
>
>
> On Wed, Feb 7, 2018 at 11:58 AM, Saurabh Mahapatra <
> saurabhmahapatr...@gmail.com> wrote:
>
> > ...
> > The Carbondata project looks quite promising.
> >
> > Any thoughts on what file format you prefer?
> >
>


Re: Google Hangouts: Lateral Join High Level Design Presentation

2018-02-06 Thread Aman Sinha
It looks like Volodymyr also had a topic: Decimal types support.  He is
starting with that.
I am not sure if there is going to be sufficient time to cover 2 topics
today...

On Tue, Feb 6, 2018 at 10:00 AM, Timothy Farkas  wrote:

> Google Hangout Reminder.
>
> https://hangouts.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc
>
>
> 
> From: Timothy Farkas 
> Sent: Monday, February 5, 2018 12:35:22 PM
> To: d...@drill.apache.org; user@drill.apache.org
> Subject: Google Hangouts: Lateral Join High Level Design Presentation
>
> Hi All,
>
> Aman and Sorabh will be talking about the high level design of lateral
> join in the next hangout session tomorrow. Since lateral join is a big
> topic they'll talk about more of the details of the design after Parth
> comes back in another hangout session.
>
> Thanks,
> Tim
>


Re: Does Drill support Full-text search as in Elasticsearch

2017-12-06 Thread Aman Sinha
Hi,
no, currently Drill does not support full-text search and it is unlikely it
will do so natively in the near future.  Instead, since other products such
as Elasticsearch are specialized for it, it would be better to add an ES
storage plugin in Drill to push down such full-text search filters into
ES.  There was some effort in the past by the community to develop that but
it did not make it into the product.

-Aman

On Tue, Dec 5, 2017 at 5:08 AM, Ayush Pandey 
wrote:

> Hello everyone,
> I am trying to benchmark Apache Drill vs. Elasticsearch. The first
> question that I am trying to answer is whether Apache Drill supports
> full-text search, as is the case with Elasticsearch.
>
> I have dumped data in Drill in Apache Parquet format.


Re: How to verify predicate pushdown

2017-10-26 Thread Aman Sinha
For simple AND-ed predicates, if the entire filter is eligible for pushdown
then after pushdown the filter would be dropped, so the Explain plan will
not show a Filter node.  In this case, I would expect the row count of the
Scan to be reduced.  For complex predicates, e.g. a combination of ANDs and
ORs, or ones which involve expressions such as CAST etc., it is possible
some part of the filter would be pushed into the Scan but the Filter node
will remain.   In this case, the row count of the Scan would not be reduced
to avoid doing the reduction twice.


-Aman
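
A quick way to check both cases, using a hypothetical Parquet file and
columns:

-- simple AND-ed predicate: expect no Filter node and a reduced Scan rowcount
EXPLAIN PLAN FOR
SELECT * FROM dfs.`/data/t.parquet` WHERE a = 10 AND b = 'x';

-- mixed predicate: expect a residual Filter node to remain above the Scan
EXPLAIN PLAN FOR
SELECT * FROM dfs.`/data/t.parquet` WHERE a = 10 OR CAST(b AS INT) = 5;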


On Thu, Oct 26, 2017 at 5:33 AM, PROJJWAL SAHA  wrote:

> Hello,
>
> One question,
> How to verify whether predicate pushdown is happening ?
>
> I have one Parquet file generated using a CTAS command. I have executed
> REFRESH METADATA.  I am firing a simple query with a WHERE clause. In the
> physical plan for the scan operation, I see the rowcount as the total
> number of rows in the data. Should this value get lowered in case of
> predicate pushdown ? Is it necessary to sort the predicate column for this
> to take effect ? Any pointers ?
>
> Regards,
> Projjwal
>


No Drill hangout next Tuesday 19th Sept

2017-09-15 Thread Aman Sinha
Drillers,
Due to developers attending the hackathon, we won't be having the Drill
hangout next Tuesday. Next one will be on Tuesday Oct 3rd.

See you then !
-Aman


[HANGOUT] Suggestions for topics for 08/22

2017-08-21 Thread Aman Sinha
We will have the hangout at 10 AM PST tomorrow 08/22.  Please send in
suggestions for topics or provide it at the beginning of the hangout.

Hangout link:

https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc

Thanks.


Re: Drill Summit/Conference Proposal

2017-06-17 Thread Aman Sinha
Agree with Julian about ApacheCon being a better venue with a track (or
sub-conference ?) for Drill.  The size and scale of the wider conference is
likely to be beneficial despite other competing Apache projects.  Perhaps
the schedule could be managed in such a way that the conflicts with similar
projects is minimized.

-Aman

On Fri, Jun 16, 2017 at 10:02 AM, Julian Hyde  wrote:

> It’s a legitimate concern, but remember that the more “popular” platforms
> tend to have their own conferences already. The Apache projects with the
> most talks at ApacheCon were probably Beam and Spark, with 3 each.
>
> Without strong commercial backing (and a large advertising budget)
> DrillCon is not going to see the promotion or attendance of say Spark
> Summit or Kafka Summit. So, I think we have a glass half full: we can
> promote “DrillCon (occurring as part of ApacheCon)” and promise a a full
> day of fascinating talks by Drill users and developers. We can reach not
> only the 30-50 people who came for Drill, but the 600 attendees who came to
> ApacheCon for other reasons and might not be aware of Drill.
>
> Julian
>
>
> > On Jun 16, 2017, at 9:20 AM, Matt K  wrote:
> >
> > A counter point: I would be concerned that Drill would be overshadowed
> by more “popular” or more entrenched platforms.
> >
> > Drill is an excellent and somewhat unique tech that needs more exposure
> to grow. An event that focuses purely on Drill may have better success at
> that.
> >
> > The caveat may be that a Drill only summit may have low turnout and thus
> not succeed at the above goals, thus I’m not certain what the right choice
> is, but I wanted to bring up the point for discussion.
> >
> >
> > On June 14, 2017 at 2:32:45 PM, Julian Hyde (jh...@apache.org) wrote:
> >
> >> I like the idea of co-hosting a conference. ApacheCon in particular is
> a good venue, and they explicitly encourage sub-conferences (there are “Big
> Data” and “IoT” tracks, and this year there were sub-conferences for Tomcat
> and CloudStack). If DrillCon were part of ApacheCon, people could attend a
> whole day of Drill talks, or they could go to talks about other Apache
> projects in the larger conference, and connect with Apache members.
> >>
> >> Also, the conference is professionally organized, at a large hotel with
> good facilities.
> >>
> >> Unfortunately ApacheCon just happened (it was in Miami in May); but
> it’s something to consider next year.
> >>
> >> Julian
> >>
> >>
> >> > On Jun 14, 2017, at 9:18 AM, Charles Givre > wrote:
> >> >
> >> > Hi Bob,
> >> > Good to hear from you. I agree that there could be value in having a
> joint
> >> > Presto/Drill/Redshift conference, but how would you describe the
> overall
> >> > theme?
> >> >
> >> > In essence (not looking to start a flame war here...) these tools are
> >> > similar in terms of what the user experiences and I can definitely see
> >> > value in bringing the communities together. I also like the idea of
> >> > multiple tracks. I was thinking of having something like
> developer/analyst
> >> > tracks.
> >> > -- C
> >> >
> >> > On Wed, Jun 14, 2017 at 11:27 AM, Bob Rudis > wrote:
> >> >
> >> >> I grok this is the Drill list and I'm also a big user of Drill (and
> >> >> have made some UDFs) but there might be some efficacy in expanding
> the
> >> >> scope to the Presto and Redshift Spectrum communities. I'm not
> >> >> claiming there's 100% equivalence, but the broader view of being able
> >> >> to access multiple types of data sources from a central platform is
> >> >> compelling and -- at least in my circles -- not widely known. It
> could
> >> >> be an event with a a primary central track but three separate
> >> >> specializations that have a couple intraday time pockets.
> >> >>
> >> >> On Wed, Jun 14, 2017 at 8:55 AM, Charles Givre  > wrote:
> >> >>> Hello fellow Drill users and developers,
> >> >>> I've been involved with the Drill community for some time, and I was
> >> >>> thinking that it might be time to start exploring the idea of a
> Drill
> >> >>> Summit or Conference. If you're interested, please send me a note
> and
> >> >> I'll
> >> >>> start having some conversations about what's next.
> >> >>>
> >> >>> Personally, I think it could be extremely valuable to get Drill
> >> >> developers
> >> >>> and users together and share ideas about where things are and how
> people
> >> >>> are using Drill.
> >> >>> Thanks!!
> >> >>> -- C
> >> >>
>
>


Re: Getting Exception while running Drill-mondrian cube

2017-06-13 Thread Aman Sinha
Can you provide more details ?  e.g. what query did you run ?
The schema's table entry indicates this is
a Hive table.   Is the Hive plugin enabled in the Drill storage plugin UI ?

Does the simple query  'Select count(*) from hive.`customer_w_ter` ' run ?

On Tue, Jun 13, 2017 at 12:18 PM, Sateesh Karuturi <
sateesh.karutu...@gmail.com> wrote:

> Hello..,
>
> I am trying to execute the Mondrian schema which is integrated with Apache
> Drill. While running this schema I am getting a 'customer_w_ter' table not
> found exception.
>
> here is my schema example:
>
> [The Mondrian schema XML was stripped in transit; only attribute fragments
> survive, e.g. level definitions such as type="String" uniqueMembers="true"
> levelType="Regular" hideMemberIf="Never", a level on column="STATE", and
> measures with formatString="#,###" aggregator="sum".]
>
>
> Please help me out.
>


Re: Drill data and database locality

2017-06-11 Thread Aman Sinha
Ivan,
yes, the scans for the various data sources are expected to use locality
information to perform the table scan.

If you only run the query against mongodb (the right side of union-all) and
the foreman is on server B, does it do the table scan on server A  which is
hosting the mongodb table ?  It is supposed to use locality information, so
if it is not then most likely there is a bug.Does the problem only
occur when the query is accessing multiple data sources including mongo ?

You could  enable DEBUG logging level in logback.xml and check the output
in drillbit.log file.. look for entries for MongoGroupScan that should
provide insight into whether the affinity based assignment is working
correctly.  Each data source has its own implementation of determining
'affinity'.

You may have already seen this, but just in case..here's the doc page for
the mongo plugin: https://drill.apache.org/blog/2014/11/19/sql-on-mongodb/
I am not familiar with this plugin but hopefully the original authors
(listed on that page) can chime in.

-Aman


On Sat, Jun 10, 2017 at 10:02 AM, Ivan Kovacevic 
wrote:

> Greetings,
>
> I have two servers. The first server (A) contains the zookeeper, a mongodb
> database and a drillbit. The second server (B) contains a hadoop
> distribution with several hive tables, a postgresql database and the other
> drillbit. Both drillbits can see eachother on the drill homepages, since
> they are both connected to the zookeeper on server A. When a query like the
> one on the bottom is run (the view dfs.lineorder consists of the hive and
> postgresql data on server B), drillbit B becomes the foreman and executes
> everything. It does not delegate the parts dealing with mongodb tables to
> the drillbit on server A, so 90% of the time required to run the query
> (approx. 30 minutes) is spent on sending the mongodb collection from server
> A to server B. Is there a way to force drillbit B to delegate the mongodb
> part of the query to drillbit A, or could this be a misconfiguration
> problem?
> Also, can drillbits be configured to access different databases (e.g. when
> each drillbit has access to a distinct subnet, so that not all drillbits
> can access the same databases)?
> Thank you.
>
> Sample query:
> SELECT SUM(revenue) AS revenue
> FROM (
> SELECT SUM(lo_extendedprice*lo_discount) AS revenue
> FROM dfs.tmp.lineorder, dfs.tmp.`date`
> WHERE lo_orderdate = d_datekey
> AND d_year = 1993
> AND lo_discount BETWEEN 1 AND 3
> AND lo_quantity < 25
> UNION ALL
> SELECT SUM(lo_extendedprice * lo_discount) AS revenue
> FROM mongo.test.ssb_europe ssb
> WHERE ssb.orderdate.d_year = 1993
> AND lo_discount BETWEEN 1 AND 3
> AND lo_quantity < 25
> );
>
> Configuration (drill-override.conf):
>
> On drillbit A (Windows Server 2008):
>
> drill.exec: {
>   cluster-id: "drillbits1",
>   zk.connect: "serverA:2181",
> impersonation: {
> enabled: true,
> max_chained_user_hops: 3
>   }
> }
>
> On drillbit B (Cloudera CDH 5.8.0):
> drill.exec: {
>   cluster-id: "drillbits1",
>   zk.connect: "serverA:2181",
>   impersonation: {
> enabled: true,
> max_chained_user_hops: 3
>   }
> }
>
>
> Best regards,
> Ivan
>


Re: Running cartesian joins on Drill

2017-05-11 Thread Aman Sinha
Muhammad,
The join condition  ‘a = b or (a is null && b is null)’ works.  Internally, 
this is converted to  ‘a is not distinct from b’ which is processed by Drill.
For some reason, if the second form is directly supplied in the user query, it
does not work and ends up being treated as a Cartesian join condition.  Drill leverages
Calcite for this (you can see CALCITE-1200 for some background).
Can you file a JIRA for this ?

-Aman
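
A distilled version of the two spellings, with hypothetical table and column
names:

-- works today: Drill recognizes this disjunction and converts it to
-- IS NOT DISTINCT FROM internally, so it plans as a regular equi-join
SELECT * FROM t0 INNER JOIN t1
  ON t0.id = t1.id OR (t0.id IS NULL AND t1.id IS NULL);

-- currently fails planning with the Cartesian join error, although it is
-- the same predicate written directly
SELECT * FROM t0 INNER JOIN t1 ON t0.id IS NOT DISTINCT FROM t1.id;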

From: "Aman Sinha (asi...@mapr.com)" <asi...@mapr.com>
Date: Thursday, May 11, 2017 at 4:29 PM
To: dev <d...@drill.apache.org>, user <user@drill.apache.org>
Cc: Shadi Khalifa <khal...@cs.queensu.ca>
Subject: Re: Running cartesian joins on Drill


I think Muhammad may be trying to run his original query with IS NOT DISTINCT 
FROM.   That discussion got side-tracked into Cartesian joins because his query 
was not getting planned and the error was about Cartesian join.

Muhammad,  can you try with the equivalent version below ?  You mentioned the 
rewrite but did you try the rewritten version ?



SELECT * FROM (SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc

LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM

`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
`t0`.`UserID` = `t1`.`UserID` OR (`t0`.`UserID` IS NULL && `t1`.`UserID` IS
NULL) )



On 5/11/17, 3:23 PM, "Zelaine Fong" <zf...@mapr.com> wrote:



I’m not sure why it isn’t working for you.  Using Drill 1.10, here’s my 
output:



0: jdbc:drill:zk=local> alter session set 
`planner.enable_nljoin_for_scalar_only` = false;

+---+-+

|  ok   | summary |

+---+-+

| true  | planner.enable_nljoin_for_scalar_only updated.  |

+---+-+

1 row selected (0.137 seconds)

0: jdbc:drill:zk=local> explain plan for select * from 
dfs.`/Users/zfong/foo.csv` t1, dfs.`/Users/zfong/foo.csv` t2;

+--+--+

| text | json |

+--+--+

| 00-00Screen

00-01  ProjectAllowDup(*=[$0], *0=[$1])

00-02NestedLoopJoin(condition=[true], joinType=[inner])

00-04  Project(T2¦¦*=[$0])

00-06Scan(groupscan=[EasyGroupScan 
[selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`], 
files=[file:/Users/zfong/foo.csv]]])

00-03  Project(T3¦¦*=[$0])

00-05Scan(groupscan=[EasyGroupScan 
[selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`], 
files=[file:/Users/zfong/foo.csv]]])



-- Zelaine



On 5/11/17, 3:17 PM, "Muhammad Gelbana" <m.gelb...@gmail.com> wrote:



​But the query I provided failed to be planned because it's a cartesian

join, although I've set the option you mentioned to false. Is there a

reason why wouldn't Drill rules physically implement the logical join 
in my

query to a nested loop join ?



*-*

*Muhammad Gelbana*

http://www.linkedin.com/in/mgelbana



On Thu, May 11, 2017 at 5:05 PM, Zelaine Fong <zf...@mapr.com> wrote:



> Provided `planner.enable_nljoin_for_scalar_only` is set to false, even

> without an explicit join condition, the query should use the Cartesian

> join/nested loop join.

>

> -- Zelaine

>

> On 5/11/17, 4:20 AM, "Anup Tiwari" <anup.tiw...@games24x7.com> wrote:

>

> Hi,

>

> I have one question here.. so if we have to use Cartesian join in 
Drill

> then do we have to follow some workaround like Shadi mention : 
adding a

> dummy column on the fly that has the value 1 in both tables and 
then

> join

> on that column leading to having a match of every row of the first

> table

> with every row of the second table, hence do a Cartesian product?

> OR

> If we just don't specify join condition like :

> select a.*, b.* from tt1 as a, tt2 b; then will it internally 
treat

> this

> query as Cartesian join.

>

> Regards,

> *Anup Tiwari*

>

> On Mon, May 8, 2017 at 10:00 PM, Zelaine Fong <zf...@mapr.com> 
wrote:

>

> > Cartesian joins in Drill are implemented as nested loop joins, 
and I

> think

> > you should see that reflected in the resultant query plan when 
you

> run

> > explain plan on the query.

> >

> > Yes, Cartesian j

Re: Running cartesian joins on Drill

2017-05-11 Thread Aman Sinha
I think Muhammad may be trying to run his original query with IS NOT DISTINCT 
FROM.   That discussion got side-tracked into Cartesian joins because his query 
was not getting planned and the error was about Cartesian join.

Muhammad,  can you try with the equivalent version below ?  You mentioned the 
rewrite but did you try the rewritten version ?



SELECT * FROM (SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc

LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM

`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
`t0`.`UserID` = `t1`.`UserID` OR (`t0`.`UserID` IS NULL && `t1`.`UserID` IS
NULL) )



On 5/11/17, 3:23 PM, "Zelaine Fong"  wrote:



I’m not sure why it isn’t working for you.  Using Drill 1.10, here’s my 
output:



0: jdbc:drill:zk=local> alter session set 
`planner.enable_nljoin_for_scalar_only` = false;

+---+-+

|  ok   | summary |

+---+-+

| true  | planner.enable_nljoin_for_scalar_only updated.  |

+---+-+

1 row selected (0.137 seconds)

0: jdbc:drill:zk=local> explain plan for select * from 
dfs.`/Users/zfong/foo.csv` t1, dfs.`/Users/zfong/foo.csv` t2;

+--+--+

| text | json |

+--+--+

| 00-00Screen

00-01  ProjectAllowDup(*=[$0], *0=[$1])

00-02NestedLoopJoin(condition=[true], joinType=[inner])

00-04  Project(T2¦¦*=[$0])

00-06Scan(groupscan=[EasyGroupScan 
[selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`], 
files=[file:/Users/zfong/foo.csv]]])

00-03  Project(T3¦¦*=[$0])

00-05Scan(groupscan=[EasyGroupScan 
[selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`], 
files=[file:/Users/zfong/foo.csv]]])



-- Zelaine



On 5/11/17, 3:17 PM, "Muhammad Gelbana"  wrote:



​But the query I provided failed to be planned because it's a cartesian

join, although I've set the option you mentioned to false. Is there a

reason why wouldn't Drill rules physically implement the logical join 
in my

query to a nested loop join ?



*-*

*Muhammad Gelbana*

http://www.linkedin.com/in/mgelbana



On Thu, May 11, 2017 at 5:05 PM, Zelaine Fong  wrote:



> Provided `planner.enable_nljoin_for_scalar_only` is set to false, even

> without an explicit join condition, the query should use the Cartesian

> join/nested loop join.

>

> -- Zelaine

>

> On 5/11/17, 4:20 AM, "Anup Tiwari"  wrote:

>

> Hi,

>

> I have one question here.. so if we have to use Cartesian join in 
Drill

> then do we have to follow some workaround like Shadi mention : 
adding a

> dummy column on the fly that has the value 1 in both tables and 
then

> join

> on that column leading to having a match of every row of the first

> table

> with every row of the second table, hence do a Cartesian product?

> OR

> If we just don't specify join condition like :

> select a.*, b.* from tt1 as a, tt2 b; then will it internally 
treat

> this

> query as Cartesian join.

>

> Regards,

> *Anup Tiwari*

>

> On Mon, May 8, 2017 at 10:00 PM, Zelaine Fong  
wrote:

>

> > Cartesian joins in Drill are implemented as nested loop joins, 
and I

> think

> > you should see that reflected in the resultant query plan when 
you

> run

> > explain plan on the query.

> >

> > Yes, Cartesian joins/nested loop joins are expensive because 
you’re

> > effectively doing an MxN read of your tables.  There are more

> efficient

> > ways of processing a nested loop join, e.g., by creating an 
index on

> the

> > larger table in the join and then using that index to do lookups

> into that

> > table.  That way, the nested loop join cost is the cost of 
creating

> the

> > index + M, where M is the number of rows in the smaller table 
and

> assuming

> > the lookup cost into the index does minimize the amount of data 
read

> of the

> > second table.  Drill currently doesn’t do this.

> >

> > -- Zelaine

> >

> > On 5/8/17, 9:09 AM, "Muhammad Gelbana"  
wrote:

> >

> > ​I believe 

Hangout starting at 10am PST

2017-04-04 Thread Aman Sinha
Hangout link:
https://hangouts.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc


Re: Drill query planning taking a LONG time

2017-02-13 Thread Aman Sinha
In your drillbit.log file, can you look for the entries for the foreman node to 
see where the time is being spent ? 
e.g entries of the following type: 
  [275dec51-fcc1-f1bf-cb2f-57a838805a82:foreman] INFO  
o.a.d.exec.store.parquet.Metadata - Took 64 ms to read metadata from cache file

Each entry is timestamped, so if you see a long gap between two of them, that 
will give an idea about what’s going on.   
I am not familiar with how the S3 is set up.  My guess is there are some latency
issues there that could be causing it, but let's first get the log output.

-Aman
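
If the cache turns out to be stale or missing for some directories, it can be
rebuilt before re-running the query (the path here is hypothetical):

-- regenerates the Parquet metadata cache files for the table directory so
-- that planning does not have to re-read every file footer
REFRESH TABLE METADATA dfs.`/data/transactions`;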

On 2/13/17, 12:05 PM, "David Kincaid"  wrote:

We've just recently started working with Drill and I'm seeing something
that doesn't seem right and I'm not sure how to troubleshoot. We have 100
Parquet files which are about 400MB each, using Snappy compression.
While trying to query this data I am seeing extraordinary planning time for
certain queries. The planning time is about 12 minutes and the actual
execution of the query is less than 2 minutes.

A few details of our setup. We are running Drill on an AWS EMR cluster on
m4.16xlarge nodes (64 cores, 256GB each). We've given Drill an 8GB Java
heap and 100GB Java direct memory. We have verified that the metadata cache
file is being created and used. We have tried a cluster of 10 nodes and a
cluster of 2 nodes with no difference in planning time or execution time.

Does anyone have some pointers on troubleshooting excessive planning time?
It seems like we must have something misconfigured or are missing
something. We're very new to Drill and I think I've exhausted all my
troubleshooting ideas so far. Any tips anyone can provide?

Here is the main query I've been experimenting with, so you can get a feel
for the query complexity:

select fltb1.sapId, yearmo,
COUNT(*) as totalcnt,
count(distinct(CASE
   WHEN
   (REPEATED_CONTAINS(fltb1.classLabels,
'Thing:Service:MedicalService:Diagnostic:Radiology:Ultrasound.*'))
   THEN fltb1.invoiceId
   END)) as ultracount,
count(distinct (CASE
WHEN
(REPEATED_CONTAINS(fltb1.classLabels,
'Thing:Service:MedicalService:Diagnostic:LaboratoryTest.*'))
THEN fltb1.invoiceId
END)) as labcount
from (select sapid, invoiceId, TO_CHAR(TO_TIMESTAMP(transactionDate,
'-MM-dd HH:mm:ss.SS'), '-MM') yearmo, classLabels
  from s3.cisexport.transactionView) fltb1
group by fltb1.sapId, yearmo;

Thanks,

Dave




Apache Drill Hangout Minutes (1/10/2017)

2017-01-10 Thread Aman Sinha
Attendees:  Arina, Aman, Khurram, Karthik, Paul, Roman



1.   Khurram wanted to know the status of Calcite rebasing. Roman indicated 
it was in progress but currently on temporary hold due to a couple of
higher-priority issues.

2.   Lazy initialization for UDFs:  Paul wanted more background on this for 
the review.  Arina provided some details and will sync up with Paul.

3.   Case-sensitivity of temporary table names based on review comments: 
after some discussion, we concluded that it should be case-insensitive, similar 
to the alias names for columns and tables.

4.   Function names without parentheses:  suggest examining how 
current_date, current_time are implemented.  However, it will be easier to 
implement using parentheses since there is no confusion with column names
(see the check below).
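
For reference, a minimal check of the existing niladic functions:

-- CURRENT_DATE and CURRENT_TIMESTAMP already parse without parentheses
SELECT CURRENT_DATE, CURRENT_TIMESTAMP FROM (VALUES(1));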


Re: Suggestion for topics for Drill hangout tomorrow

2017-01-10 Thread Aman Sinha
Hangout starting .. link: 
https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc

On 1/9/17, 5:24 PM, "Aman Sinha" <asi...@mapr.com> wrote:

Hi all,

The bi-weekly hangout is tomorrow (01/10/17, 10 AM PST). If you have
any suggestions for topics for tomorrow, please add to this thread. We will 
also ask for
topics at the beginning of the hangout.

Thanks.




Suggestion for topics for Drill hangout tomorrow

2017-01-09 Thread Aman Sinha
Hi all,

The bi-weekly hangout is tomorrow (01/10/17, 10 AM PST). If you have
any suggestions for topics for tomorrow, please add to this thread. We will 
also ask for
topics at the beginning of the hangout.

Thanks.


Re: [1.9.0] : UserException: SYSTEM ERROR: IllegalReferenceCountException: refCnt: 0 and then SYSTEM ERROR: IOException: Failed to shutdown streamer

2016-12-08 Thread Aman Sinha
Hi Anup,
since your original query was working on 1.6 and failed in 1.9,  could you
please file a JIRA for this ?  It sounds like a regression related to
evaluation of a Project expression (based on the stack trace).  Since there
are several CASE exprs, it is quite likely something related to their evaluation.
It would be great if you can provide some sample data for someone to
debug.
Thanks.

On Thu, Dec 8, 2016 at 12:50 AM, Anup Tiwari 
wrote:

> Hi,
>
> I have removed a few conditions from my query and then it just worked fine.
>
> Also can someone tell me in which scenarios we throw "
> *IllegalReferenceCountException*" and how to handle it in different
> scenarios ?
>
> I got this in another query too, and removing some conditions made it work
> for me; but when I executed the removed conditions alone in a CTAS, it
> executed successfully.
>
> Regards,
> *Anup Tiwari*
>
> On Wed, Dec 7, 2016 at 12:22 PM, Anup Tiwari 
> wrote:
>
> > Hi Team,
> >
> > I am getting the below 2 errors in one of my queries which was working fine
> > on 1.6. Please help me out with this:-
> >
> > 1. UserException: SYSTEM ERROR: IllegalReferenceCountException: refCnt:
> 0
> > 2. SYSTEM ERROR: IOException: Failed to shutdown streamer
> >
> > Please find below query and its stack trace :-
> >
> > *Query :-*
> >
> > create table a_tt3_reg_login as
> > select sessionid,
> >
> > count(distinct (case when ((( event = 'e.a' and ajaxUrl like
> > '%/ab/pL%t=r%' ) or ( (Base64Conv(Response) like '%st%tr%' and
> > Base64Conv(Response) not like '%error%') and ajaxUrl like '%/sign/ter%'
> ))
> > OR ( event = 'e.a' and ajaxUrl like '%/player/ter/ter.htm%' and
> > Base64Conv(Response) like '%st%tr%ter%tr%')  OR (id = '/ter/thyou.htm'
> and
> > url = '/pla/natlob.htm')) then sessionid end) )  as  regs,
> >
> > count(distinct (case when ( ajaxUrl like '%/signup/poLo%t=log%' and event
> > = 'e.a' ) or ( event = 'e.a' and ajaxUrl like '%j_spring_security_check%'
> > and Base64Conv(Response)  like '%st%tr%') then sessionid end) ) as login
> ,
> >
> > count(distinct (case when ((ajaxUrl like '/pl%/loadResponsePage.htm%fD=
> true=yes%'
> > or ajaxUrl like '/pl%/loadResponsePage.htm%fD=true=YES%') OR
> (ajaxUrl
> > like 'loadSuccessPage.do%fD=true=yes%' or ajaxUrl like
> > 'loadSuccessPage.do%fD=true=YES%'))  then sessionid end) ) as fd ,
> >
> > count(distinct (case when ((ajaxUrl like '/pl%/loadResponsePage.htm%fD=
> false=yes%'
> > or ajaxUrl like '/pl%/loadResponsePage.htm%fD=false=YES%') OR
> > (ajaxUrl like 'loadSuccessPage.do%fD=false=yes%' or ajaxUrl like
> > 'loadSuccessPage.do%fD=false=YES%')) then sessionid end) ) as rd
> >
> > from
> > tt2
> > group by sessionid;
> > Error: SYSTEM ERROR: IllegalReferenceCountException: refCnt: 0
> >
> > Fragment 14:19
> >
> > [Error Id: e4659753-f8d0-403c-9eec-0ff6f2e30dd9 on namenode:31010]
> > (state=,code=0)
> >
> >
> > *Stack Trace From Drillbit.log:-*
> >
> > [Error Id: e4659753-f8d0-403c-9eec-0ff6f2e30dd9 on namenode:31010]
> > org.apache.drill.common.exceptions.UserException: SYSTEM ERROR:
> > IllegalReferenceCountException: refCnt: 0
> >
> > Fragment 14:19
> >
> > [Error Id: e4659753-f8d0-403c-9eec-0ff6f2e30dd9 on namenode:31010]
> > at org.apache.drill.common.exceptions.UserException$
> > Builder.build(UserException.java:543) ~[drill-common-1.9.0.jar:1.9.0]
> > at org.apache.drill.exec.work.fragment.FragmentExecutor.
> > sendFinalState(FragmentExecutor.java:293) [drill-java-exec-1.9.0.jar:1.
> > 9.0]
> > at org.apache.drill.exec.work.fragment.FragmentExecutor.
> > cleanup(FragmentExecutor.java:160) [drill-java-exec-1.9.0.jar:1.9.0]
> > at org.apache.drill.exec.work.fragment.FragmentExecutor.run(
> FragmentExecutor.java:262)
> > [drill-java-exec-1.9.0.jar:1.9.0]
> > at org.apache.drill.common.SelfCleaningRunnable.run(
> SelfCleaningRunnable.java:38)
> > [drill-common-1.9.0.jar:1.9.0]
> > at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
> > [na:1.8.0_74]
> > at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
> > [na:1.8.0_74]
> > at java.lang.Thread.run(Thread.java:745) [na:1.8.0_74]
> > Caused by: io.netty.util.IllegalReferenceCountException: refCnt: 0
> > at io.netty.buffer.AbstractByteBuf.ensureAccessible(
> AbstractByteBuf.java:1178)
> > ~[netty-buffer-4.0.27.Final.jar:4.0.27.Final]
> > at io.netty.buffer.DrillBuf.checkIndexD(DrillBuf.java:115)
> > ~[drill-memory-base-1.9.0.jar:4.0.27.Final]
> > at io.netty.buffer.DrillBuf.chk(DrillBuf.java:147)
> > ~[drill-memory-base-1.9.0.jar:4.0.27.Final]
> > at io.netty.buffer.DrillBuf.getByte(DrillBuf.java:775)
> > ~[drill-memory-base-1.9.0.jar:4.0.27.Final]
> > at org.apache.drill.exec.expr.fn.impl.CharSequenceWrapper.
> > isAscii(CharSequenceWrapper.java:143) ~[drill-java-exec-1.9.0.jar:1.9.0]
> > at 

Re: querying rest services

2016-12-06 Thread Aman Sinha
Please see https://drill.apache.org/docs/rest-api/ and check whether it
satisfies your needs.


On Tue, Dec 6, 2016 at 5:51 AM, Remzi Düzağaç  wrote:

> Hi,
>
> I would like to query rest services like solr or elasticsearch rest
> interfaces (or any rest service)
> is it possible via apache drill?
>
> thx
>


Re: Table Metadata Question

2016-12-04 Thread Aman Sinha
Charles,
Drill does not have a metastore for tables, so unless you have defined a
view with CAST or are querying Hive tables (Hive has a metastore), the
column types are determined at run time.  Have you tried the typeof()
function?
  SELECT  typeof(column) FROM dfs.`data.json` LIMIT 1;


On Sun, Dec 4, 2016 at 8:47 AM, Charles Givre  wrote:

> Hello all,
> I’m trying to figure out if there is some way to determine what types of
> columns a given data source contains.  It looks like this should exist in
> the INFORMATION_SCHEMA somewhere, but I’m not clear how to do it.
>
> For instance, let’s say that I have a file called data.json.  I could
> execute:
> SELECT *
> FROM dfs.`data.json`
> LIMIT 0
> and get the column names, but is there some way to get at the data types
> for these columns?
> TIA,
> — Charles


Re: Building a LogicalPlan

2016-12-02 Thread Aman Sinha
Unfortunately, the 'LogicalPlan' structure that is created by the
LogicalPlanBuilder does not go through the full Drill query optimization
process.  You are better off starting with a Calcite Rel and then building
a Drill logical plan with 'Rels' (e.g., DrillFilterRel, DrillProjectRel,
etc.).
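
As Ted suggests below, one practical way to see the shape Drill expects is to
capture the plan of an equivalent SQL query; the json column of the EXPLAIN
output shows the full operator graph.  The plugin name 'pg' is taken from your
snippet; the table and column names here are hypothetical:

  EXPLAIN PLAN FOR SELECT col1 FROM pg.`some_table` WHERE col1 > 0;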

On Fri, Dec 2, 2016 at 2:21 AM, Ted Dunning  wrote:

> It might be easier to start by stealing a plan from a normal SQL query and
> then trying to build that.  Not that this is a long-term solution, but it
> gets you going with the many unstated assumptions about a plan that might
> be present.
>
> That would give you something in JSON, however, rather than Java code.
> That still might be very useful.
>
>
>
> On Fri, Dec 2, 2016 at 1:59 AM, Chris Baynes  wrote:
>
> > Hi,
> >
> > We have a use case in which we want to construct queries programmatically
> > which Drill would then execute.
> >
> > So far I've been able to configure a Jdbc StorageEngine, and initialize a
> > LogicalPlan (using the builder) with that. I am having difficulty trying
> to
> > configure scan, project, and filters. I know I need to construct a
> > LogicalOperator, and these are implemented in
> > org.apache.drill.common.logical.data
> >
> > However to construct an operator instance I need one or more of:
> > JSONOptions, NamedExpression, LogicalExpression, FieldReference.
> >
> > So right now I have something like this, but don't know how to fill in
> the
> > missing pieces...
> >
> > // scan
> > JSONOptions opts = new JSONOptions(...);
> > builder.addLogicalOperator(Scan.builder().storageEngine("
> > pg").selection(opts).build());
> >
> > // project
> > NamedExpression nex = new NamedExpression(...);
> > builder.addLogicalOperator(Project.builder().addExpr(nex));
> >
> > LogicalPlan plan = builder.build();
> >
> > Is there any documentation yet, similar to what calcite has for the
> > RelBuilder (https://calcite.apache.org/docs/algebra.html)?
> >
> > I suppose I could also construct a RelNode and then try to convert that
> to
> > a LogicalPlan, would I be missing out on any drill features this way?
> >
> > Thanks in advance
> >
>


Hangout starting now

2016-09-20 Thread Aman Sinha
Hangout link -
https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc


Suggest topics for hangout tomorrow (9/20)

2016-09-19 Thread Aman Sinha
I'll start the hangout tomorrow at the usual time.  I don't have a set
agenda yet but if there are any topics folks wish to discuss, please
respond on this thread such that others who might be interested can also
join.

Thanks.


Re: LIMIT push down to parquet row group

2016-09-19 Thread Aman Sinha
Adding to what Jinfeng said, the LIMIT handling relies on a 'kill incoming
input stream' API, which the parent operator calls on its child once the
parent (the Limit) has received the required number of rows.   Since the unit
of processing in Drill is record batches, the downstream operator needs to
wait until at least 1 batch (not 1 row) has been received.  In this case, the
batch size happens to be 32K records - this is an internal constant in the
Parquet reader if any of the columns is variable width; see
https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/ParquetRecordReader.java#L69
.
I think we could enhance this behavior for LIMIT so that the internal
batch size is aware of the limit value.  Do you want to file an enhancement
JIRA?
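
To make the observed behavior concrete (the file path is hypothetical):

  SELECT * FROM dfs.`/data/large.parquet` LIMIT 1;
  -- the Parquet reader still fills its first 32K-record batch before the
  -- Limit operator receives any rows and can cancel the incoming stream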

On Mon, Sep 19, 2016 at 7:28 AM, Jinfeng Ni  wrote:

> Drill applies LIMIT filtering at row group level.  For LIMIT n, it
> will scan the first m row groups that have at least n rows, and
> discard the rest of row groups. In your case, since you have only 1
> row group, it does not have any row group filtering for LIMIT 1.
>
> I'm not sure how 32767 comes from. It's possibly that's the size of
> the first batch, depending on the column data in your file.
>
> On Mon, Sep 19, 2016 at 7:09 AM, Veera Naranammalpuram
>  wrote:
> > Does anyone know how and if the LIMIT push down to Parquet file works?
> >
> > I have a parquet file with 53K records in 1 row group. When I run a
> SELECT
> > * from  LIMIT 1, I see the Parquet reader operator process 32768
> > records. I would have expected either 1 or 53K. So questions;
> >
> > 1) Does the Parquet MR library offer the ability to push down LIMITs to
> > Parquet files? From the above, the answer looks like yes.
> > 2) If so, how does Drill come up with the magic number 32767? Is there a
> > way I can make it read just 1 row if the query is a LIMIT 1?
> >
> > --
> > Veera Naranammalpuram
> > Product Specialist - SQL on Hadoop
> > *MapR Technologies (www.mapr.com )*
> > *(Email) vnaranammalpu...@maprtech.com *
> > *(Mobile) 917 683 8116 - can text *
> > *Timezone: ET (UTC -5:00 / -4:00)*
>


Re: Table value functions in Drill

2016-07-28 Thread Aman Sinha
Tushar,
Drill does not currently support table functions as described (it looks
like your example is close to MS SQL Server syntax).
Drill does support a VALUES list in the FROM clause and a table with format
options (see [1]) to interpret text data, but neither of these matches your
requirement; minimal sketches of both follow below.

[1]
https://drill.apache.org/docs/plugin-configuration-basics/#using-the-formats-attributes-as-table-function-parameters
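
For reference, sketches of the two supported forms (paths and option values
are hypothetical):

  -- VALUES list in the FROM clause
  SELECT * FROM (VALUES (1, 'aaa'), (2, 'bbb')) AS t(id, name);

  -- table with format options, per [1]
  SELECT * FROM table(dfs.`/data/sample.csv`(type => 'text',
      fieldDelimiter => ',', extractHeader => true));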

On Thu, Jul 28, 2016 at 2:44 AM, Tushar Pathare  wrote:

> Hello Team,
> Any update from the developers or users or comments on this.
>
> Tushar B Pathare
> High Performance Computing (HPC) Administrator
> General Parallel File System
> Scientific Computing
> Bioinformatics Division
> Research
>
> Sidra Medical and Research Centre
> Sidra OPC Building
> PO Box 26999  |  Doha, Qatar
> Near QNCC,5th Floor
> Office 4003  ext 37443 | M +974 74793547
> tpath...@sidra.org | www.sidra.org<
> http://www.sidra.org/>
>
>
>
> From: Tushar Pathare >
> Date: Monday, July 25, 2016 at 2:01 PM
> To: "user@drill.apache.org" <
> user@drill.apache.org>
> Subject: Table value functions in Drill
>
> Like suppose I need to get the following sql function in drill
>
> create function abc ()
> returns @xyz table (id int,name varchar(100))
> as
> begin
> insert into @xyz
> select 1 id,'aaa' name
> union
> select 2 id,'bbb' name
> return
> end
>
> Tushar B Pathare
> High Performance Computing (HPC) Administrator
> General Parallel File System
> Scientific Computing
> Bioinformatics Division
> Research
>
> Sidra Medical and Research Centre
> Sidra OPC Building
> PO Box 26999  |  Doha, Qatar
> Near QNCC,5th Floor
> Office 4003  ext 37443 | M +974 74793547
> tpath...@sidra.org | www.sidra.org<
> http://www.sidra.org/>
>
>
> Disclaimer: This email and its attachments may be confidential and are
> intended solely for the use of the individual to whom it is addressed. If
> you are not the intended recipient, any reading, printing, storage,
> disclosure, copying or any other action taken in respect of this e-mail is
> prohibited and may be unlawful. If you are not the intended recipient,
> please notify the sender immediately by using the reply function and then
> permanently delete what you have received. Any views or opinions expressed
> are solely those of the author and do not necessarily represent those of
> Sidra Medical and Research Center.
>


Re: Tableau Web Data Connector

2016-07-25 Thread Aman Sinha
Steve,
As far as I know, this has not been written (or perhaps someone has written
one but not yet contributed it).
I agree that it would certainly be useful functionality.

-Aman

On Mon, Jul 25, 2016 at 11:17 AM, Steve Warren  wrote:

> Has anyone written a Tableau Web Data Connector for Drill? I noticed
> prestodb.io has one and it really opens up the ability to interface with
> drill over the internet.
>
> --
> Confidentiality Notice and Disclaimer:  The information contained in this
> e-mail and any attachments, is not transmitted by secure means and may also
> be legally privileged and confidential.  If you are not an intended
> recipient, you are hereby notified that any dissemination, distribution, or
> copying of this e-mail is strictly prohibited.  If you have received this
> e-mail in error, please notify the sender and permanently delete the e-mail
> and any attachments immediately. You should not retain, copy or use this
> e-mail or any attachment for any purpose, nor disclose all or any part of
> the contents to any other person. MyVest Corporation, MyVest Advisors and
> their affiliates accept no responsibility for any unauthorized access
> and/or alteration or dissemination of this communication nor for any
> consequence based on or arising out of the use of information that may have
> been illegitimately accessed or altered.
>


[ANNOUNCE] Apache Drill 1.7.0 released

2016-06-28 Thread Aman Sinha
On behalf of the Apache Drill community, I am happy to announce the
release of Apache Drill 1.7.0.

The source and binary artifacts are available at [1]
Review a complete list of fixes and enhancements at [2]

This release of Drill fixes many issues and introduces a number of
enhancements, including JMX enablement for monitoring, support for the Hive
CHAR type, and HBase 1.x support.

Thanks to everyone in the community who contributed to this release.

[1] https://drill.apache.org/download/
[2] https://drill.apache.org/docs/apache-drill-1-7-0-release-notes/


-Aman


Re: DRILL-4199: Add Support for HBase 1.X - planning to merge

2016-06-21 Thread Aman Sinha
; >> 01-06Project(row_key0=[$0], v0=[$1], ITEM=[$2])
> >> 01-08  *BroadcastExchange*
> >> 02-01Project(row_key=[$0], v=[$1],
> >> ITEM=[ITEM($1, 'v')])
> >> 02-02  Scan(groupscan=[HBaseGroupScan
> >> [HBaseScanSpec=HBaseScanSpec [tableName=offers_nation_idx,
> >> startRow=0br\x00, stopRow=0bs, filter=FilterList AND (2/2): [RowFilter
> >> (GREATER, 0br), RowFilter (LESS, 0bs)]], columns=[`row_key`, `v`,
> >> `v`.`v`]]])
> >>
> >>
> >> This is the plan that fails which have more than 5 nodes:
> >> 0: jdbc:drill:zk=xxx:> explain plan for select
> >> CONVERT_FROM(BYTE_SUBSTR(`ref0`.row_key,-8,8),'BIGINT_BE') as uid,
> >> convert_from(`ref0`.`v`.`v`,'UTF8') as v  from
> hbase.`offers_nation_idx` as
> >> `nation` join hbase.offers_ref0 as `ref0` on
> >>  BYTE_SUBSTR(`ref0`.row_key,-8,8) = nation.`v`.`v` where
> `nation`.row_key
> >>  > '0br' and `nation`.row_key  < '0bs' limit 10;
> >> +--+--+
> >> | text | json |
> >> +--+--+
> >> | 00-00Screen
> >> 00-01  Project(uid=[$0], v=[$1])
> >> 00-02SelectionVectorRemover
> >> 00-03  Limit(fetch=[10])
> >> 00-04UnionExchange
> >> 01-01  SelectionVectorRemover
> >> 01-02Limit(fetch=[10])
> >> 01-03
> Project(uid=[CONVERT_FROMBIGINT_BE(BYTE_SUBSTR($3,
> >> -8, 8))], v=[CONVERT_FROMUTF8(ITEM($4, 'v'))])
> >> 01-04Project(row_key=[$3], v=[$4], ITEM=[$5],
> >> row_key0=[$0], v0=[$1], $f2=[$2])
> >> 01-05  HashJoin(condition=[=($2, $5)],
> >> joinType=[inner])
> >> 01-07Project(row_key=[$0], v=[$1], $f2=[$2])
> >> 01-09  *HashToRandomExchange*(dist0=[[$2]])
> >> 02-01UnorderedMuxExchange
> >> 04-01  Project(row_key=[$0], v=[$1],
> >> $f2=[$2], E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($2)])
> >> 04-02Project(row_key=[$0], v=[$1],
> >> $f2=[BYTE_SUBSTR($0, -8, 8)])
> >> 04-03  Scan(groupscan=[HBaseGroupScan
> >> [HBaseScanSpec=HBaseScanSpec [tableName=offers_ref0, startRow=null,
> >> stopRow=null, filter=null], columns=[`*`]]])
> >> 01-06Project(row_key0=[$0], v0=[$1], ITEM=[$2])
> >> 01-08  Project(row_key=[$0], v=[$1], ITEM=[$2])
> >> 01-10*HashToRandomExchange*(dist0=[[$2]])
> >> 03-01  UnorderedMuxExchange
> >> 05-01Project(row_key=[$0], v=[$1],
> >> ITEM=[$2], E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($2)])
> >> 05-02  Project(row_key=[$0], v=[$1],
> >> ITEM=[ITEM($1, 'v')])
> >> 05-03Scan(groupscan=[HBaseGroupScan
> >> [HBaseScanSpec=HBaseScanSpec [tableName=offers_nation_idx,
> >> startRow=0br\x00, stopRow=0bs, filter=FilterList AND (2/2): [RowFilter
> >> (GREATER, 0br), RowFilter (LESS, 0bs)]], columns=[`row_key`, `v`,
> >> `v`.`v`]]])
> >>
> >> The difference is use *BroadcastExchange *and  *HashToRandomExchange.*
> >>
> >> You can create the JIRA and send me the link .
> >>
> >> Thanks.
> >>
> >>
> >> 2016-06-20 23:44 GMT+08:00 Aman Sinha <amansi...@apache.org>:
> >>
> >>> Hi Qiang,
> >>> were you seeing this same issue with the prior HBase version also ?  (I
> >>> would think this is not a regression).  It would be best to create a
> new
> >>> JIRA and attach the EXPLAIN plans for the successful and failed runs.
> >>> With
> >>> more nodes some minor fragments of the hash join may be getting empty
> >>> input
> >>> batches and I am guessing that has something to do with the
> >>> SchemaChangeException.   Someone would need to debug once you create
> the
> >>> JIRA with relevant details.
> >>>
> >>> -Aman
> >>>
> >>> On Mon, Jun 20, 2016 at 5:13 AM, qiang li <tiredqi...@gmail.com>
> wrote:
> >>>
> >>> > Thanks Aditya.
> >>> >
> >>> > By the way, I found another issue.
> >>> >
> >>> > Let say I have two tables.

Re: DRILL-4199: Add Support for HBase 1.X - planning to merge

2016-06-20 Thread Aman Sinha
Hi Qiang,
were you seeing this same issue with the prior HBase version also?  (I
would think this is not a regression.)  It would be best to create a new
JIRA and attach the EXPLAIN plans for the successful and failed runs.  With
more nodes, some minor fragments of the hash join may be getting empty input
batches, and I am guessing that has something to do with the
SchemaChangeException.   Someone would need to debug this once you create the
JIRA with relevant details.

-Aman

On Mon, Jun 20, 2016 at 5:13 AM, qiang li <tiredqi...@gmail.com> wrote:

> Thanks Aditya.
>
> By the way, I found another issue.
>
> Let say I have two tables.
>
> offers_ref0 : rowkey salt(1byte)+long uid(8 byte ) , family: v,  qualifier:
> v(string)
> offers_nation_idx: rowkey salt(1byte) + string, family:v, qualifier: v(long
> 8 byte)
>
> there is the SQL:
>
> select CONVERT_FROM(BYTE_SUBSTR(`ref0`.row_key,-8,8),'BIGINT_BE') as uid,
> convert_from(`ref0`.`v`.`v`,'UTF8') as v  from hbase.`offers_nation_idx` as
> `nation` join hbase.offers_ref0 as `ref0` on
> CONVERT_FROM(BYTE_SUBSTR(`ref0`.row_key,-8,8),'BIGINT_BE') =
> CONVERT_FROM(nation.`v`.`v`,'BIGINT_BE') where `nation`.row_key  > '0br'
> and `nation`.row_key  < '0bs' limit 10
>
> When I execute the query with single node or less than 5 nodes, its working
> good. But when I execute it in cluster which have about 14 nodes, its throw
> a exception:
>
> First time will throw this exception:
> *Caused by: java.sql.SQLException: SYSTEM ERROR: SchemaChangeException:
> Hash join does not support schema changes*
>
> Then if I query again, it will always throw below exception:
> *Query Failed: An Error Occurred*
> *org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
> IllegalStateException: Failure while reading vector. Expected vector class
> of org.apache.drill.exec.vector.NullableIntVector but was holding vector
> class org.apache.drill.exec.vector.complex.MapVector, field=
> v(MAP:REQUIRED)[v(VARBINARY:OPTIONAL)[$bits$(UINT1:REQUIRED),
> v(VARBINARY:OPTIONAL)[$offsets$(UINT4:REQUIRED)]]] Fragment 12:4 [Error Id:
> 06c6eae4-0822-4714-b0bf-a6e04ebfec79 on xxx:31010]*
>
> Its very strange, and I do not know how to solve it.
> I tried add node to the cluster one by one, it will reproduce when I added
> 5 nodes. Can anyone help me solve this issue?
>
>
>
>
> 2016-06-17 4:39 GMT+08:00 Aditya <adityakish...@gmail.com>:
>
> > https://issues.apache.org/jira/browse/DRILL-4727
> >
> > On Thu, Jun 16, 2016 at 11:39 AM, Aman Sinha <amansi...@apache.org>
> wrote:
> >
> >> Qiang/Aditya can you create a JIRA for this and mark it for 1.7.
> thanks.
> >>
> >> On Thu, Jun 16, 2016 at 11:25 AM, Aditya <adityakish...@gmail.com>
> wrote:
> >>
> >> > Thanks for reporting, I'm looking into it and will post a patch soon.
> >> >
> >> > On Wed, Jun 15, 2016 at 7:27 PM, qiang li <tiredqi...@gmail.com>
> wrote:
> >> >
> >> > > Hi Aditya,
> >> > >
> >> > > I tested the latest version and got this exception and the drillbit
> >> fail
> >> > > to startup .
> >> > >
> >> > > Exception in thread "main" java.lang.NoSuchMethodError:
> >> > > io.netty.util.UniqueName.(Ljava/lang/String;)V
> >> > > at
> >> io.netty.channel.ChannelOption.(ChannelOption.java:136)
> >> > > at
> >> io.netty.channel.ChannelOption.valueOf(ChannelOption.java:99)
> >> > > at
> >> io.netty.channel.ChannelOption.(ChannelOption.java:42)
> >> > > at
> >> > > org.apache.drill.exec.rpc.BasicServer.(BasicServer.java:63)
> >> > > at
> >> > > org.apache.drill.exec.rpc.user.UserServer.(UserServer.java:74)
> >> > > at
> >> > >
> >>
> org.apache.drill.exec.service.ServiceEngine.(ServiceEngine.java:78)
> >> > > at
> >> > org.apache.drill.exec.server.Drillbit.(Drillbit.java:108)
> >> > > at
> >> org.apache.drill.exec.server.Drillbit.start(Drillbit.java:285)
> >> > > at
> >> org.apache.drill.exec.server.Drillbit.start(Drillbit.java:271)
> >> > > at
> >> org.apache.drill.exec.server.Drillbit.main(Drillbit.java:267)
> >> > >
> >> > > It will working if I remove
> jars/3rdparty/netty-all-4.0.23.Final.jar,
> >> the
> >> > > drill can startup. I think there have some package dependency
> version
> >> > > issue, do you think so ?
> >> > >
> >> > >
> >> > >
> >> > > 2016-06-15 8:14 GMT+08:00 Aditya <adityakish...@gmail.com>:
> >> > >
> >> > >> HBase 1.x support has been merged and is available in latest
> >> > >> 1.7.0-SNAPSHOT
> >> > >> builds.
> >> > >>
> >> > >> On Wed, Jun 1, 2016 at 1:23 PM, Aditya <adityakish...@gmail.com>
> >> wrote:
> >> > >>
> >> > >> > Thanks Jacques for promptly reviewing my long series of patches!
> >> > >> >
> >> > >> > I'm planning to merge the HBase 1.x support some time in next 48
> >> > hours.
> >> > >> >
> >> > >> > If anyone else is interested and willing, please review the
> latest
> >> > patch
> >> > >> > here[1].
> >> > >> >
> >> > >> > aditya...
> >> > >> >
> >> > >> > [1] https://github.com/apache/drill/pull/443/files
> >> > >> >
> >> > >>
> >> > >
> >> > >
> >> >
> >>
> >
> >
>


Re: DRILL-4199: Add Support for HBase 1.X - planning to merge

2016-06-16 Thread Aman Sinha
Qiang/Aditya, can you create a JIRA for this and mark it for 1.7?  Thanks.

On Thu, Jun 16, 2016 at 11:25 AM, Aditya  wrote:

> Thanks for reporting, I'm looking into it and will post a patch soon.
>
> On Wed, Jun 15, 2016 at 7:27 PM, qiang li  wrote:
>
> > Hi Aditya,
> >
> > I tested the latest version and got this exception and the drillbit fail
> > to startup .
> >
> > Exception in thread "main" java.lang.NoSuchMethodError:
> > io.netty.util.UniqueName.(Ljava/lang/String;)V
> > at io.netty.channel.ChannelOption.(ChannelOption.java:136)
> > at io.netty.channel.ChannelOption.valueOf(ChannelOption.java:99)
> > at io.netty.channel.ChannelOption.(ChannelOption.java:42)
> > at
> > org.apache.drill.exec.rpc.BasicServer.(BasicServer.java:63)
> > at
> > org.apache.drill.exec.rpc.user.UserServer.(UserServer.java:74)
> > at
> > org.apache.drill.exec.service.ServiceEngine.(ServiceEngine.java:78)
> > at
> org.apache.drill.exec.server.Drillbit.(Drillbit.java:108)
> > at org.apache.drill.exec.server.Drillbit.start(Drillbit.java:285)
> > at org.apache.drill.exec.server.Drillbit.start(Drillbit.java:271)
> > at org.apache.drill.exec.server.Drillbit.main(Drillbit.java:267)
> >
> > It will working if I remove jars/3rdparty/netty-all-4.0.23.Final.jar, the
> > drill can startup. I think there have some package dependency version
> > issue, do you think so ?
> >
> >
> >
> > 2016-06-15 8:14 GMT+08:00 Aditya :
> >
> >> HBase 1.x support has been merged and is available in latest
> >> 1.7.0-SNAPSHOT
> >> builds.
> >>
> >> On Wed, Jun 1, 2016 at 1:23 PM, Aditya  wrote:
> >>
> >> > Thanks Jacques for promptly reviewing my long series of patches!
> >> >
> >> > I'm planning to merge the HBase 1.x support some time in next 48
> hours.
> >> >
> >> > If anyone else is interested and willing, please review the latest
> patch
> >> > here[1].
> >> >
> >> > aditya...
> >> >
> >> > [1] https://github.com/apache/drill/pull/443/files
> >> >
> >>
> >
> >
>


drill hangout link..

2016-06-14 Thread Aman Sinha
hangout starting now:
https://hangouts.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc


Suggestions for hangout topics for 06/14

2016-06-13 Thread Aman Sinha
If you have any suggestions for Drill hangout topics for tomorrow, you can
add them to this thread.  We will also ask around at the beginning of the
hangout for any topics.  The goal is to cover as much as possible
during the one hour.
 A couple of pending topics:
  1.  (leftover from last time) DRILL-4280: Kerberos Authentication (Sudheesh)
  2.  Planning for the 1.7 release

thanks.
Aman


Re: Hangout Frequency

2016-05-20 Thread Aman Sinha
Every other week sounds good to me.  It is a substantial commitment to do
one every week.
Many useful discussions already happen on the dev and user mailing lists.

On Fri, May 20, 2016 at 12:44 PM, Parth Chandra 
wrote:

> Drill Users, Devs,
>
>   Attendance at the hangouts has been getting sparse and it seems like the
> hangouts are too frequent. I'd like to propose that we move to having
> hangouts every other week.
>
>   What do folks think?
>
> Parth
>


Re: query from hbase issue

2016-05-19 Thread Aman Sinha
Khurram, DRILL-4686 seems like a different issue: it reports an
error, whereas the original problem from qiang was an incorrect result.  Can
you use the same version (1.6) that he was using?  Also, is the data set
similar?  If you are unable to reproduce the exact same issue, perhaps qiang
should file a JIRA with a smaller repro if possible.



On Thu, May 19, 2016 at 8:35 AM, Khurram Faraaz 
wrote:

> Hello Qiang,
>
> DRILL-4686 is reported to track this problem.
>
> Thanks,
> Khurram
>
> On Wed, May 18, 2016 at 3:16 PM, qiang li  wrote:
>
>> Ok, Thanks very much.
>>
>> 2016-05-18 17:44 GMT+08:00 Khurram Faraaz :
>>
>>> Hello Qiang,
>>>
>>> Someone from our Drill team (in San Jose) will get back to you soon. I
>>> work from the India lab and I am in a different time zone as compared to
>>> San Jose office, some one from MapR San Jose will get back to you as soon
>>> as possible.
>>>
>>> Thanks,
>>> Khurram
>>>
>>> On Wed, May 18, 2016 at 3:09 PM, qiang li  wrote:
>>>
 Hi Khurram, Thanks very much to reproduce it, so what's the
 conclusion?

 Any idea how to sovle it?

 2016-05-18 17:02 GMT+08:00 Khurram Faraaz :

> So I tried to create the table using HBase API (with no data inserted
> into table) and I got the query plan for drill 1.7.0
> Drill 1.7.0-SNAPSHOT  commit ID :  09b26277
>
> 0: jdbc:drill:schema=dfs.tmp> describe browser_action2;
> +--++--+
> | COLUMN_NAME  | DATA_TYPE  | IS_NULLABLE  |
> +--++--+
> | row_key  | ANY| NO   |
> | v| MAP| NO   |
> +--++--+
> 2 rows selected (1.665 seconds)
>
> Table creation Java program
>
> {noformat}
> public class PutIntDataToHBase {
> public static void main(String args[]) throws IOException {
> Configuration conf = HBaseConfiguration.create();
> conf.set("hbase.zookeeper.property.clientPort","5181");
> HBaseAdmin admin = new HBaseAdmin(conf);
> if (admin.tableExists("browser_action2")) {
> admin.disableTable("browser_action2");
> admin.deleteTable("browser_action2");
> }
>
> byte[][] SPLIT_KEYS =
> {{'0'},{'1'},{'2'},{'3'},{'4'},{'5'},{'6'},{'7'},{'8'},{'9'}};
> HTableDescriptor tableDesc = new
> HTableDescriptor(TableName.valueOf("browser_action2"));
>
> tableDesc.addFamily(new HColumnDescriptor("v"));
> admin.createTable(tableDesc,SPLIT_KEYS);
>
> }
> }
> {noformat}
>
> Query plan for the query that was reported as returning wrong results.
>
> {noformat}
> 0: jdbc:drill:schema=dfs.tmp> explain plan for select
> CONVERT_FROM(a.`v`.`e0`, 'UTF8') as k, count(a.`v`.`e0`) p from
> hbase.browser_action2 a where a.row_key > '0'  group by a.`v`.`e0`;
> +--+--+
> | text | json |
> +--+--+
> | 00-00Screen
> 00-01  Project(k=[$0], p=[$1])
> 00-02UnionExchange
> 01-01  Project(k=[CONVERT_FROMUTF8($0)], p=[$1])
> 01-02HashAgg(group=[{0}], p=[$SUM0($1)])
> 01-03  Project($f0=[$0], p=[$1])
> 01-04HashToRandomExchange(dist0=[[$0]])
> 02-01  UnorderedMuxExchange
> 03-01Project($f0=[$0], p=[$1],
> E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0)])
> 03-02  HashAgg(group=[{0}], p=[COUNT($0)])
> 03-03Project($f0=[ITEM($1, 'e0')])
> 03-04  Scan(groupscan=[HBaseGroupScan
> [HBaseScanSpec=HBaseScanSpec [tableName=browser_action2, startRow=0\x00,
> stopRow=, filter=null], columns=[`*`]]])
> {noformat}
>
> and the query plan for the other problem query mentioned in the first
> email.
>
> {noformat}
> 0: jdbc:drill:schema=dfs.tmp> explain plan for select
> CONVERT_FROM(BYTE_SUBSTR(a.row_key, 1 , 9), 'UTF8')  as k,
> . . . . . . . . . . . . . . > count(a.row_key) p from
> hbase.browser_action2 a group by
> . . . . . . . . . . . . . . > BYTE_SUBSTR(a.row_key, 1 , 9);
> +--+--+
> | text | json |
> +--+--+
> | 00-00Screen
> 00-01  Project(k=[$0], p=[$1])
> 00-02UnionExchange
> 01-01  Project(k=[CONVERT_FROMUTF8($0)], p=[$1])
> 01-02HashAgg(group=[{0}], p=[$SUM0($1)])
> 01-03  Project($f0=[$0], p=[$1])
> 01-04HashToRandomExchange(dist0=[[$0]])
> 02-01  UnorderedMuxExchange
> 03-01Project($f0=[$0], p=[$1],
> 

Re: Re: join fail

2016-05-11 Thread Aman Sinha
Hi,
good to know that the join succeeds with the hash-distributed plan.  For the
broadcast join, I agree that the memory footprint is higher
than it should be.   I thought we had a JIRA for this but I searched
and could not find one, so I have created it:
https://issues.apache.org/jira/browse/DRILL-4667.
Please take a look at the explanation.
In the meantime, I would suggest two things (a sketch of both follows below):
(a) What is the setting for planner.width.max_per_node on your machine?
Can you reduce it by half just for this query and try again?
(b) For pageurl, if you know the max length, you might want to limit the
VARCHAR length, although I suppose you already created the Hive table
with the String type (which gets converted to varchar(64K)).
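
A sketch of both adjustments; the width value and the VARCHAR length are
illustrative, so adapt them to your cluster:

  -- (a) halve the per-node parallelism for this session only
  -- (4 assumes the current value is 8; check sys.options for the actual value)
  ALTER SESSION SET `planner.width.max_per_node` = 4;

  -- (b) bound the pageurl width through a view instead of the 64K default
  CREATE VIEW dfs.tmp.rankings_v AS
  SELECT CAST(pageurl AS VARCHAR(2048)) AS pageurl, pagerank
  FROM hive.rankings_huge;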

-Aman

On Tue, May 10, 2016 at 9:44 PM, lizhenm...@163.com <lizhenm...@163.com>
wrote:

> thank Andries and Aman to help  me.
>
> 1. when i  increase planner.memory.max_query_memory_per_node to 4G,8G, and
> it is not work.
> 2. As Aman said, after i run the analyze on the table, the rowcount of the
> scan is equal the true table, but the join fail yet.
> 3.The join is successful when use distribute plan.
> 3. i just want to know why the broadcast  join falied in the condition
> that the  size of the table that in  join right side is small than the size
> of cluster's total memory
>
>
>
> lizhenm...@163.com
>
> From: Aman Sinha
> Date: 2016-05-10 23:35
> To: user
> Subject: Re: join fail
> It's difficult to debug this type of issue over email thread.  However, 2
> observations:
> 1.  The following Scan which is the table that is broadcast shows a
> rowcount of 1.3M rows whereas your original email says the rowcount is 32M
> rows.  Are you sure Can you confirm what is the correct row count ?  Has
> ANALYZE been run on the Hive table ?
>
> > Scan(groupscan=[HiveScan [table=Table(dbName:default,
> > tableName:rankings_huge), columns=[`pageurl`, `pagerank`],
> numPartitions=0,
> > partitions= null,
> inputDirectories=[hdfs://nameservice1/HiBench/huge/Join/Input/rankings]]])
> > : rowType = RecordType(VARCHAR(65535) pageurl, INTEGER pagerank):
> *rowcount
> > = 1364388.0*
>
>
> 2. Why are you even trying to do a broadcast join for this query ?  Did the
> hash distribute join plan succeed or not ?
>
> -Aman
>
> On Tue, May 10, 2016 at 8:19 AM, Andries Engelbrecht <
> aengelbre...@maprtech.com> wrote:
>
> > See if increasing planner.memory.max_query_memory_per_node helps. It is
> > set to 2G by default.
> >
> > Also see
> >
> > https://community.mapr.com/thread/10329 <
> > https://community.mapr.com/thread/10329>
> > https://community.mapr.com/docs/DOC-1497 <
> > https://community.mapr.com/docs/DOC-1497>
> >
> >
> > --Andries
> >
> >
> > > On May 10, 2016, at 12:06 AM, leezy <lizhenm...@163.com> wrote:
> > >
> > >
> > >
> > > I am sorry the picture cannot be read. here is the text:
> > >
> > >
> > >
> > >
> > > | Minor Fragment ID | Host Name | Start | End | Runtime | Max Records |
> > Max Batches | Last Update | Last Progress | Peak Memory | State |
> > > | 05-00-xx | bigdata2 | 1.529s | 24m46s | 24m45s | 71,555,857 | 21,985
> |
> > 12:16:02 | 12:16:02 | 7.68GB | CANCELLATION_REQUESTED |
> > > | 05-01-xx | bigdata4 | 1.566s | 24m47s | 24m45s | 71,567,860 | 21,990
> |
> > 12:16:02 | 12:16:02 | 7.69GB | CANCELLATION_REQUESTED |
> > > | 05-02-xx | bigdata3 | 1.526s | 22m15s | 22m13s | 71,551,338 | 21,981
> |
> > 12:13:30 | 12:13:25 | 7.68GB | RUNNING |
> > > | 05-03-xx | bigdata2 | 1.530s | 25m16s | 25m15s | 71,565,965 | 21,982
> |
> > 12:16:32 | 12:16:32 | 7.70GB | CANCELLATION_REQUESTED |
> > > | 05-04-xx | bigdata4 | 1.567s | 25m17s | 25m15s | 71,556,216 | 21,988
> |
> > 12:16:32 | 12:16:32 | 7.70GB | CANCELLATION_REQUESTED |
> > > | 05-05-xx | bigdata3 | 1.527s | 22m15s | 22m13s | 71,550,821 | 21,983
> |
> > 12:13:30 | 12:13:25 | 7.68GB | RUNNING |
> > > | 05-06-xx | bigdata2 | 1.531s | 25m46s | 25m45s | 71,558,873 | 21,983
> |
> > 12:17:02 | 12:17:02 | 7.70GB | CANCELLATION_REQUESTED |
> > > | 05-07-xx | bigdata4 | 1.568s | 25m47s | 25m45s | 71,558,657 | 21,982
> |
> > 12:17:02 | 12:17:02 | 7.67GB | CANCELLATION_REQUESTED |
> > > | 05-08-xx | bigdata3 | 1.528s | 22m15s | 22m13s | 71,558,109 | 21,989
> |
> > 12:13:30 | 12:13:25 | 7.69GB | RUNNING |
> > > | 05-09-xx | bigdata2 | 1.532s | 26m17s | 26m15s | 71,558,226 | 21,983
> |
> > 12:17:32 | 12:17:32 | 7.68GB | CANCELLATION_REQUESTED |
> > > | 05-10-xx | bigdata4 | 1.568s | 26m17s | 26m15s | 71,558,359 | 21,980

Re: Partition reading problem (like operator) while using hive partition table in drill

2016-05-10 Thread Aman Sinha
The Drill test team was able to repro this and is now filed as:
https://issues.apache.org/jira/browse/DRILL-4665

On Tue, May 10, 2016 at 8:16 AM, Aman Sinha <amansi...@apache.org> wrote:

> This is supposed to work, especially since LIKE predicate is not even on
> the partitioning column (it should work either way).  I did a quick test
> with file system tables and it works for LIKE conditions.  Not sure yet
> about Hive tables.  Could you pls file a JIRA and we'll follow up.
> Thanks.
>
> -Aman
>
> On Tue, May 10, 2016 at 1:09 AM, Shankar Mane <shankar.m...@games24x7.com>
> wrote:
>
>> Problem:
>>
>> 1. In drill, we are using hive partition table. But explain plan (same
>> query) for like and = operator differs and used all partitions in case of
>> like operator.
>> 2. If you see below drill explain plans: Like operator uses *all*
>> partitions where
>> = operator uses *only* partition filtered by log_date condition.
>>
>> FYI- We are storing our logs in hive partition table (parquet,
>> gz-compressed). Each partition is having ~15 GB data. Below is the
>> describe
>> statement output from hive:
>>
>>
>> / Hive
>>
>> /
>> hive> desc hive_kafkalogs_daily ;
>> OK
>> col_name data_type comment
>> sessionid   string
>> ajaxurl string
>>
>> log_date string
>>
>> # Partition Information
>> # col_name data_type   comment
>>
>> log_date string
>>
>>
>>
>>
>> /* Drill
>> Plan (query with LIKE)
>>
>> ***/
>>
>> explain plan for select sessionid, servertime, ajaxUrl from
>> hive.hive_kafkalogs_daily where log_date = '2016-05-09' and ajaxUrl like
>> '%utm_source%' limit 1 ;
>>
>> +--+--+
>> | text | json |
>> +--+--+
>> | 00-00Screen
>> 00-01  Project(sessionid=[$0], servertime=[$1], ajaxUrl=[$2])
>> 00-02SelectionVectorRemover
>> 00-03  Limit(fetch=[1])
>> 00-04UnionExchange
>> 01-01  SelectionVectorRemover
>> 01-02Limit(fetch=[1])
>> 01-03  Project(sessionid=[$0], servertime=[$1],
>> ajaxUrl=[$2])
>> 01-04SelectionVectorRemover
>> 01-05  Filter(condition=[AND(=($3, '2016-05-09'),
>> LIKE($2, '%utm_source%'))])
>> 01-06Scan(groupscan=[HiveScan
>> [table=Table(dbName:default, tableName:hive_kafkalogs_daily),
>> columns=[`sessionid`, `servertime`, `ajaxurl`, `log_date`],
>> numPartitions=29, partitions= [Partition(values:[2016-04-11]),
>> Partition(values:[2016-04-12]), Partition(values:[2016-04-13]),
>> Partition(values:[2016-04-14]), Partition(values:[2016-04-15]),
>> Partition(values:[2016-04-16]), Partition(values:[2016-04-17]),
>> Partition(values:[2016-04-18]), Partition(values:[2016-04-19]),
>> Partition(values:[2016-04-20]), Partition(values:[2016-04-21]),
>> Partition(values:[2016-04-22]), Partition(values:[2016-04-23]),
>> Partition(values:[2016-04-24]), Partition(values:[2016-04-25]),
>> Partition(values:[2016-04-26]), Partition(values:[2016-04-27]),
>> Partition(values:[2016-04-28]), Partition(values:[2016-04-29]),
>> Partition(values:[2016-04-30]), Partition(values:[2016-05-01]),
>> Partition(values:[2016-05-02]), Partition(values:[2016-05-03]),
>> Partition(values:[2016-05-04]), Partition(values:[2016-05-05]),
>> Partition(values:[2016-05-06]), Partition(values:[2016-05-07]),
>> Partition(values:[2016-05-08]), Partition(values:[2016-05-09])],
>>
>> inputDirectories=[hdfs://namenode:9000/usr/hive/warehouse/hive_kafkalogs_daily_20160411,
>> hdfs://namenode:9000/usr/hive/warehouse/hive_kafkalogs_daily_20160412,
>> hdfs://namenode:9000/usr/hive/warehouse/hive_kafkalogs_daily_20160413,
>> hdfs://namenode:9000/usr/hive/warehouse/hive_kafkalogs_daily_20160414,
>> hdfs://namenode:9000/usr/hive/warehouse/hive_kafkalogs_daily_20160415,
>> hdfs://namenode:9000/usr/hive/warehouse/hive_kafkalogs_daily_20160416,
>> hdfs://namenode:9000/usr/hive/warehouse/hive_kafkalogs_daily_20160417,
>> hdfs://namenode:9000/usr/hive/warehouse/hive_kafkalogs_daily_20160418,
>> hdfs://namenode:9000/usr/hive/warehouse/hive_kafkalogs_daily_20160419,
>> hdfs://namenode:9

Re: join fail

2016-05-10 Thread Aman Sinha
It's difficult to debug this type of issue over an email thread.  However, two
observations:
1.  The following Scan, which is for the table that is broadcast, shows a
rowcount of 1.3M rows, whereas your original email says the rowcount is 32M
rows.  Can you confirm the correct row count?  Has ANALYZE been run on the
Hive table?  (A sketch follows after observation 2.)

> Scan(groupscan=[HiveScan [table=Table(dbName:default,
> tableName:rankings_huge), columns=[`pageurl`, `pagerank`], numPartitions=0,
> partitions= null, 
> inputDirectories=[hdfs://nameservice1/HiBench/huge/Join/Input/rankings]]])
> : rowType = RecordType(VARCHAR(65535) pageurl, INTEGER pagerank): *rowcount
> = 1364388.0*


2. Why are you even trying to do a broadcast join for this query?  Did the
hash-distributed join plan succeed or not?
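
Regarding the ANALYZE question in observation 1: on the Hive side, statistics
are typically gathered with something like the following (exact syntax depends
on your Hive version):

  ANALYZE TABLE rankings_huge COMPUTE STATISTICS;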

-Aman

On Tue, May 10, 2016 at 8:19 AM, Andries Engelbrecht <
aengelbre...@maprtech.com> wrote:

> See if increasing planner.memory.max_query_memory_per_node helps. It is
> set to 2G by default.
>
> Also see
>
> https://community.mapr.com/thread/10329 <
> https://community.mapr.com/thread/10329>
> https://community.mapr.com/docs/DOC-1497 <
> https://community.mapr.com/docs/DOC-1497>
>
>
> --Andries
>
>
> > On May 10, 2016, at 12:06 AM, leezy  wrote:
> >
> >
> >
> > I am sorry the picture cannot be read. here is the text:
> >
> >
> >
> >
> > | Minor Fragment ID | Host Name | Start | End | Runtime | Max Records |
> Max Batches | Last Update | Last Progress | Peak Memory | State |
> > | 05-00-xx | bigdata2 | 1.529s | 24m46s | 24m45s | 71,555,857 | 21,985 |
> 12:16:02 | 12:16:02 | 7.68GB | CANCELLATION_REQUESTED |
> > | 05-01-xx | bigdata4 | 1.566s | 24m47s | 24m45s | 71,567,860 | 21,990 |
> 12:16:02 | 12:16:02 | 7.69GB | CANCELLATION_REQUESTED |
> > | 05-02-xx | bigdata3 | 1.526s | 22m15s | 22m13s | 71,551,338 | 21,981 |
> 12:13:30 | 12:13:25 | 7.68GB | RUNNING |
> > | 05-03-xx | bigdata2 | 1.530s | 25m16s | 25m15s | 71,565,965 | 21,982 |
> 12:16:32 | 12:16:32 | 7.70GB | CANCELLATION_REQUESTED |
> > | 05-04-xx | bigdata4 | 1.567s | 25m17s | 25m15s | 71,556,216 | 21,988 |
> 12:16:32 | 12:16:32 | 7.70GB | CANCELLATION_REQUESTED |
> > | 05-05-xx | bigdata3 | 1.527s | 22m15s | 22m13s | 71,550,821 | 21,983 |
> 12:13:30 | 12:13:25 | 7.68GB | RUNNING |
> > | 05-06-xx | bigdata2 | 1.531s | 25m46s | 25m45s | 71,558,873 | 21,983 |
> 12:17:02 | 12:17:02 | 7.70GB | CANCELLATION_REQUESTED |
> > | 05-07-xx | bigdata4 | 1.568s | 25m47s | 25m45s | 71,558,657 | 21,982 |
> 12:17:02 | 12:17:02 | 7.67GB | CANCELLATION_REQUESTED |
> > | 05-08-xx | bigdata3 | 1.528s | 22m15s | 22m13s | 71,558,109 | 21,989 |
> 12:13:30 | 12:13:25 | 7.69GB | RUNNING |
> > | 05-09-xx | bigdata2 | 1.532s | 26m17s | 26m15s | 71,558,226 | 21,983 |
> 12:17:32 | 12:17:32 | 7.68GB | CANCELLATION_REQUESTED |
> > | 05-10-xx | bigdata4 | 1.568s | 26m17s | 26m15s | 71,558,359 | 21,980 |
> 12:17:32 | 12:17:32 | 7.68GB | CANCELLATION_REQUESTED |
> > | 05-11-xx | bigdata3 | 1.529s | 22m15s | 22m13s | 71,537,409 | 21,975 |
> 12:13:30 | 12:13:30 | 7.69GB | RUNNING |
> > | 05-12-xx | bigdata2 | 1.533s | 26m47s | 26m45s | 71,037,058 | 21,800 |
> 12:18:02 | 12:18:02 | 7.63GB | CANCELLATION_REQUESTED |
> > | 05-13-xx | bigdata4 | 1.569s | 26m47s | 26m45s | 71,040,788 | 21,805 |
> 12:18:02 | 12:18:02 | 7.64GB | CANCELLATION_REQUESTED |
> > | 05-14-xx | bigdata3 | 1.529s | 22m15s | 22m13s | 71,001,401 | 21,791 |
> 12:13:30 | 12:13:25 | 7.65GB | RUNNING |
> > | 05-15-xx | bigdata2 | 1.533s | 27m17s | 27m15s | 71,007,623 | 21,793 |
> 12:18:32 | 12:18:32 | 7.62GB | CANCELLATION_REQUESTED |
> > | 05-16-xx | bigdata4 | 1.570s | 27m17s | 27m15s | 71,029,560 | 21,793 |
> 12:18:32 | 12:18:32 | 7.63GB | CANCELLATION_REQUESTED |
> > | 05-17-xx | bigdata3 | 1.530s | 22m15s | 22m13s | 71,057,938 | 21,806 |
> 12:13:30 | 12:13:30 | 7.64GB | RUNNING |
> >
> >
> >
> > --
> >
> >
> > thanks for your regards.
> >
> >
> > At 2016-05-10 15:01:14, "leezy"  wrote:
> >
> > here is the failure profiles. And i see the memory is not exceed 55G in
> each node.
> >
> >
> >
> >
> >
> >
> >
> > --
> >
> >
> > thanks for your regards.
> >
> >
> >
> >
> > At 2016-05-10 14:47:10, "leezy"  wrote:
> >> Leon,thank you for your reply, yes, i have set the driver memory to 55G
> and the java heap memory to 8G. And this user case is run successfully in
> the impala that installed on the same cluster. But in drill , the cmd
> always show the follow errors:
> >> Error: RESOURCE ERROR: One or more nodes ran out of memory while
> executing the query.
> >>
> >> Failure allocating buffer.
> >> Fragment 3:6
> >>
> >> [Error Id: d623b6b2-279d-4d24-af4d-0b62554b440c on bigdata2:31010]
> (state=,code=0)
> >>
> >>
> >> i think i miss some configuration.
> >>
> >>
> >>
> >>
> >> --
> >>
> >>
> >> thanks for your regards.
> >>
> >>
> >>
> >>
> >> At 2016-05-10 13:51:12, "Leon Clayton"  wrote:
> >>> did you increase the memory setting for Drill from the default?
> >>>

Re: Partition reading problem (like operator) while using hive partition table in drill

2016-05-10 Thread Aman Sinha
This is supposed to work, especially since the LIKE predicate is not even on
the partitioning column (it should work either way).  I did a quick test
with file-system tables and it works for LIKE conditions.  I am not sure yet
about Hive tables.  Could you please file a JIRA and we'll follow up?
Thanks.

-Aman

On Tue, May 10, 2016 at 1:09 AM, Shankar Mane 
wrote:

> Problem:
>
> 1. In drill, we are using hive partition table. But explain plan (same
> query) for like and = operator differs and used all partitions in case of
> like operator.
> 2. If you see below drill explain plans: Like operator uses *all*
> partitions where
> = operator uses *only* partition filtered by log_date condition.
>
> FYI- We are storing our logs in hive partition table (parquet,
> gz-compressed). Each partition is having ~15 GB data. Below is the describe
> statement output from hive:
>
>
> / Hive
>
> /
> hive> desc hive_kafkalogs_daily ;
> OK
> col_name data_type comment
> sessionid   string
> ajaxurl string
>
> log_date string
>
> # Partition Information
> # col_name data_type   comment
>
> log_date string
>
>
>
>
> /* Drill
> Plan (query with LIKE)
>
> ***/
>
> explain plan for select sessionid, servertime, ajaxUrl from
> hive.hive_kafkalogs_daily where log_date = '2016-05-09' and ajaxUrl like
> '%utm_source%' limit 1 ;
>
> +--+--+
> | text | json |
> +--+--+
> | 00-00Screen
> 00-01  Project(sessionid=[$0], servertime=[$1], ajaxUrl=[$2])
> 00-02SelectionVectorRemover
> 00-03  Limit(fetch=[1])
> 00-04UnionExchange
> 01-01  SelectionVectorRemover
> 01-02Limit(fetch=[1])
> 01-03  Project(sessionid=[$0], servertime=[$1],
> ajaxUrl=[$2])
> 01-04SelectionVectorRemover
> 01-05  Filter(condition=[AND(=($3, '2016-05-09'),
> LIKE($2, '%utm_source%'))])
> 01-06Scan(groupscan=[HiveScan
> [table=Table(dbName:default, tableName:hive_kafkalogs_daily),
> columns=[`sessionid`, `servertime`, `ajaxurl`, `log_date`],
> numPartitions=29, partitions= [Partition(values:[2016-04-11]),
> Partition(values:[2016-04-12]), Partition(values:[2016-04-13]),
> Partition(values:[2016-04-14]), Partition(values:[2016-04-15]),
> Partition(values:[2016-04-16]), Partition(values:[2016-04-17]),
> Partition(values:[2016-04-18]), Partition(values:[2016-04-19]),
> Partition(values:[2016-04-20]), Partition(values:[2016-04-21]),
> Partition(values:[2016-04-22]), Partition(values:[2016-04-23]),
> Partition(values:[2016-04-24]), Partition(values:[2016-04-25]),
> Partition(values:[2016-04-26]), Partition(values:[2016-04-27]),
> Partition(values:[2016-04-28]), Partition(values:[2016-04-29]),
> Partition(values:[2016-04-30]), Partition(values:[2016-05-01]),
> Partition(values:[2016-05-02]), Partition(values:[2016-05-03]),
> Partition(values:[2016-05-04]), Partition(values:[2016-05-05]),
> Partition(values:[2016-05-06]), Partition(values:[2016-05-07]),
> Partition(values:[2016-05-08]), Partition(values:[2016-05-09])],
>
> inputDirectories=[hdfs://namenode:9000/usr/hive/warehouse/hive_kafkalogs_daily_20160411,
> hdfs://namenode:9000/usr/hive/warehouse/hive_kafkalogs_daily_20160412,
> hdfs://namenode:9000/usr/hive/warehouse/hive_kafkalogs_daily_20160413,
> hdfs://namenode:9000/usr/hive/warehouse/hive_kafkalogs_daily_20160414,
> hdfs://namenode:9000/usr/hive/warehouse/hive_kafkalogs_daily_20160415,
> hdfs://namenode:9000/usr/hive/warehouse/hive_kafkalogs_daily_20160416,
> hdfs://namenode:9000/usr/hive/warehouse/hive_kafkalogs_daily_20160417,
> hdfs://namenode:9000/usr/hive/warehouse/hive_kafkalogs_daily_20160418,
> hdfs://namenode:9000/usr/hive/warehouse/hive_kafkalogs_daily_20160419,
> hdfs://namenode:9000/usr/hive/warehouse/hive_kafkalogs_daily_20160420,
> hdfs://namenode:9000/usr/hive/warehouse/hive_kafkalogs_daily_20160421,
> hdfs://namenode:9000/usr/hive/warehouse/hive_kafkalogs_daily_20160422,
> hdfs://namenode:9000/usr/hive/warehouse/hive_kafkalogs_daily_20160423,
> hdfs://namenode:9000/usr/hive/warehouse/hive_kafkalogs_daily_20160424,
> hdfs://namenode:9000/usr/hive/warehouse/hive_kafkalogs_daily_20160425,
> hdfs://namenode:9000/usr/hive/warehouse/hive_kafkalogs_daily_20160426,
> hdfs://namenode:9000/usr/hive/warehouse/hive_kafkalogs_daily_20160427,
> hdfs://namenode:9000/usr/hive/warehouse/hive_kafkalogs_daily_20160428,
> hdfs://namenode:9000/usr/hive/warehouse/hive_kafkalogs_daily_20160429,
> hdfs://namenode:9000/usr/hive/warehouse/hive_kafkalogs_daily_20160430,
> hdfs://namenode:9000/usr/hive/warehouse/hive_kafkalogs_daily_20160501,
> 

Re: Performance querying a single column out of a parquet file

2016-04-11 Thread Aman Sinha
There is a JIRA related to one aspect of this: DRILL-1950 (filter pushdown
into the parquet scan).  This is still a work in progress, I believe.  Once it
is implemented, the scan will produce only the filtered rows.

Regarding column projections: currently in Drill, the columns referenced
anywhere in the query (including the SELECT list) need to be produced by the
table scan, so the scan will read all of those columns, not just the ones in
the filter condition.   You can see which columns are being produced by the
Scan node from the EXPLAIN plan.
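
For example, using the queries from this thread:

  EXPLAIN PLAN FOR SELECT * FROM dfs.`myParquetFile` WHERE `id` = 23;
  EXPLAIN PLAN FOR SELECT id FROM dfs.`myParquetFile` WHERE `id` = 23;

Comparing the columns=[...] list on the Scan node of the two plans shows why
the single-column variant reads so much less data.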

What would help for the SELECT * case is late materialization of columns:
i.e., even if the filter does not get pushed down into the scan, we could read
only the 'id' column from the table first, do the filtering that supposedly
selects 1 row, and then do a late materialization of all other columns just for
that 1 row by using a row-id based lookup (if the underlying storage format
supports such a lookup).   This would be a feature request; I am not
sure whether a JIRA already exists for it.

-Aman

On Mon, Apr 11, 2016 at 9:24 AM, Ted Dunning  wrote:

> I just replicated these results. Full table scans with aggregation take
> pretty much exactly the same amount of time with or without filtering.
>
>
>
> On Mon, Apr 11, 2016 at 8:09 AM, Johannes Zillmann <
> jzillm...@googlemail.com
> > wrote:
>
> > Hey Ted,
> >
> > Sorry i mixed up row and column!
> >
> > Queries are like that:
> > (1) "SELECT * FROM dfs.`myParquetFile` WHERE `id` = 23"
> > (2) "SELECT id FROM dfs.`myParquetFile` WHERE `id` = 23"
> >
> > (1) is 14 sec and (2) is 1.5 sec.
> > Using drill-1.6.
> > So it looks like Drill is extracting the columns before filtering which i
> > didn’t expect…
> > Is there anyway to change that behaviour ?
> >
> > Johannes
> >
> >
> >
> > > On 11 Apr 2016, at 16:42, Ted Dunning  wrote:
> > >
> > > Did you mean that you are doing a select to find a single column? What
> > you
> > > typed was row, but that seems out of line with the rest of what you
> > wrote.
> > >
> > > If you are truly asking about filtering down to a single row, whether
> it
> > > costs more to return all of the columns rather than just one from a
> > single
> > > row will depend on whether Drill is extracting columns before filtering
> > or
> > > after.
> > >
> > >
> > >
> > > On Mon, Apr 11, 2016 at 6:41 AM, Johannes Zillmann <
> > jzillm...@googlemail.com
> > >> wrote:
> > >
> > >> Hey there,
> > >>
> > >> i currently doing some performance measurements on Drill.
> > >> In my case its a single parquet file with a single local Drill Bit.
> > >>
> > >> Now in one case i have unexpected results and i’m curious if somebody
> > has
> > >> a good explanation for it!
> > >>
> > >> So i have a file with 10 mio rows with 9 columns .
> > >> Now i’m doing a select statement to find one single row.
> > >> Runtime with select * : ~ 14.79 s
> > >> Runtime with select(filterField) : ~ 1.5 sec
> > >>
> > >> So i’m surprised that there is so much variance depending on the
> fields
> > i
> > >> select, since i thought Drill needs most time for finding that one
> > element,
> > >> and then deserialize the other fields only on a hit…
> > >> But for deserialising 8 more hits 10 sec seem way to much!?!?!?
> > >>
> > >> best
> > >> Johannes
> > >>
> > >>
> >
> >
>


Re: Query Planning and Directory Pruning

2016-02-09 Thread Aman Sinha
At a glance, John's query does not have a WHERE clause; it is querying the
subdirectory directly in the FROM clause, so Drill will only look
at the files within that subdirectory.  Directory pruning only comes into
the picture when there is a WHERE condition on dir0, dir1, etc., as in the
example below.
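
For example (dir0 refers to the first subdirectory level under the queried
workspace path; the date value is from John's thread):

  SELECT COUNT(1) FROM dfs.`data` WHERE dir0 = '2016-02-03';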


On Tue, Feb 9, 2016 at 8:25 AM, Neeraja Rentachintala <
nrentachint...@maprtech.com> wrote:

> Yes, Drill-3759 covers it.
> This is a high priority enhancement that we are trying to get to in the
> next couple of releases.
>
> -Neeraja
>
> On Tue, Feb 9, 2016 at 7:32 AM, John Omernik  wrote:
>
> > This one seems to cover it:
> >
> > https://issues.apache.org/jira/browse/DRILL-3759
> >
> >
> >
> > On Tue, Feb 9, 2016 at 9:25 AM, Abdel Hakim Deneche <
> adene...@maprtech.com
> > >
> > wrote:
> >
> > > Hi John,
> > >
> > > Sorry I didn't get back to you (I thought I did).
> > >
> > > No, I don't need the plan, I just wanted to confirm what was taking
> most
> > of
> > > the time and you already confirmed it's the planning.
> > >
> > > Can you open a JIRA for this ? this may be a known issue, but I'm not
> > sure.
> > >
> > > Thanks
> > >
> > > On Tue, Feb 9, 2016 at 6:08 AM, John Omernik  wrote:
> > >
> > > > Abdel, do you still need the plans, as I said, if your table has any
> > > decent
> > > > amount of directories and files, it looks like the planning is
> touching
> > > all
> > > > the directories even though you are pruning.  I can post plans,
> > however,
> > > I
> > > > think in this case you'll find they are exactly the same, and the
> only
> > > > difference is that the longer queries is planning much more because
> it
> > > has
> > > > more files to read.
> > > >
> > > >
> > > > On Thu, Feb 4, 2016 at 10:46 AM, John Omernik 
> > wrote:
> > > >
> > > > > I can package up both plans for you if you need them (let me know
> if
> > > you
> > > > > still want them) but I can tell you the plans were EXACTLY the
> same,
> > > > > however the data-sum table took 0.932 seconds to plan the query,
> and
> > > the
> > > > > data table (the one with the all the extra data) took 11.379
> seconds
> > to
> > > > > plan the query. Indicating to me the issue isn't in the plan that
> was
> > > > > created, but the actual planning process. (Let me know if you
> > disagree
> > > or
> > > > > still need to see the plan, like I said, the actual plans were
> > exactly
> > > > the
> > > > > same)
> > > > >
> > > > >
> > > > > John.
> > > > >
> > > > >
> > > > > On Thu, Feb 4, 2016 at 10:31 AM, Abdel Hakim Deneche <
> > > > > adene...@maprtech.com> wrote:
> > > > >
> > > > >> Hey John, can you try an explain plan for both queries and see how
> > > much
> > > > >> times it takes ?
> > > > >>
> > > > >> for example, for the first query you would run:
> > > > >>
> > > > >> *explain plan for* select count(1) from `data/2016-02-03`;
> > > > >>
> > > > >> It can also be helpful if you could share the query profiles for
> > both
> > > > >> queries.
> > > > >>
> > > > >> Thanks
> > > > >>
> > > > >> On Thu, Feb 4, 2016 at 8:15 AM, John Omernik 
> > > wrote:
> > > > >>
> > > > >> > Hey all, I think am I seeing an issue related to
> > > > >> > https://issues.apache.org/jira/browse/DRILL-3759 but I want to
> > > > >> describe it
> > > > >> > out here, see if it's really the case, and then determine what
> the
> > > > >> blockers
> > > > >> > may be to resolution.
> > > > >> >
> > > > >> > I am using the MapR Developer Release 1.4, and I have a
> directory
> > > with
> > > > >> > subdirectories by data.
> > > > >> >
> > > > >> > data/2015-01-01
> > > > >> > data/2015-01-02
> > > > >> > data/2015-01-03
> > > > >> >
> > > > >> > These are stored as Parquet files.  At this point Each data
> > averages
> > > > >> about
> > > > >> > 1 GB of data, and has roughly 75 parquet files in it.
> > > > >> >
> > > > >> > When I run
> > > > >> >
> > > > >> > select count(1) from `data/2016-02-03` it takes roughly 11
> > seconds.
> > > > >> >
> > > > >> > If I copy the 2016-02-03 directory to a new base (date-sum) and
> > run
> > > > >> >
> > > > >> > select count(1) from `data_sum/2016-02-03` it runs in 0.874
> > seconds.
> > > > >> >
> > > > >> > Same data, same structure, only difference is the data_sum
> > directory
> > > > >> only
> > > > >> > has a few directories, iand data has dates going back to Nov
> 2015.
> > > It
> > > > >> > seems like it is getting files name for all files in each
> > directory
> > > > >> prior
> > > > >> > to pruning which seems to me to be adding a lot of latency to
> > > queries
> > > > >> that
> > > > >> > doesn't need to be there.  (thus I think I am seeing 3759) but I
> > > > wanted
> > > > >> to
> > > > >> > confirm, and then I wanted to see how we can address this in
> that
> > > the
> > > > >> > directory prune should be fast, and on large data sets its just
> > > going
> > > > to
> > > > >> > get worse and worse.
> > > > >> >
> > > > >> >
> > > > >> >
> > 

Re: Efficient joins in Drill - avoiding the massive overhead of scan based joins

2016-01-16 Thread Aman Sinha
DRILL-3929 covers the related discussion regarding secondary indexing.


Re: issue with where clause

2015-12-22 Thread Aman Sinha
Nirav,
we would need some more details, but this looks like a bug ... could you please
create a JIRA along with the stack trace for this error? (It should be in
the drillbit.log file.)   Alternatively, set the following
in your sqlline session:
   alter session set  `exec.errors.verbose` = true;
then run your query and post the stack trace from there.

Please include some details about the nature of the table:  is it Parquet
data?  How was it created?  Any partitioning columns?

Aman

On Mon, Dec 21, 2015 at 8:43 PM, Nirav Shah 
wrote:

> Can someone help me on this?
>
>
> On Mon, Dec 21, 2015 at 2:19 PM, Nirav Shah 
> wrote:
>
> > Hi,
> >
> >
> > I am put up in strange situation.
> >
> > select count(distinct case when misfmg = 'true' then sessionid end)  nnn
> > from   t196 t  where channelid = 2 and id = '/confirmDrop/btnYes/' and
> > event='click'and mprizetype= 1 and mgametype = 'cfp'  ;
> >
> > is running fine and giving expected results but
> >
> > select count(distinct sessionid) from   t196 t  where channelid = 2 and
> id
> > = '/confirmDrop/btnYes/' and  event='click'and mprizetype= 1 and
> > mgametype = 'cfp' and misfmg is null ;
> >
> > is giving below error.
> >
> > Error: SYSTEM ERROR: StringIndexOutOfBoundsException: String index out of
> > range: -5
> >
> > while I remove this where condition it works fine.
> >
> > select count(distinct sessionid) from   t196 t  where channelid = 2 and
> id
> > = '/confirmDrop/btnYes/' and event='click' and mprizetype = 1 and
> > mgametype = 'cfp' ;
> >
> > Something really strange to me.
> >
> > Thanks in advance.
> >
> > Regards,
> > Nirav
> >
>


Re: Column aliases lost when using Dates + GROUP BY in SQL

2015-12-21 Thread Aman Sinha
The aliases work for me in the following query similar to yours.  However,
I am using latest master branch and running directly through sqlline
command, not through Tableau.  Can you confirm what Drill version you are
using and check if you can repro the behavior through sqlline ?  If so, you
should file a JIRA.

select o_custkey as x, date_add(to_date(o_orderdate), 1) as y from
cp.`tpch/orders.parquet` where o_orderkey < 10 group by o_custkey,
date_add(to_date(o_orderdate), 1) limit 2;
+------+-------------+
|  x   |      y      |
+------+-------------+
| 370  | 1996-01-03  |
| 781  | 1996-12-02  |
+------+-------------+
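
If you do see the $f1 behavior on your version, one workaround (a sketch
only, reusing the table and filter from the report below) is to group
inside a subquery so the outer select re-applies the aliases:

select valid_from_raw, valid_from_to_date
from (
  select valid_from as valid_from_raw,
         date_add(to_date(valid_from), 1) as valid_from_to_date
  from dw.`/directory/parquet-file`
  where valid_from = '135699120'
) s
group by valid_from_raw, valid_from_to_date;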

On Mon, Dec 21, 2015 at 2:18 AM, Petri Lehtonen 
wrote:

> Hi all,
>
> I'm using Tableau with MapR Drill ODBC drivers.  I have successfully set
> up the drivers and am able to query the data from our source which is
> basically parquet files. All seems well as long as I'm using all rows from
> the source but when I try to limit the size of the results by grouping and
> rolling up the data I came across something that feels like a bug:
>
> When I use any functions that return dates and add a group by to my query,
> the column alias in the underlying sql won't show up in the resultset.
>
> Plain select from the source works:
>
> SELECT
>   valid_from as valid_from_raw,
>   date_add(to_date(valid_from),1)as valid_from_to_date
> FROM
>   dw.`/directory/parquet-file`
> WHERE
>   valid_from = '135699120'
>
> ---------------+--------------------
> valid_from_raw | valid_from_to_date
> ---------------+--------------------
> 135699120      | 2013-01-01
> ---------------+--------------------
>
> When you add GROUP BY to the SQL, Drill loses the original column alias
> and replaces it with its internal variable or something similar ($f1..n):
>
> SELECT
>   valid_from as valid_from_raw,
>   date_add(to_date(valid_from),1) as valid_from_to_date
> FROM
>   dw.`/directory/parquet-file`
> WHERE
>   valid_from = '135699120'
> GROUP BY
>   valid_from,
>   date_add(to_date(valid_from),1)
>
> ---------------+-------------
> valid_from_raw | $f1
> ---------------+-------------
> 135699120      | 2013-01-01
> ---------------+-------------
>
> And naturally this creates a problem with Tableau which is mapping the
> column aliases from the initial SQL and fails to find the corresponding
> aliases from the resultset. Only way to get this working is to use $f0..n
> column aliases in the original query and rename them in Tableau...
> Something that one definitely shouldn't do. I'd appreciate if anyone has a
> solution or a more solid workaround.
>
> Thanks,
> Petri
>
>


Re: Drill and Parquet - Best practices - part 1

2015-11-02 Thread Aman Sinha
> >- If I have multiple files containing a day's worth of logging, in
> >chronological order, will all the irrelevant files be ignored when
> >looking for a date or a date range?
> >- AKA - Will the min-max headers in Parquet be used to prevent
> >scanning of data outside the range?

If these files were created using CTAS auto-partitioning, Drill will apply
partition pruning to eliminate files that are not needed.  If they were not
created this way, currently the files are not eliminated...this is
something that should be addressed by DRILL-1950; this has been on the list
of important JIRAs and hopefully will get addressed in one of the upcoming
releases.
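
For reference, the CTAS auto-partitioning mentioned above looks roughly
like this (a sketch; the table name, path and partitioning column are made
up for illustration):

create table dfs.tmp.`events_by_day`
partition by (log_date)
as select log_date, col1, col2 from dfs.`/raw/events`;

Files written this way carry the single-value-per-file metadata that the
partition pruning relies on.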

> >- Is there anything I need to do to make sure that the write
> >optimizations in Parquet are used?
> >- dictionaries for low cardinality fields

Note that the default value of store.parquet.enable_dictionary_encoding is
currently false.   There were some issues with dictionary encoding in the
past with certain data types such as date;  I thought they were fixed...we
should discuss on the Drill dev list whether this can be enabled
(there is substantial testing effort to make sure the encoding works for
all data types).
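
If you want to experiment with it on your own data in the meantime, the
option is session-scoped, so it only affects your connection (a sketch):

alter session set `store.parquet.enable_dictionary_encoding` = true;
select * from sys.options where name like '%dictionary%';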

Aman




On Sun, Nov 1, 2015 at 8:25 AM, John Omernik  wrote:

> I've read through your post and had similar thoughts on trying to get
> together information around Parquet files.  I feel like it would be really
> helpful to have a section of the Drill User Docs dedicated to user
> stories on Parquet files.  I know stories sound odd to put into
> documentation, but I think that the challenge of explaining optimization of
> something like Parquet is you can either do it from a dry academic point of
> view, which can be hard for the user base to really understand, or you can
> try to provide lots of stories that could be annotated by devs or improved
> with links to other stories.
>
> What I mean by stories is example data and how it it queried, and why it
> was stored (with partitions based on directories, options for "loading"
> data into directories, using partitions within the files, how Parquet
> optimizes so folks know where to put extra effort into typing etc.)
>
> As to your specific questions, I can't myself answer them,I've wondered
> about some myself, but haven't gotten to asking them. My experiences with
> Parquet have been generally positive, but have involved a good amount of
> trial and error (as you can see from some of my user posts) (also, the user
> group has been great, but to my point about user stories, my education has
> come from posting stories and getting feedback from the community, it would
> be neat to see this as a first-class part of documentation, as I think it
> could help folks with Parquet, Drill and optimizing their environment.)
>
> Wish I could be of more help beyond +1 :)
>
>
>
> On Sun, Nov 1, 2015 at 1:48 AM, Stefán Baxter 
> wrote:
>
> > So we are off to a flying start :)
> >
> > On Thu, Oct 29, 2015 at 9:50 PM, Stefán Baxter <
> ste...@activitystream.com>
> > wrote:
> >
> > > Hi,
> > >
> > > We are using Avro, JSON and Parquet for collecting various types of
> > > data for analytical processing.
> > >
> > > I have not used Parquet before we started playing around with Drill,
> > > and now I'm wondering if we are planning our data structures correctly
> > > and if we will be able to get the most out of Drill+Parquet.
> > >
> > > I have some questions and I hope the answers can be turned into a Best
> > > Practices document.
> > >
> > > So here we go:
> > >
> > >- Are there any rules that we must abide by to make scanning of
> > >"low-cardinality" columns as effective as they can be?
> > >- I understand it so that the Parquet dictionary is scanned for the
> > >value(s) and if they are not in the dictionary that the section is
> > ignored
> > >
> > >- Can dictionary based scanning (as described above) work on arrays?
> > >- like: {"some":"simple","tags":["blue","green","yellow"]}
> > >
> > >- If I have multiple files containing a day's worth of logging, in
> > >chronological order, will all the irrelevant files be ignored when
> > >looking for a date or a date range?
> > >- AKA - Will the min-max headers in Parquet be used to prevent
> > >scanning of data outside the range?
> > >
> > >- Is there anything I need to do to make sure that the write
> > >optimizations in Parquet are used?
> > >- dictionaries for low cardinality fields
> > >- "number folding" for numerical sequences
> > >- compression etc.
> > >
> > >- Are there any Parquet features that are not available in Drill?
> > >- I know Drill is using a fork of Parquet and I wonder if any major
> > >improvements in parquet are unavailable
> > >
> > >- Storing Dates with timezone information (stored in two separate
> > >fields?)
> > >- What is the common 

Re: directory structure containing multiple file types

2015-10-19 Thread Aman Sinha
With regard to the last comment on directory based pruning, please watch
DRILL-3759 (https://issues.apache.org/jira/browse/DRILL-3759).   I don't
have a timeline for it yet but hopefully
in the next Drill release.

Aman

On Mon, Oct 19, 2015 at 3:50 AM, Dhruv Gohil 
wrote:

> "What's needed in Drill to truly eliminate ETL" +1 but in another thread
> ;-)
> few 'hacks' we want to share there of our 'work rounds' related to
> various drill limitations on multi directory queries (99% of our workload)
> , e.g. avoiding empty directory failures, building queries with directory
> pruning that 'works' etc..
>
>
>
> On Monday 19 October 2015 01:47 PM, Stefán Baxter wrote:
>
>> Hi Ted,
>>
>> Your approach only works for a single directory, not a directory
>> structure.
>>
>> I will create an improvement request later today.
>>
>> I would welcome a session on "What's needed in Drill to truly eliminate
>> ETL" (Just an idea)
>>
>> Regards,
>>   -Stefan
>>
>> On Sun, Oct 18, 2015 at 10:30 PM, Stefán Baxter <
>> ste...@activitystream.com>
>> wrote:
>>
thank you Jacques, I will.
>>>
>>> On Sun, Oct 18, 2015 at 10:01 PM, Jacques Nadeau 
>>> wrote:
>>>
Stefan, can you open a JIRA for reading multiple file types in a single
 directory. It isn't the most common case we've run across but is
 definitely
 something that should be addressed.

 --
 Jacques Nadeau
 CTO and Co-Founder, Dremio

 On Sat, Oct 17, 2015 at 10:33 AM, Stefán Baxter <
 ste...@activitystream.com>
 wrote:

 Thanks Abhishek,
>
> I think Drill is still quite far from eliminating ETL and the list of
> obstacles along the way seems to be growing. (yeah, disappointment got
>
 me

> for a bit)
>
> Regards,
>   -Stefan
>
>
>>>
>


Re: Performance issue when setting dir0 & dir1 in where clause

2015-10-13 Thread Aman Sinha
This is related to partition pruning and is being addressed as part of
DRILL-3759 (https://issues.apache.org/jira/browse/DRILL-3759).
Unfortunately, this issue did not make into the 1.2 version but will likely
be available with the next release.   Could you please add your above use
case in the JIRA ?  Thanks.

Aman

On Tue, Oct 13, 2015 at 8:57 AM,  wrote:

> Hi,
>
> I am doing a performance test on a Drill v1.1 query over directories. I
> have the following two queries which are effectively doing the same thing:
>
> SELECT count(*) FROM `stg`.`ws`.`./2015/10/13` where columns[7] like
> '%blah%' and columns[10] = 'xx'
>
> SELECT count(*) FROM `stg`.`ws`.`./2015` where dir0='10' and dir1='13' and
> columns[7] like '%blah%' and columns[10] = 'xx'
> I have noticed that the second query, which includes where dir0='10' and
> dir1='13' in the where clause, takes ~29 seconds to run, however the first
> query takes ~8 seconds to run. The Fragment profiles show the same explain
> plan and very similar operator profiles. The second queries fragment
> profile starts from 22 seconds whereas the first profile starts from 1
> second.
>
> Can this performance issue be caused by the Calcite parser taking  longer
> to understand the where clause?
>
> Thanks,
> Mike
>
>
>


Re: Naming directories

2015-09-07 Thread Aman Sinha
John,
If there are only 162 files in corresponding directories, it is likely the
time may be consumed in regexp_replace().  Did you try without the function
in your view and directly querying the underlying directory structure ?
From what I understand you have a view for the Hive created tables but not
for the Drill created tables.. is that right ?

I did a quick test with TPC-DS data where the store_sales table is
partitioned into 301 directories based on ss_promo_sk column.   I did this
using Drill and shell script rather than Hive  - I used Drill CTAS to
create the 301 files and then moved them into respective directories using
a shell script.  I don't have access to Hive right now but I doubt that it
would make a huge difference.

The performance was generally in sub second range:
 0: jdbc:drill:zk=local> select count(1) from `ss_partition/ss3` where
ss_promo_sk > 99;
+--+
|  EXPR$0  |
+--+
| 1842941  |
+--+
1 row selected (0.596 seconds)


On Mon, Sep 7, 2015 at 3:25 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> John,
>
> What happens when you partition the data using drill in the same pattern as
> Hive uses?  Are the Drill produced parquet files still faster?
>
>
>
> On Mon, Sep 7, 2015 at 1:02 PM, John Omernik <j...@omernik.com> wrote:
>
> > So interestingly enough, this particular table doesn't have 4k files in
> it,
> > it's actually pretty small, in that there is only 1 file per partition.
> >  (tiny?)  Thus there are only 162 files vs. the 30 that drill created
> when
> > reprocessing the table. That probably doesn't help either given that this
> > is such small data, the planning takes more time than the query.  It's cool
> > that the team is looking to improve this, I found the ability to just
> have
> > my data in Parquet partitioned by drill to be a huge win as well.  The
> > enhancements sound like they will enhance this even more, I would love to
> > see as close to native drill loaded parquet performance as possible with
> > Hive loaded tables, that would allow us to use drill to query, and hive
> to
> > load. (Using complex transforms, longer running queries etc).
> >
> > I love drill :)
> >
> >
> >
> > On Mon, Sep 7, 2015 at 2:23 PM, Aman Sinha <amansi...@apache.org> wrote:
> >
> > > Hi John,
> > > the partition pruning *planning* time is indeed a function of the
> number
> > of
> > > files in the table. The execution time is only dependent on the number
> of
> > > files in the specified partition.   In the Drill loaded Parquet files
> you
> > > had 30 files whereas in the Hive loaded parquet files you probably had
> > 162
> > > directories x 24 hours = about 4000 files  ? or somewhere in that
> > range...
> > >
> > > During the query planning phase, Drill partition pruning will load the
> > full
> > > pathnames of the files in memory, including materializing the
> > partitioning
> > > columns such as 'day'  into memory and apply the `day` >= '2015-01-01`
> > > filter.  It turns out this process is expensive when there are lots of
> > > files even if they are spread out over multiple directories.   I
> believe
> > > there's an enhancement JIRA to make this process efficient by loading
> > only
> > > directory names first and then the files...if not, I will create a
> JIRA.
> > >
> > > Note that partition pruning is still a huge win for more complex
> queries
> > > when the total execution time is substantially longer than the planning
> > > time.  It is only for shorter running queries against large number of
> > files
> > > where the planning times becomes more dominant.  There is ongoing
> effort
> > to
> > > improve that.
> > >
> > > Aman
> > >
> > >
> > > On Mon, Sep 7, 2015 at 10:15 AM, John Omernik <j...@omernik.com>
> wrote:
> > >
> > > > As a follow-up to Jacques email, I did some testing with Parquet
> files
> > > > created and partitioned by Apache Hive. (Not using the metastore to
> > read
> > > > these files, just using the directories and reading the Parquet files
> > > > directly).
> > > >
> > > > Consider that Hive's partition scheme makes directories that have
> > > > partitionfield=partitionvalue as the directory name like this:
> > > >
> > > >
> > > > table
> > > > ---day=2015-09-06
> > > > ---hour=00
> > > > ---hour=01
> > > > ---hour=02
> > > > ---d

Re: query plan ....

2015-08-24 Thread Aman Sinha
Indeed, it is not efficient. We are doing 16 invocations of
CONVERT_FROMUTF8($1)  and 16 invocations of CONVERT_FROMUTF8($2).
Can you please file a JIRA?  We should ideally be doing projection pushdown
in conjunction with the filter pushdown into the HBase scan and
computing these functions only once.
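
For illustration, one rewrite that evaluates each conversion once is to
project the converted columns in a subquery (a sketch only -- the column
names come from the plan's rowType, the table name is made up, and the
planner may still merge the filter back into the projection):

select count(*)
from (
  select cast(t.`year` as integer)   as yr,
         convert_from(t.DX1, 'UTF8') as dx1,
         convert_from(t.DX2, 'UTF8') as dx2
  from hbase.`claims` t
) s
where s.yr >= 2009 and s.yr <= 2013
  and (s.dx1 in ('39891', '4280', '4281') or s.dx2 in ('39891', '4280', '4281'));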

Aman


On Mon, Aug 24, 2015 at 1:34 PM, Sungwook Yoon sy...@maprtech.com wrote:

 Hi,

 I have a query, doing something like

 a in (v1, v2, v3, ..., v15)

 The physical query plan looks like the following.

 Filter(condition=[AND(>=(CAST($0):INTEGER, 2009), <=(CAST($0):INTEGER,
 2013), OR(=(CONVERT_FROMUTF8($1), '39891'), =(CONVERT_FROMUTF8($1),
 '4280'), =(CONVERT_FROMUTF8($1), '4281'), =(CONVERT_FROMUTF8($1), '42820'),
 =(CONVERT_FROMUTF8($1), '42821'), =(CONVERT_FROMUTF8($1), '42822'),
 =(CONVERT_FROMUTF8($1), '42823'), =(CONVERT_FROMUTF8($1), '42830'),
 =(CONVERT_FROMUTF8($1), '42831'), =(CONVERT_FROMUTF8($1), '42832'),
 =(CONVERT_FROMUTF8($1), '42833'), =(CONVERT_FROMUTF8($1), '42840'),
 =(CONVERT_FROMUTF8($1), '42841'), =(CONVERT_FROMUTF8($1), '42842'),
 =(CONVERT_FROMUTF8($1), '42843'), =(CONVERT_FROMUTF8($1), '4289'),
 =(CONVERT_FROMUTF8($2), '39891'), =(CONVERT_FROMUTF8($2), '4280'),
 =(CONVERT_FROMUTF8($2), '4281'), =(CONVERT_FROMUTF8($2), '42820'),
 =(CONVERT_FROMUTF8($2), '42821'), =(CONVERT_FROMUTF8($2), '42822'),
 =(CONVERT_FROMUTF8($2), '42823'), =(CONVERT_FROMUTF8($2), '42830'),
 =(CONVERT_FROMUTF8($2), '42831'), =(CONVERT_FROMUTF8($2), '42832'),
 =(CONVERT_FROMUTF8($2), '42833'), =(CONVERT_FROMUTF8($2), '42840'),
 =(CONVERT_FROMUTF8($2), '42841'), =(CONVERT_FROMUTF8($2), '42842'),
 =(CONVERT_FROMUTF8($2), '42843'), =(CONVERT_FROMUTF8($2), '4289')))]) :
 rowType = RecordType(ANY year, ANY DX1, ANY DX2): rowcount =
 3.300738791875E8, cumulative cost = {1.0562364134E10 rows,
 5.413211618675E10 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 7136


 In this plan, does Drill convert the same column to string multiple
 times, once for each value it is comparing against?

 From the performance, it looks like it is doing that ...

 Sungwook



Re: Drill dir0 issue

2015-08-23 Thread Aman Sinha
Sungwook, do you have the latest master build which has the fix for Hive
partition pruning (DRILL-3121) ?

On Sun, Aug 23, 2015 at 12:15 PM, Sungwook Yoon sy...@maprtech.com wrote:

 Will do,

 Thanks,

 Sungwook


 On Sun, Aug 23, 2015 at 2:14 PM, Jacques Nadeau jacq...@dremio.com
 wrote:

  It sounds like a bug. Can you file a jira?
 
  --
  Jacques Nadeau
  CTO and Co-Founder, Dremio
 
  On Sun, Aug 23, 2015 at 12:13 PM, Sungwook Yoon sy...@maprtech.com
  wrote:
 
   Hi Jacques,
  
   This works well, no problem accessing the partitioned dirs.
   (and it is actually quite a bit faster than accessing from one level above)
  
   Just the issues I asked about, when I access from the
   /user/hive/warehouse/table, it somehow does not recover every dir0.
  
   Sungwook
  
  
   On Sun, Aug 23, 2015 at 2:02 PM, Jacques Nadeau jacq...@dremio.com
   wrote:
  
I think Hsuan misunderstood your question.
   
Can you let us know what you get if you query:
   
select * from dfs.root.`/user/hive/warehouse/table/year=2012`
   
   
   
   
   
   
   
--
Jacques Nadeau
CTO and Co-Founder, Dremio
   
On Sun, Aug 23, 2015 at 7:07 AM, Sungwook Yoon sy...@maprtech.com
   wrote:
   
 Hi,

 I am trying to use Hive parquet stored files partitioned by some
   column.
 So, the directory structure is partitioned with the column.

 The column is actually year.
 Let's say there are 5 years, so dir0 are like year=2010,
 year=2011,year=2012,year=2013,year=2014

 We did like following
 select * from dfs.root.`/user/hive/warehouse/table` d where d.dir0
 =
 'year=2012';

 I get nothing.
 Apparently, there are parquet files in the directory though.

 Sometimes it picks up e.g., year=2010, but not year=2012..

 Where am I going wrong with this?

 Thanks,

 Sungwook

   
  
 



Re: Help with optimizing a query

2015-08-10 Thread Aman Sinha
Parquet files of size 20GB seem too large... could you split it up into
smaller chunks ?   Drill's  Scan parallelism for Parquet is at the
granularity of files, so if you have several 256MB files you will get very
good parallelism as opposed to fewer large files.   The tradeoff is that
planning time would increase slightly since Drill has to read metadata from
many files.  However, overall the performance should improve.
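
One way to rewrite a large file into smaller chunks is a CTAS with a
smaller Parquet block size (a sketch; the paths are made up, and 268435456
bytes = 256MB -- the writer rolls over to a new file when the block size
is reached):

alter session set `store.parquet.block-size` = 268435456;
create table dfs.tmp.`large_table_split` as
select * from dfs.`/data/large_table`;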
For more detailed analysis, someone would have to look at the query
profile..

Aman

On Mon, Aug 10, 2015 at 3:53 PM, Sudheesh Katkam skat...@maprtech.com
wrote:

 Hi Yousef,

 If possible, could you put the profile in a publicly accessible location
 (Dropbox, etc) and post the link here?

 Thank you,
 Sudheesh

  On Aug 10, 2015, at 3:43 PM, Yousef Lasi yousef.l...@gmail.com wrote:
 
  We're running a 4 file join on a set of parquet files, the largest of
 which is about 20 GB in size. The query plan seems to indicate that most,
 if not all, of the time for the query (30 minutes) is spent on the first two
 major fragments. The physical plan looks like the output for these 2
 fragments below. I am not sure how to best interpret the results to
 optimize the query. It's pretty clear based on the plan output as well as
 the actual system resources utilized during execution that we are not CPU,
 Memory or I/O bound.  That doesn't leave a whole lot left to chase down.
 Any suggestions on where to look?
 
 
 00-00    Screen : rowType = RecordType(ANY ACTION, ANY
 ID_TRAN_COLL, ANY ID_ACC, ANY ID_IMNT_STD, ANY CD_TYP_PSN, ANY DT_ACQS, ANY
 CD_TX_ACQS, ANY DMT_FED_COST_ACQ_PSN, ANY QY_SETT, ANY
 DMT_FED_COST_SETT_LCL, ANY DMT_UNIT_SETT_PSN, ANY QY_TOT, ANY
 DMT_FED_COST_TOT_CLNT, ANY DMT_FED_COST_TOT_LCL, ANY DMT_UNIT_TOT_PSN, ANY
 TRD_FEDERAL_TAX_COST_LOCL, ANY SET_FEDERAL_TAX_COST_LOCL, ANY
 DMT_TRD_GIFT_DATE, ANY DMT_TRD_FMKT_VAL_LOCL, ANY DMT_TRD_FMKT_VAL_BASE,
 ANY DMT_SET_GIFT_DATE, ANY DMT_SET_FMKT_VAL_LOCL, ANY
 DMT_SET_FMKT_VAL_BASE, ANY DMT_TRD_LOT_LEVEL_COV_IND, ANY
 DMT_SET_LOT_LEVEL_COV_IND, ANY DMT_TRD_TRANSFER_CODE, ANY
 DMT_SET_TRANSFER_CODE, ANY DMT_TRD_WASH_SALE_ACQ_DATE, ANY
 DMT_SET_WASH_SALE_ACQ_DATE, ANY DMT_TRD_LOT_DISALLOWED_DAYS, ANY
 DMT_SET_LOT_DISALLOWED_DAYS, ANY DMT_TRD_LOT_FMKT_USAGE_IND, ANY
 DMT_SET_LOT_FMKT_USAGE_IND, ANY DMT_SET_UGL_ST, ANY DMT_SET_UGL_ST_LCL, ANY
 DMT_SET_UGL_LT, ANY DMT_SET_UGL_LT_LCL, ANY DMT_TRD_UGL_ST, ANY
 DMT_TRD_UGL_ST_LCL, ANY DMT_TRD_UGL_LT, ANY DMT_TRD_UGL_LT_LCL, ANY
 ID_ACC_ALT, ANY CD_MTH_CSTNG, ANY CD_CCY_TRD_PRIM, ANY RT_SPOT, ANY
 IMNT_CLSS_PRICE_FCTR_RT, ANY ID_IMNT_CL, ANY DMT_TOT_MV_PSN, ANY
 TRD_INT_LOCL, ANY DMT_INCM_EXPCTD, ANY DMT_SETT_MV_PSN, ANY CD_CL_OMNI):
 rowcount = 4.061288430004E7, cumulative cost = {7.33055744629E8
 rows, 1.5711522516974182E10 cpu, 0.0 io, 2.5162863282995203E13 network,
 1.723515700962E10 memory}, id = 117134
 00-01      Project(ACTION=[$0],
 ID_TRAN_COLL=[$1], ID_ACC=[$2], ID_IMNT_STD=[$3], CD_TYP_PSN=[$4],
 DT_ACQS=[$5], CD_TX_ACQS=[$6], DMT_FED_COST_ACQ_PSN=[$7], QY_SETT=[$8],
 DMT_FED_COST_SETT_LCL=[$9], DMT_UNIT_SETT_PSN=[$10], QY_TOT=[$11],
 DMT_FED_COST_TOT_CLNT=[$12], DMT_FED_COST_TOT_LCL=[$13],
 DMT_UNIT_TOT_PSN=[$14], TRD_FEDERAL_TAX_COST_LOCL=[$15],
 SET_FEDERAL_TAX_COST_LOCL=[$16], DMT_TRD_GIFT_DATE=[$17],
 DMT_TRD_FMKT_VAL_LOCL=[$18], DMT_TRD_FMKT_VAL_BASE=[$19],
 DMT_SET_GIFT_DATE=[$20], DMT_SET_FMKT_VAL_LOCL=[$21],
 DMT_SET_FMKT_VAL_BASE=[$22], DMT_TRD_LOT_LEVEL_COV_IND=[$23],
 DMT_SET_LOT_LEVEL_COV_IND=[$24], DMT_TRD_TRANSFER_CODE=[$25],
 DMT_SET_TRANSFER_CODE=[$26], DMT_TRD_WASH_SALE_ACQ_DATE=[$27],
 DMT_SET_WASH_SALE_ACQ_DATE=[$28], DMT_TRD_LOT_DISALLOWED_DAYS=[$29],
 DMT_SET_LOT_DISALLOWED_DAYS=[$30], DMT_TRD_LOT_FMKT_USAGE_IND=[$31],
 DMT_SET_LOT_FMKT_USAGE_IND=[$32], DMT_SET_UGL_ST=[$33],
 DMT_SET_UGL_ST_LCL=[$34], DMT_SET_UGL_LT=[$35], DMT_SET_UGL_LT_LCL=[$36],
 DMT_TRD_UGL_ST=[$37], DMT_TRD_UGL_ST_LCL=[$38], DMT_TRD_UGL_LT=[$39],
 DMT_TRD_UGL_LT_LCL=[$40], ID_ACC_ALT=[$41], CD_MTH_CSTNG=[$42],
 CD_CCY_TRD_PRIM=[$43], RT_SPOT=[$44], IMNT_CLSS_PRICE_FCTR_RT=[$45],
 ID_IMNT_CL=[$46], DMT_TOT_MV_PSN=[$47], TRD_INT_LOCL=[$48],
 DMT_INCM_EXPCTD=[$49], DMT_SETT_MV_PSN=[$50], CD_CL_OMNI=[$51]) : rowType =
 RecordType(ANY ACTION, ANY ID_TRAN_COLL, ANY ID_ACC, ANY ID_IMNT_STD, ANY
 CD_TYP_PSN, ANY DT_ACQS, ANY CD_TX_ACQS, ANY DMT_FED_COST_ACQ_PSN, ANY
 QY_SETT, ANY DMT_FED_COST_SETT_LCL, ANY DMT_UNIT_SETT_PSN, ANY QY_TOT, ANY
 DMT_FED_COST_TOT_CLNT, ANY DMT_FED_COST_TOT_LCL, ANY DMT_UNIT_TOT_PSN, ANY
 TRD_FEDERAL_TAX_COST_LOCL, ANY SET_FEDERAL_TAX_COST_LOCL, ANY
 DMT_TRD_GIFT_DATE, ANY DMT_TRD_FMKT_VAL_LOCL, ANY DMT_TRD_FMKT_VAL_BASE,
 ANY DMT_SET_GIFT_DATE, ANY DMT_SET_FMKT_VAL_LOCL, ANY
 DMT_SET_FMKT_VAL_BASE, ANY DMT_TRD_LOT_LEVEL_COV_IND, ANY
 DMT_SET_LOT_LEVEL_COV_IND, ANY DMT_TRD_TRANSFER_CODE, ANY
 DMT_SET_TRANSFER_CODE, ANY DMT_TRD_WASH_SALE_ACQ_DATE, ANY
 DMT_SET_WASH_SALE_ACQ_DATE, ANY DMT_TRD_LOT_DISALLOWED_DAYS, ANY
 DMT_SET_LOT_DISALLOWED_DAYS, ANY 

Re: Drill Hangout (2015-08-04) minutes

2015-08-04 Thread Aman Sinha
The hangout notes refer to the dot drill file, but I think that may be a
misrepresentation or mis-statement during the hangout.  For the INSERT
discussion, the best source is the separate thread titled '[DISCUSS] Insert
Into Table Support'.  In fact, we are intending to keep the merged schema
in the parquet metadata cache file (which is different from dot drill).
Let's discuss the issues around concurrency (inserts concurrent with reads)
in the other email thread.

Aman

On Tue, Aug 4, 2015 at 7:18 PM, Jacques Nadeau jacq...@dremio.com wrote:

 Quick thought on insert isolation:

 Let's just do a hidden directory and then rename.  We can make Drill avoid
 reading hidden directories.  No fancy work required.

 With regards to Dot Drill, let's not turn this into a mini database.  The
 complexities would be overwhelming.  My recommendation is we constrain to
 additional metadata that cannot otherwise be divined.  Beyond that, we
 should use ephemeral files (similar to the parquet metadata cache where
 deleting doesn't impact logical outcome, may impact planning or
 performance).  I would avoid mixing ephemeral and persistent data around
 the dot drill concept.

 In general, if we want to store Drill's internal ephemeral metadata, lets
 have a discussion around the options.  Also remember that not all Drill
 installations will use a distributed filesystem.  As such, we need to think
 about these abstractions to support multiple types of storage systems.



 --
 Jacques Nadeau
 CTO and Co-Founder, Dremio

 On Tue, Aug 4, 2015 at 2:24 PM, Khurram Faraaz kfar...@maprtech.com
 wrote:

  Drill Hangout 2015-08-04
 
  Attendees: Daniel, Khurram, Neeraja, Vicky, Kris, Aman, Parth, Andries,
  Jinfeng, Anas
 
  - Insert and drop
 
  - read isolation during insert into, Aman suggested snapshot level
 
  - have to have some kind of lock manager
 
  - locking on the dot drill file
 
  - should this locking talk to other external programs working with Drill
  used by Drill?
 
  - Jason - why is this lock necessary?
 
  - we want to merge schemas in a dot drill file, avoid gathering schemas
  from a lot of separate files
 
  - insert feature will be broken into phases
 
  - this needs to handle schema changes to be consistent with the rest
 of
  Drill
 
  - partition pruning is not working for some expressions
 
  - we will only fix for cast
 
  - Jinfeng thinks this should be easy enough
 
  - handling unknown types in parquet or other external systems
 
  - should we fail actively, or should we give data back in varbinary
 
  - should people have to wait for a release to handle new data types
 
  - storage plugin writers should have a clear idea about how to handle
 these
  cases
 
  - Jason will send a message to the list about this
 
 
  - test framework
 
  - Rahul is working on publishing it to a public repository
 
  - this will include instructions on how to set up the tests on your own
  hardware
 



Re: Drill making wrong type decision on comparison in where clause

2015-07-30 Thread Aman Sinha
BTW, there was a discussion earlier today on the dev list regarding a
related issue: see the thread titled:  'Count where or having clause does
not work as expected'.
Drill does need to improve error messaging for such types of queries...this
is a known usability issue and hopefully we can resolve it in the near
future.
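
For completeness, a form of the query that should work -- the expression is
simply repeated in the WHERE clause instead of referencing the alias (a
sketch based on the query in the original report below):

select cast(convert_from(t.app.hcc.event_name, 'UTF8') as varchar(30)) as en
from dfs.`user`.`/logmaster/production/hcc/2015-07-30/*.json` t
where cast(convert_from(t.app.hcc.event_name, 'UTF8') as varchar(30)) in ('logout');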

Aman

On Thu, Jul 30, 2015 at 3:48 PM, Aman Sinha asi...@maprtech.com wrote:

 Hi John,
 you cannot use aliases in the WHERE condition.  Drill is not unique in
 this restriction...since the WHERE condition is evaluated before the alias
 is done in the SELECT clause.   Did you try WHERE t.app.hcc.event_name IN
 ('logout') ?

 Aman

 On Thu, Jul 30, 2015 at 3:42 PM, John Schneider jschnei...@apixio.com
 wrote:

 select t.app.hcc.event_name as en
 from dfs.`user`.`/logmaster/production/hcc/2015-07-30/*.json` t
 where en in ('logout');

 this yields the error:
   Error: SYSTEM ERROR: NumberFormatException: logout

 ok, so let's explicitly cast

 select cast(convert_from(t.app.hcc.event_name, 'UTF8') as varchar(30)) as
 en
 from dfs.`user`.`/logmaster/production/hcc/2015-07-30/*.json` t
 where en in ('logout');

 now, just to humor drill

 select cast(convert_from(t.app.hcc.event_name, 'UTF8') as varchar(30)) as
 en
 from dfs.`user`.`/logmaster/production/hcc/2015-07-30/*.json` t
 where en in ('123');

 runs, but returns no results - as would be expected because we don't use
 #s
 as event names

 Am I misunderstanding how drill types data in a schema-less record?
 I would have thought the explicit cast would have been enough


 P.S. I ran another query like this one on a month's worth of logs (a lot of
 json in HDFS) and it chewed through it in less time than it takes my
 current Hive query to actually start, and all of this on a single aws
 m3.xlarge - this drill sucker is fast, we really want to use it.

 john o schneider
 j...@apixio.com
 408-203-7891





Re: Counting large numbers of unique values

2015-04-07 Thread Aman Sinha
Drill already does most of this type of transformation.  If you do an
'EXPLAIN PLAN FOR your count(distinct) query'
you will see that it first does a grouping on the column and then applies
the COUNT(column).  The first level grouping can be done either based on
sorting or hashing and this is configurable through a system option.
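
In other words, the plan is roughly equivalent to this manual rewrite (a
sketch; t and id are stand-ins for your table and column):

select count(*) from (select id from t group by id) s;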

Aman

On Tue, Apr 7, 2015 at 3:30 AM, Marcin Karpinski mkarpin...@opera.com
wrote:

 Hi Guys,

 I have a specific use case for Drill, in which I'd like to be able to count
 unique values in columns with tens of millions of distinct values. The COUNT
 DISTINCT method, unfortunately, does not scale both time- and memory-wise,
 and the idea is to sort the data beforehand by the values of that column
 (let's call it ID), to have the row groups split at a new ID boundary,
 and to extend Drill with an alternative version of COUNT that would simply
 count the number of times the ID changes throughout the entire table. This
 way, we could expect that counting unique values of pre-sorted columns
 could have complexity comparable to that of the regular COUNT operator (a
 full scan). So, to sum up, I have three questions:

 1. Can such a scenario be realized in Drill?
 2. Can it be done in a modular way (eg, a dedicated UDAF or an operator),
 so without heavy hacking throughout entire Drill?
 3. How to do it?

 Our initial experience with Drill was very good - it's an excellent tool.
 But in order to be able to adopt it, we need to sort out this one central
 issue.

 Cheers,
 Marcin



Re: CSV header issue

2015-04-02 Thread Aman Sinha
Hi Mahesh,
Please see https://issues.apache.org/jira/browse/DRILL-951  for the issue
of CSV headers.  It is a feature that will be addressed in an upcoming
release (currently tagged for 1.0).
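
In the meantime, one workaround (a sketch based on your sample below, where
the header row's first field is the literal string 'date1') is to filter
the header row out before casting:

select sum(cast(columns[6] as int))
from HDFS.`/test.csv`
where columns[0] <> 'date1';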

Aman

On Wed, Apr 1, 2015 at 10:52 PM, Mahesh Sankaran sankarmahes...@gmail.com
wrote:

 Hi ,
  I am currently working in Apache Drill to analyse CSV files. My
 problem is, if the CSV file has headers, we can't do any sum query. It
 shows the following errors.

 0: jdbc:drill:zk=nn01:2181,dn02:2181,dn03:218> select sum(cast(columns[6]
 as int)) from HDFS.`/test.csv` limit 10;
 Query failed: RemoteRpcException: Failure while running fragment., rcvdbyte
 [ 584925d6-dab6-42ce-8eb3-fa7abfb0e0f2 on nn01:31010 ]
 [ 584925d6-dab6-42ce-8eb3-fa7abfb0e0f2 on nn01:31010 ]


 Error: exception while executing query: Failure while executing query.
 (state=,code=0)

 *But the above query is working well without headers. Is there any way to
 sum the columns in CSV files with headers in Apache Drill?*

 *This is our example file:*
 0: jdbc:drill:zk=nn01:2181,dn02:2181,dn03:218> select * from
 HDFS.`/test.csv` limit 10;
 +-------------------------------------------------------------------+------------+
 |                              columns                              |    dir0    |
 +-------------------------------------------------------------------+------------+
 | [date1,time1,srcip,dstip,service,sentbyte,rcvdbyte]               | nn01:9000  |
 | [2015-01-01,00:00:00,10.10.100.74,192.168.0.12,DNS,0,193]         | nn01:9000  |
 | [2015-01-01,00:00:00,10.10.100.74,192.168.0.12,DNS,0,166]         | nn01:9000  |
 | [2015-01-01,00:00:00,10.10.100.74,192.168.0.12,DNS,60,359]        | nn01:9000  |
 | [2015-01-01,00:00:00,10.10.50.195,106.10.193.45,php,717,359,0,0]  | nn01:9000  |
 | [2015-01-01,00:00:00,111.123.180.44,117.239.67.36,9064,0,0]       | nn01:9000  |
 | [2015-01-01,00:00:00,111.123.180.44,117.239.67.37,9064,0,0]       | nn01:9000  |
 | [2015-01-01,00:00:00,111.123.180.44,117.239.67.38,9064,0,0]       | nn01:9000  |
 | [2015-01-01,00:00:00,111.123.180.44,117.239.67.34,9064,0,0]       | nn01:9000  |
 | [2015-01-01,00:00:00,111.123.180.44,117.239.67.44,9064,0,0]       | nn01:9000  |
 +-------------------------------------------------------------------+------------+


 Thanks and Regards,

 Mahesh Sankaran



Re: Directory pruning with Drill

2015-02-04 Thread Aman Sinha
Andries,  the dir0=2015  with or without quotes should do the pruning.
However, if the value is not numeric, such as 'Qtr1' then you need quotes.
Here's my query which shows the pruning is being done:

(there are 3 directories: 1994, 1995, 1996 and query selects one of them
and Q1 within that directory):

0: jdbc:drill:zk=local> explain plan for select * from
dfs.`/Users/asinha/data/multilevel/parquet` where dir0=1995 and dir1='Q1';
+------+------+
| text | json |
+------+------+
| 00-00    Screen
00-01      Project(*=[$0])
00-02        Project(*=[$0])
00-03          Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath
[path=/Users/asinha/data/multilevel/parquet/1995/Q1/orders_95_q1.parquet]],
selectionRoot=/Users/asinha/data/multilevel/parquet, numFiles=1,
columns=[`*`]]])
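
As noted further down in the thread, equality and IN-list predicates are
the filter shapes the pruner currently recognizes, e.g. (a sketch against
the same directory layout):

explain plan for select * from
dfs.`/Users/asinha/data/multilevel/parquet` where dir0 in (1994, 1995);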

On Wed, Feb 4, 2015 at 8:06 AM, Andries Engelbrecht 
aengelbre...@maprtech.com wrote:

 Jason,

 Thanks for the clarification, I added these to the enhancement request.

 —Andries


 On Feb 3, 2015, at 9:47 PM, Jason Altekruse altekruseja...@gmail.com
 wrote:

  Hao,
 
  The dir columns are always added to the records coming out of a scan. The
  issue is with trying to avoid unneeded reads altogether. If you look at
 the
  query plan you should see that the scan is going to read all of the files
  and the filter against the directory column will be applied in a separate
  filter operation later. Currently we only support simple expressions,
  either equality or an in-list to specify partition filters that can be
  pushed into the scan operation.
 
  -Jason
 
  On Tue, Feb 3, 2015 at 8:59 PM, Hao Zhu h...@maprtech.com wrote:
 
  Strange, per my testing, we can do that:
 
  0: jdbc:drill:zk=n1a:5181,n2a:5181,n3a:5181> select * from `hao/2015`
  where dir0=1;
  +------------+------+
  |  columns   | dir0 |
  +------------+------+
  | [1,2,3]    | 1    |
  +------------+------+
  1 row selected (0.098 seconds)
  0: jdbc:drill:zk=n1a:5181,n2a:5181,n3a:5181> select * from `hao/2015`
  where dir0 > 1;
  +------------+------+
  |  columns   | dir0 |
  +------------+------+
  | [1,2,3]    | 3    |
  | [1,2,3]    | 2    |
  +------------+------+
  2 rows selected (0.18 seconds)
 
  Thanks,
  Hao
 
  On Tue, Feb 3, 2015 at 8:27 PM, Tomer Shiran tshi...@gmail.com wrote:
 
  The casting issue seems like a real bug. People want to do things like
  dir0 > 2012
 
  On Tue, Feb 3, 2015 at 6:00 PM, Andries Engelbrecht 
  aengelbre...@maprtech.com wrote:
 
  Thanks.
 
  It will be good for users to understand the specifics of directory
  pruning.
 
  As an additional note, it is important to not cast the data type of the
  dir filter and to provide a string (i.e. dir0='2015') for pruning to work
  properly.
  With dir0=2015 the query works, but the directories are not pruned
 
  Similarly, if a view is created with columns for dir0, dir1, etc., the
  data types should not be cast or converted, based on current
  observations.
 
  It may be good to make it a bit friendlier for a better user
  experience,
  will file an enhancement request.
 
  —Andries
 
 
  On Feb 3, 2015, at 5:35 PM, Aman Sinha asi...@maprtech.com wrote:
 
  Yes, that's the expected behavior for now.  Directory pruning where
  only
  subdirectory is specified is logically equivalent to wildcard
  matching
  -
  '*/*/10'  which is not supported yet.  You could open an enhancement
  request.
 
  On Tue, Feb 3, 2015 at 5:27 PM, Andries Engelbrecht 
  aengelbre...@maprtech.com wrote:
 
  Is it required for the directory pruning to work that a top down
  filter
  of
  directories be applied?
 
  My current observation is that for a directory structure as listed
  below,
  the pruning only works if the full tree is provided. If only a lower
  level
  directory is supplied in the filter condition Drill only uses it as
  a
  filter.
 
  /2015
 /01
/10
/11
/12
/13
/14
 
  select count(id) from `/foo` t where dir0='2015' and dir1='01' and
  dir2='10'
  Produces the correct pruning and query plan
  01-02    Project(id=[$3]): rowcount = 3670316.0, cumulative cost =
  {1.1010948E7 rows, 1.4681284E7 cpu, 0.0 io, 0.0 network, 0.0 memory},
  id = 28434
  01-03      Project(dir0=[$0], dir1=[$3], dir2=[$2], id=[$1]):
  rowcount = 3670316.0, cumulative cost = {7340632.0 rows, 1.468128E7 cpu,
  0.0 io, 0.0 network, 0.0 memory}, id = 28433
  01-04        Scan(groupscan=[EasyGroupScan [selectionRoot=/foo,
  numFiles=24, columns=[`dir0`, `dir1`, `dir2`, `id`]
 
 
  However
  select count(id) from `/foo` t where dir2='10'
  Produces full scan of all sub directories and only applies a filter
  condition after the fact. Notice the numFiles between the 2, even
  though it
  lists columns in the base scan
  01-04    Filter(condition=[=($0, '10')]): rowcount =
  9423761.7, cumulative