Re: November Apache Drill board report

2018-11-07 Thread Padma Penumarthy
The 1.14 release notes should also mention batch sizing.

For the board report, can you please add the following:
Batch processing improvements that limit the amount of memory used by Hash Join,
Union All, Project, Hash Aggregate and Nested Loop Join.

Just FYI for everyone: here is the link to the document with details
about the batch sizing work:
https://docs.google.com/document/d/1Z-67Y_KNcbA2YYWCHEwf2PUEmXRPWSXsw-CHnXW_98Q/
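
If I remember right, the new limit is exposed through the
`drill.exec.memory.operator.output_batch_size` system option (16 MB by default).
A rough, untested sketch of checking and adjusting it from sqlline:

-- hedged example (not from the release notes): inspect and lower the
-- per-operator output batch size limit (value is in bytes)
SELECT * FROM sys.options WHERE name LIKE '%output_batch_size%';
ALTER SYSTEM SET `drill.exec.memory.operator.output_batch_size` = 8388608;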

Thanks
Padma





On Wed, Nov 7, 2018 at 5:56 AM Arina Ielchiieva  wrote:

> Hi Padma,
>
> I can include a mention of batch sizing, but I am not sure what exactly I
> should mention; a quick search over the release notes shows a couple of
> changes related to batch sizing:
> https://drill.apache.org/docs/apache-drill-1-14-0-release-notes/
> Could you please propose what I should include?
>
> @PMCs and committers
> Only one PMC member has given +1 for the report. Could more folks please
> review the report?
>
> Kind regards,
> Arina
>
> On Fri, Nov 2, 2018 at 8:33 PM Padma Penumarthy <
> penumarthy.pa...@gmail.com>
> wrote:
>
> > Hi Arina,
> >
> > Can you add batch sizing (for a bunch of operators and the Parquet reader)
> > also?
> >
> > Thanks
> > Padma
> >
> >
> > On Fri, Nov 2, 2018 at 2:55 AM Arina Ielchiieva 
> wrote:
> >
> > > Sure, let's mention.
> > > Updated the report.
> > >
> > > =
> > >
> > >  ## Description:
> > >  - Drill is a Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud
> > > Storage.
> > >
> > > ## Issues:
> > >  - There are no issues requiring board attention at this time.
> > >
> > > ## Activity:
> > >  - Since the last board report, Drill has released version 1.14.0,
> > >including the following enhancements:
> > > - Drill in a Docker container
> > > - Image metadata format plugin
> > > - Upgrade to Calcite 1.16.0
> > > - Kafka plugin push down support
> > > - Phonetic and String functions
> > > - Enhanced decimal data support
> > > - Spill to disk for the Hash Join support
> > > - CGROUPs resource management support
> > > - Lateral / Unnest support (disabled by default)
> > >  - There were active discussions about schema provision in Drill.
> > >Based on these discussions two projects are currently evolving:
> > >Drill metastore and schema provision in the file and in a query.
> > >  - Apache Drill book has been written by two PMC members (Charles and
> > > Paul).
> > >  - Drill developer meet up will be held on November 14, 2018.
> > >
> > >The following areas are going to be discussed:
> > > - Storage plugins
> > > - Schema discovery & Evolution
> > > - Metadata Management
> > > - Resource management
> > > - Integration with Apache Arrow
> > >
> > > ## Health report:
> > >  - The project is healthy. Development activity
> > >as reflected in the pull requests and JIRAs is good.
> > >  - Activity on the dev and user mailing lists is stable.
> > >  - Three committers and three new PMC members were added in the last
> > period.
> > >
> > > ## PMC changes:
> > >
> > >  - Currently 23 PMC members.
> > >  - New PMC members:
> > > - Boaz Ben-Zvi was added to the PMC on Fri Aug 17 2018
> > > - Charles Givre was added to the PMC on Mon Sep 03 2018
> > > - Vova Vysotskyi was added to the PMC on Fri Aug 24 2018
> > >
> > > ## Committer base changes:
> > >
> > >  - Currently 48 committers.
> > >  - New committers:
> > > - Chunhui Shi was added as a committer on Thu Sep 27 2018
> > > - Gautam Parai was added as a committer on Mon Oct 22 2018
> > > - Weijie Tong was added as a committer on Fri Aug 31 2018
> > >
> > > ## Releases:
> > >
> > >  - 1.14.0 was released on Sat Aug 04 2018
> > >
> > > ## Mailing list activity:
> > >
> > >  - d...@drill.apache.org:
> > > - 427 subscribers (down -6 in the last 3 months):
> > > - 2827 emails sent to list (2126 in previous quarter)
> > >
> > >  - iss...@drill.apache.org:
> > > - 18 subscribers (down -1 in the last 3 months):
> > > - 3487 emails sent to list (4769 in previous quarter)
> > >
> > >  - user@drill.apache.org:
> > > - 597 subscribers (down -6 in the last 3 months):
> > > - 332 emails sent to list (346 in previous quarter)
> > >

Re: November Apache Drill board report

2018-11-02 Thread Padma Penumarthy
Hi Arina,

Can you add batch sizing (for bunch of operators and parquet reader) also ?

Thanks
Padma


On Fri, Nov 2, 2018 at 2:55 AM Arina Ielchiieva  wrote:

> Sure, let's mention.
> Updated the report.
>
> =
>
>  ## Description:
>  - Drill is a Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud
> Storage.
>
> ## Issues:
>  - There are no issues requiring board attention at this time.
>
> ## Activity:
>  - Since the last board report, Drill has released version 1.14.0,
>including the following enhancements:
> - Drill in a Docker container
> - Image metadata format plugin
> - Upgrade to Calcite 1.16.0
> - Kafka plugin push down support
> - Phonetic and String functions
> - Enhanced decimal data support
> - Spill to disk for the Hash Join support
> - CGROUPs resource management support
> - Lateral / Unnest support (disabled by default)
>  - There were active discussions about schema provision in Drill.
>Based on these discussions two projects are currently evolving:
>Drill metastore and schema provision in the file and in a query.
>  - Apache Drill book has been written by two PMC members (Charles and
> Paul).
>  - Drill developer meet up will be held on November 14, 2018.
>
>The following areas are going to be discussed:
> - Storage plugins
> - Schema discovery & Evolution
> - Metadata Management
> - Resource management
> - Integration with Apache Arrow
>
> ## Health report:
>  - The project is healthy. Development activity
>as reflected in the pull requests and JIRAs is good.
>  - Activity on the dev and user mailing lists is stable.
>  - Three committers and three new PMC members were added in the last period.
>
> ## PMC changes:
>
>  - Currently 23 PMC members.
>  - New PMC members:
> - Boaz Ben-Zvi was added to the PMC on Fri Aug 17 2018
> - Charles Givre was added to the PMC on Mon Sep 03 2018
> - Vova Vysotskyi was added to the PMC on Fri Aug 24 2018
>
> ## Committer base changes:
>
>  - Currently 48 committers.
>  - New committers:
> - Chunhui Shi was added as a committer on Thu Sep 27 2018
> - Gautam Parai was added as a committer on Mon Oct 22 2018
> - Weijie Tong was added as a committer on Fri Aug 31 2018
>
> ## Releases:
>
>  - 1.14.0 was released on Sat Aug 04 2018
>
> ## Mailing list activity:
>
>  - d...@drill.apache.org:
> - 427 subscribers (down -6 in the last 3 months):
> - 2827 emails sent to list (2126 in previous quarter)
>
>  - iss...@drill.apache.org:
> - 18 subscribers (down -1 in the last 3 months):
> - 3487 emails sent to list (4769 in previous quarter)
>
>  - user@drill.apache.org:
> - 597 subscribers (down -6 in the last 3 months):
> - 332 emails sent to list (346 in previous quarter)
>
>
> ## JIRA activity:
>
>  - 164 JIRA tickets created in the last 3 months
>  - 128 JIRA tickets closed/resolved in the last 3 months
>
>
>
> On Fri, Nov 2, 2018 at 12:25 AM Sorabh Hamirwasia 
> wrote:
>
> > Hi Arina,
> > Lateral/Unnest feature was part of 1.14 though it was disabled by
> default.
> > Should we mention it as part of 1.14 enhancements in the report?
> >
> > Thanks,
> > Sorabh
> >
> > On Thu, Nov 1, 2018 at 9:29 AM Arina Yelchiyeva <
> > arina.yelchiy...@gmail.com>
> > wrote:
> >
> > > Thanks, Aman!  Updated the report.
> > > I went too far with 2019, luckily the meet up will be much earlier :)
> > >
> > > =
> > >
> > >  ## Description:
> > >  - Drill is a Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud
> > > Storage.
> > >
> > > ## Issues:
> > >  - There are no issues requiring board attention at this time.
> > >
> > > ## Activity:
> > >  - Since the last board report, Drill has released version 1.14.0,
> > >including the following enhancements:
> > > - Drill in a Docker container
> > > - Image metadata format plugin
> > > - Upgrade to Calcite 1.16.0
> > > - Kafka plugin push down support
> > > - Phonetic and String functions
> > > - Enhanced decimal data support
> > > - Spill to disk for the Hash Join support
> > > - CGROUPs resource management support
> > >  - There were active discussions about schema provision in Drill.
> > >Based on these discussions two projects are currently evolving:
> > >Drill metastore and schema provision in the file and in a query.
> > >  - Apache Drill book has been written by two PMC members (Charles and
> > > Paul).
> > >  - Drill developer meet up will be held on November 14, 2018.
> > >The following areas are going to be discussed:
> > > - Storage plugins
> > > - Schema discovery & Evolution
> > > - Metadata Management
> > > - Resource management
> > > - Integration with Apache Arrow
> > >
> > > ## Health report:
> > >  - The project is healthy. Development activity
> > >as reflected in the pull requests and JIRAs is good.
> > >  - Activity on the dev and user mailing lists is stable.
> > >  - Three committers and three new PMC members were added in the last
> > period.
> > >
> > > ## PMC changes:
> > >
> > >  - Currently 23 PMC members.

Re: [ANNOUNCE] New Committer: Hanumath Rao Maduri

2018-11-01 Thread Padma Penumarthy
Congratulations Hanu.

Thanks
Padma


On Thu, Nov 1, 2018 at 7:44 PM weijie tong  wrote:

> Congratulations, Hanu!
>
> On Fri, Nov 2, 2018 at 8:22 AM Robert Hou  wrote:
>
> > Congratulations, Hanu.  Thanks for contributing to Drill.
> >
> > --Robert
> >
> > On Thu, Nov 1, 2018 at 4:06 PM Jyothsna Reddy 
> > wrote:
> >
> > > Congrats Hanu!! Well deserved :D
> > >
> > > Thank you,
> > > Jyothsna
> > >
> > > On Thu, Nov 1, 2018 at 2:15 PM Sorabh Hamirwasia  >
> > > wrote:
> > >
> > > > Congratulations Hanu!
> > > >
> > > > Thanks,
> > > > Sorabh
> > > >
> > > > On Thu, Nov 1, 2018 at 1:35 PM Hanumath Rao Maduri <
> hanu@gmail.com
> > >
> > > > wrote:
> > > >
> > > > > Thank you all for the wishes!
> > > > >
> > > > > Thanks,
> > > > > -Hanu
> > > > >
> > > > > On Thu, Nov 1, 2018 at 1:28 PM Chunhui Shi  > > > > .invalid>
> > > > > wrote:
> > > > >
> > > > > > Congratulations Hanu!
> > > > > >
> --
> > > > > > From:Arina Ielchiieva 
> > > > > > Send Time:2018 Nov 1 (Thu) 06:05
> > > > > > To:dev ; user 
> > > > > > Subject:[ANNOUNCE] New Committer: Hanumath Rao Maduri
> > > > > >
> > > > > > The Project Management Committee (PMC) for Apache Drill has
> invited
> > > > > > Hanumath
> > > > > > Rao Maduri to become a committer, and we are pleased to announce
> > that
> > > > he
> > > > > > has accepted.
> > > > > >
> > > > > > Hanumath became a contributor in 2017, making changes mostly in
> the
> > > > Drill
> > > > > > planning side, including lateral / unnest support. He is also one
> > of
> > > > the
> > > > > > contributors of index based planning and execution support.
> > > > > >
> > > > > > Welcome Hanumath, and thank you for your contributions!
> > > > > >
> > > > > > - Arina
> > > > > > (on behalf of Drill PMC)
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: [ANNOUNCE] New Committer: Gautam Parai

2018-10-22 Thread Padma Penumarthy
Congratulations Gautam.

Thanks
Padma


On Mon, Oct 22, 2018 at 7:25 AM Arina Ielchiieva  wrote:

> The Project Management Committee (PMC) for Apache Drill has invited Gautam
> Parai to become a committer, and we are pleased to announce that he has
> accepted.
>
> Gautam has been a contributor since 2016, making changes in various Drill
> areas, including the planning side. He is also one of the contributors of the
> upcoming feature to support index-based planning and execution.
>
> Welcome Gautam, and thank you for your contributions!
>
> - Arina
> (on behalf of Drill PMC)
>


Re: need understanding of drill affinity configurations

2018-07-05 Thread Padma Penumarthy
Did you find anything in the logs which might indicate why the drillbit is
down or not responding?
Try running jstack on the drillbit process to see what it is busy with when it
shows as down in the web UI.

Thanks
Padma


On Tue, Jul 3, 2018 at 11:22 PM, Divya Gehlot 
wrote:

> Thanks Padma !
> The issue is that, strangely, some of the drillbits go down, but when I check
> the processes, the drillbit process is still running.
> When I check the drillbit web UI, it doesn't display those down drillbits.
> I want to debug what's causing the issue, as the process is still running.
> To make the drillbit run again I have to kill the process and restart the
> drillbits on the impacted nodes, and it's happening frequently.
> How can I debug what's making a drillbit show as down in the web UI while the
> process is still running?
>
> Thanks,
> Divya
>
> On Wed, 4 Jul 2018 at 12:52, Padma Penumarthy 
> wrote:
>
> > max.width.per.endpoint - how many minor fragments you can have per major
> > fragment on a single node
> >
> > global.max.width - max number of minor fragments per major fragment you
> can
> > have across all nodes
> >
> > affinity.factor - When deciding how many minor fragments to schedule on
> > each node,
> > this is the factor by which you favor nodes which have data locality vs.
> > nodes which do not.
> >
> > executor.threads -  Seems like maximum number of threads we can have ? I
> am
> > not very sure about this one.
> >
> > These messages are debug messages, not errors.
> >
> > So, what is the problem you have ?
> >
> > Thanks
> > Padma
> >
> >
> >
> > On Tue, Jul 3, 2018 at 9:00 PM, Divya Gehlot 
> > wrote:
> >
> > > Hi,
> > > I would like to understand below configurations :
> > >  work: {
> > > max.width.per.endpoint: 5,
> > > global.max.width: 100,
> > > affinity.factor: 1.2,
> > > executor.threads: 4
> > >   },
> > >
> > > As I am getting below error at time in some of the Drill bits :
> > > 2018-07-03 04:08:45,725 [pool-245-thread-13] INFO
> > > o.a.d.e.s.schedule.BlockMapBuilder - Failure finding Drillbit running
> on
> > > host   Skipping affinity to that host.
> > > 2018-07-03 04:08:45,731 [pool-245-thread-11] INFO
> > > o.a.d.e.s.schedule.BlockMapBuilder - Failure finding Drillbit running
> on
> > > host   Skipping affinity to that host.
> > > 2018-07-03 04:08:45,732 [pool-245-thread-12] INFO
> > > o.a.d.e.s.schedule.BlockMapBuilder - Failure finding Drillbit running
> on
> > > host   Skipping affinity to that host.
> > > 2018-07-03 04:08:45,732 [pool-245-thread-14] INFO
> > > o.a.d.e.s.schedule.BlockMapBuilder - Failure finding Drillbit running
> on
> > > host   Skipping affinity to that host.
> > > 2018-07-03 04:08:45,733 [pool-245-thread-3] INFO
> > > o.a.d.e.s.schedule.BlockMapBuilder - Failure finding Drillbit running
> on
> > > host   Skipping affinity to that host.
> > > 2018-07-03 04:08:45,733 [pool-245-thread-16] INFO
> > > o.a.d.e.s.schedule.BlockMapBuilder - Failure finding Drillbit running
> on
> > > host   Skipping affinity to that host.
> > > 2018-07-03 04:08:45,735 [pool-245-thread-15] INFO
> > > o.a.d.e.s.schedule.BlockMapBuilder - Failure finding Drillbit running
> on
> > > host   Skipping affinity to that host.
> > >
> > > Is there any relationship between affinity configurations and Drill
> logs
> > ?
> > > Appreciate if anybody can help me understand !
> > >
> > > Thanks,
> > > Divya
> > >
> >
>


Re: need understanding of drill affinity configurations

2018-07-03 Thread Padma Penumarthy
max.width.per.endpoint - how many minor fragments you can have per major
fragment on a single node

global.max.width - max number of minor fragments per major fragment you can
have across all nodes

affinity.factor - When deciding how many minor fragments to schedule on
each node,
this is the factor by which you favor nodes which have data locality vs.
nodes which do not.

executor.threads - seems to be the maximum number of executor threads per
drillbit? I am not very sure about this one.

These messages are informational, not errors.

So, what is the problem you are seeing?
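
If it helps, you can also check which values are actually in effect by querying
the boot options from sqlline. A quick sketch (the LIKE patterns are just a
guess at the option names, adjust as needed):

-- hedged example: list boot-time parallelization related settings
SELECT * FROM sys.boot WHERE name LIKE '%width%' OR name LIKE '%affinity%';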

Thanks
Padma



On Tue, Jul 3, 2018 at 9:00 PM, Divya Gehlot 
wrote:

> Hi,
> I would like to understand below configurations :
>  work: {
> max.width.per.endpoint: 5,
> global.max.width: 100,
> affinity.factor: 1.2,
> executor.threads: 4
>   },
>
> As I am getting below error at time in some of the Drill bits :
> 2018-07-03 04:08:45,725 [pool-245-thread-13] INFO
> o.a.d.e.s.schedule.BlockMapBuilder - Failure finding Drillbit running on
> host   Skipping affinity to that host.
> 2018-07-03 04:08:45,731 [pool-245-thread-11] INFO
> o.a.d.e.s.schedule.BlockMapBuilder - Failure finding Drillbit running on
> host   Skipping affinity to that host.
> 2018-07-03 04:08:45,732 [pool-245-thread-12] INFO
> o.a.d.e.s.schedule.BlockMapBuilder - Failure finding Drillbit running on
> host   Skipping affinity to that host.
> 2018-07-03 04:08:45,732 [pool-245-thread-14] INFO
> o.a.d.e.s.schedule.BlockMapBuilder - Failure finding Drillbit running on
> host   Skipping affinity to that host.
> 2018-07-03 04:08:45,733 [pool-245-thread-3] INFO
> o.a.d.e.s.schedule.BlockMapBuilder - Failure finding Drillbit running on
> host   Skipping affinity to that host.
> 2018-07-03 04:08:45,733 [pool-245-thread-16] INFO
> o.a.d.e.s.schedule.BlockMapBuilder - Failure finding Drillbit running on
> host   Skipping affinity to that host.
> 2018-07-03 04:08:45,735 [pool-245-thread-15] INFO
> o.a.d.e.s.schedule.BlockMapBuilder - Failure finding Drillbit running on
> host   Skipping affinity to that host.
>
> Is there any relationship between affinity configurations and Drill logs ?
> Appreciate if anybody can help me understand !
>
> Thanks,
> Divya
>


Re: unit tests

2018-07-01 Thread Padma Penumarthy
Turned out to be a firewall issue. Thanks everyone for all the suggestions
and help.

Thanks
Padma


On Sun, Jul 1, 2018 at 3:27 PM, Padma Penumarthy  wrote:

> Already tried that with no luck.
>
> Thanks
> Padma
>
>
>
> On Sun, Jul 1, 2018 at 3:21 PM, Vitalii Diravka  > wrote:
>
>> Hi Padma,
>>
>> Looks like you have a wrong hostname or IP in your /etc/hosts.
>> Please find out more here [1], Sorabh has already answered a similar
>> question :)
>>
>> [1]
>> https://lists.apache.org/thread.html/%3CHE1PR07MB33068E59F25
>> 7a4d78f29304d84...@he1pr07mb3306.eurprd07.prod.outlook.com%3E
>>
>> Kind regards
>> Vitalii
>>
>>
>> On Mon, Jul 2, 2018 at 1:06 AM Padma Penumarthy <
>> penumarthy.pa...@gmail.com>
>> wrote:
>>
>> > I am getting the following error while trying to run unit tests on my
>> mac.
>> > Anyone has any idea what might be wrong ?
>> >
>> > 15:02:01.479 [Client-1] ERROR o.a.d.e.rpc.ConnectionMultiListener -
>> Failed
>> > to establish connection
>> > java.util.concurrent.ExecutionException:
>> > io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection
>> > refused: homeportal/192.168.1.254:31010
>> > at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:54)
>> > ~[netty-common-4.0.48.Final.jar:4.0.48.Final]
>> > at
>> >
>> > org.apache.drill.exec.rpc.ConnectionMultiListener$Connection
>> Handler.operationComplete(ConnectionMultiListener.java:90)
>> > [classes/:na]
>> > at
>> >
>> > org.apache.drill.exec.rpc.ConnectionMultiListener$Connection
>> Handler.operationComplete(ConnectionMultiListener.java:77)
>> > [classes/:na]
>> > at
>> >
>> > io.netty.util.concurrent.DefaultPromise.notifyListener0(Defa
>> ultPromise.java:507)
>> > [netty-common-4.0.48.Final.jar:4.0.48.Final]
>> > at
>> >
>> > io.netty.util.concurrent.DefaultPromise.notifyListeners0(Def
>> aultPromise.java:500)
>> > [netty-common-4.0.48.Final.jar:4.0.48.Final]
>> > at
>> >
>> > io.netty.util.concurrent.DefaultPromise.notifyListenersNow(D
>> efaultPromise.java:479)
>> > [netty-common-4.0.48.Final.jar:4.0.48.Final]
>> > at
>> >
>> > io.netty.util.concurrent.DefaultPromise.notifyListeners(Defa
>> ultPromise.java:420)
>> > [netty-common-4.0.48.Final.jar:4.0.48.Final]
>> > at
>> > io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPr
>> omise.java:122)
>> > [netty-common-4.0.48.Final.jar:4.0.48.Final]
>> > at
>> >
>> > io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fu
>> lfillConnectPromise(AbstractNioChannel.java:278)
>> > [netty-transport-4.0.48.Final.jar:4.0.48.Final]
>> > at
>> >
>> > io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fi
>> nishConnect(AbstractNioChannel.java:294)
>> > [netty-transport-4.0.48.Final.jar:4.0.48.Final]
>> > at
>> > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEven
>> tLoop.java:633)
>> > [netty-transport-4.0.48.Final.jar:4.0.48.Final]
>> > at
>> >
>> > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimiz
>> ed(NioEventLoop.java:580)
>> > [netty-transport-4.0.48.Final.jar:4.0.48.Final]
>> > at
>> >
>> > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEve
>> ntLoop.java:497)
>> > [netty-transport-4.0.48.Final.jar:4.0.48.Final]
>> > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
>> > [netty-transport-4.0.48.Final.jar:4.0.48.Final]
>> > at
>> >
>> > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(
>> SingleThreadEventExecutor.java:131)
>> > [netty-common-4.0.48.Final.jar:4.0.48.Final]
>> >
>> >
>> > Thanks
>> > Padma
>> >
>>
>
>


Re: unit tests

2018-07-01 Thread Padma Penumarthy
Already tried that with no luck.

Thanks
Padma



On Sun, Jul 1, 2018 at 3:21 PM, Vitalii Diravka 
wrote:

> Hi Padma,
>
> Looks like you have a wrong hostname or IP in your /etc/hosts.
> Please find out more here [1], Sorabh has already answered a similar
> question :)
>
> [1]
> https://lists.apache.org/thread.html/%3CHE1PR07MB33068E59F257A4D78F2
> 9304d84...@he1pr07mb3306.eurprd07.prod.outlook.com%3E
>
> Kind regards
> Vitalii
>
>
> On Mon, Jul 2, 2018 at 1:06 AM Padma Penumarthy <
> penumarthy.pa...@gmail.com>
> wrote:
>
> > I am getting the following error while trying to run unit tests on my
> mac.
> > Anyone has any idea what might be wrong ?
> >
> > 15:02:01.479 [Client-1] ERROR o.a.d.e.rpc.ConnectionMultiListener -
> Failed
> > to establish connection
> > java.util.concurrent.ExecutionException:
> > io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection
> > refused: homeportal/192.168.1.254:31010
> > at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:54)
> > ~[netty-common-4.0.48.Final.jar:4.0.48.Final]
> > at
> >
> > org.apache.drill.exec.rpc.ConnectionMultiListener$ConnectionHandler.
> operationComplete(ConnectionMultiListener.java:90)
> > [classes/:na]
> > at
> >
> > org.apache.drill.exec.rpc.ConnectionMultiListener$ConnectionHandler.
> operationComplete(ConnectionMultiListener.java:77)
> > [classes/:na]
> > at
> >
> > io.netty.util.concurrent.DefaultPromise.notifyListener0(
> DefaultPromise.java:507)
> > [netty-common-4.0.48.Final.jar:4.0.48.Final]
> > at
> >
> > io.netty.util.concurrent.DefaultPromise.notifyListeners0(
> DefaultPromise.java:500)
> > [netty-common-4.0.48.Final.jar:4.0.48.Final]
> > at
> >
> > io.netty.util.concurrent.DefaultPromise.notifyListenersNow(
> DefaultPromise.java:479)
> > [netty-common-4.0.48.Final.jar:4.0.48.Final]
> > at
> >
> > io.netty.util.concurrent.DefaultPromise.notifyListeners(
> DefaultPromise.java:420)
> > [netty-common-4.0.48.Final.jar:4.0.48.Final]
> > at
> > io.netty.util.concurrent.DefaultPromise.tryFailure(
> DefaultPromise.java:122)
> > [netty-common-4.0.48.Final.jar:4.0.48.Final]
> > at
> >
> > io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.
> fulfillConnectPromise(AbstractNioChannel.java:278)
> > [netty-transport-4.0.48.Final.jar:4.0.48.Final]
> > at
> >
> > io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(
> AbstractNioChannel.java:294)
> > [netty-transport-4.0.48.Final.jar:4.0.48.Final]
> > at
> > io.netty.channel.nio.NioEventLoop.processSelectedKey(
> NioEventLoop.java:633)
> > [netty-transport-4.0.48.Final.jar:4.0.48.Final]
> > at
> >
> > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(
> NioEventLoop.java:580)
> > [netty-transport-4.0.48.Final.jar:4.0.48.Final]
> > at
> >
> > io.netty.channel.nio.NioEventLoop.processSelectedKeys(
> NioEventLoop.java:497)
> > [netty-transport-4.0.48.Final.jar:4.0.48.Final]
> > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
> > [netty-transport-4.0.48.Final.jar:4.0.48.Final]
> > at
> >
> > io.netty.util.concurrent.SingleThreadEventExecutor$2.
> run(SingleThreadEventExecutor.java:131)
> > [netty-common-4.0.48.Final.jar:4.0.48.Final]
> >
> >
> > Thanks
> > Padma
> >
>


unit tests

2018-07-01 Thread Padma Penumarthy
I am getting the following error while trying to run unit tests on my mac.
Anyone has any idea what might be wrong ?

15:02:01.479 [Client-1] ERROR o.a.d.e.rpc.ConnectionMultiListener - Failed
to establish connection
java.util.concurrent.ExecutionException:
io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection
refused: homeportal/192.168.1.254:31010
at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:54)
~[netty-common-4.0.48.Final.jar:4.0.48.Final]
at
org.apache.drill.exec.rpc.ConnectionMultiListener$ConnectionHandler.operationComplete(ConnectionMultiListener.java:90)
[classes/:na]
at
org.apache.drill.exec.rpc.ConnectionMultiListener$ConnectionHandler.operationComplete(ConnectionMultiListener.java:77)
[classes/:na]
at
io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
[netty-common-4.0.48.Final.jar:4.0.48.Final]
at
io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:500)
[netty-common-4.0.48.Final.jar:4.0.48.Final]
at
io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:479)
[netty-common-4.0.48.Final.jar:4.0.48.Final]
at
io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
[netty-common-4.0.48.Final.jar:4.0.48.Final]
at
io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122)
[netty-common-4.0.48.Final.jar:4.0.48.Final]
at
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:278)
[netty-transport-4.0.48.Final.jar:4.0.48.Final]
at
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:294)
[netty-transport-4.0.48.Final.jar:4.0.48.Final]
at
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
[netty-transport-4.0.48.Final.jar:4.0.48.Final]
at
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
[netty-transport-4.0.48.Final.jar:4.0.48.Final]
at
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
[netty-transport-4.0.48.Final.jar:4.0.48.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
[netty-transport-4.0.48.Final.jar:4.0.48.Final]
at
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
[netty-common-4.0.48.Final.jar:4.0.48.Final]


Thanks
Padma


Re: Drill Hangout tomorrow 06/26

2018-06-26 Thread Padma Penumarthy
Here is the link to the document. Any feedback/comments welcome.

https://docs.google.com/document/d/1Z-67Y_KNcbA2YYWCHEwf2PUEmXRPWSXsw-CHnXW_98Q/edit?usp=sharing

Thanks
Padma


On Jun 26, 2018, at 12:12 PM, Aman Sinha <amansi...@gmail.com> wrote:

Hangout attendees on 06/26:
Padma, Hanumath, Boaz, Aman, Jyothsna, Sorabh, Arina, Bohdan, Vitalii,
Volodymyr, Abhishek, Robert

2 topics were discussed:
1. Vitalii brought up the Travis timeout issue, for which he has sent out
an email in this thread; actually, Vitalii, can you send it in a separate
email with an explicit subject, otherwise people may miss it.
2. Padma went over the batch sizing work and its current status.  Padma, please
add a link to your document.  Summarizing some of the discussion:

  - Does batch sizing affect output batches only or internal batches also
  ?  For certain operators such as HashAgg it does affect the internal
  batches held in the hash table since these batches are transferred as-is to
  the output container.
   - The 16 MB limit on the batch size is a best effort, but in some cases it
   could be slightly exceeded.  The number of rows per output batch is estimated
   as the nearest lower power of 2.  For example, if based on the input batch
   size the number of output rows is 600, it will be rounded down to 512.
   - An optimization could be done in the future to have the upstream operator
   provide the batch size information in metadata instead of the downstream
   operator computing it for each incoming batch.
  - There was discussion on estimating the size of complex type columns
  especially ones with nesting levels.  It would be good to add details in
  the document.


-Aman

On Tue, Jun 26, 2018 at 10:48 AM Vitalii Diravka <vitalii.dira...@gmail.com> wrote:

Lately the Drill Travis build fails more often because the Travis job time
expires.
The right way is to accelerate Drill execution :)

Nevertheless, I believe we should consider excluding some more tests from the
Travis build.
We can add all TPCH tests (
TestTpchLimit0, TestTpchExplain, TestTpchPlanning, TestTpchExplain) to the
SlowTest category.

Is there another solution for this issue? What other tests are executed
very slowly?

Kind regards
Vitalii


On Tue, Jun 26, 2018 at 3:34 AM Aman Sinha <amansi...@apache.org> wrote:

We'll have the Drill hangout tomorrow, Jun 26th, 2018, at 10:00 PDT.

If you have any topics to discuss, send a reply to this post or just join
the hangout.

( Drill hangout link

 )



Re: Which perform better JSON or convert JSON to parquet format ?

2018-06-10 Thread Padma Penumarthy
Yes, Parquet is always better, for multiple reasons. With JSON, we have to read
the whole file from a single reader thread and have to parse it to read
individual columns.
Parquet compresses and encodes data on disk, so we read much less data from
disk. Drill can read individual columns within each row group in parallel.
Also, we can leverage features like filter pushdown, partition pruning, and the
metadata cache for better query performance.
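
If you want to try the conversion, here is a rough sketch with CTAS (the
workspace and paths below are placeholders, adjust them to your setup):

-- hedged example: write a Parquet copy of a JSON directory with CTAS
ALTER SESSION SET `store.format` = 'parquet';
CREATE TABLE dfs.tmp.`clicks_parquet` AS
SELECT * FROM dfs.`/data/json/clicks`;
-- then point your queries at the Parquet copy
SELECT COUNT(*) FROM dfs.tmp.`clicks_parquet`;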

Thanks
Padma

> On Jun 10, 2018, at 8:22 PM, Abhishek Girish  wrote:
> 
> I would suggest converting the JSON files to parquet for better
> performance. JSON supports a more free form data model, so that's a
> trade-off you need to consider, in my opinion.
> On Sun, Jun 10, 2018 at 8:08 PM Divya Gehlot 
> wrote:
> 
>> Hi,
>> I am looking for advice regarding the performance of the below options:
>> 1. keep the JSON as is
>> 2. Convert the JSON file to parquet files
>> 
>> My JSON files data is not in fixed format and  file size varies from 10 KB
>> to 1 MB.
>> 
>> Appreciate the community users advise on above !
>> 
>> 
>> Thanks,
>> Divya
>> 



Re: Apache Drill issue

2018-06-04 Thread Padma Penumarthy
Did you verify the permissions ? Check the drillbit log. 
That will give some clues.

Thanks
Padma


> On Jun 3, 2018, at 7:28 AM, Samiksha Kapoor  
> wrote:
> 
> Hi Team,
> 
> I am doing a POC on Apache Drill for my organization, and I have installed
> Drill on my Linux environment. However, I am constantly facing one issue
> even after trying many configurations. When I try to view the Hive
> tables, I get no output; the query executes fine but returns no
> results.
> 
> Please help me find the resolution so that I can do a successful POC
> for my organization.
> 
> Looking forward to hear from you.
> 
> Thanks,
> Samiksha Kapoor



Re: Read complex json file gives list type doesn't support different data types

2018-05-30 Thread Padma Penumarthy
Yes, that is correct.
You can try setting the option “exec.enable_union_type” to make that work, with
the caveat that the union type is not fully supported in Drill.
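
Something like this (untested sketch; the file path is a placeholder):

-- hedged example: enable the experimental union type for the session,
-- then re-run the query against the JSON file
ALTER SESSION SET `exec.enable_union_type` = true;
SELECT t.Coordinates FROM dfs.`/data/geo.json` t LIMIT 5;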

Thanks
Padma


> On May 30, 2018, at 7:56 PM, Divya Gehlot  wrote:
> 
> Hi,
> I am reading a complex JSON file, and I am getting a "format not supported"
> error while reading the below:
> "Coordinates":[
>[
>   23.53,
>   4.99,
>   11
>],
>[
>   35.09,
>   7.7,
>   16
>]
> ]
> 
> 
> Error : Query execution error. Details:[
>> UNSUPPORTED_OPERATION ERROR: In a list of type FLOAT8, encountered a value
>> of type BIGINT. Drill does not support lists of different types.
>> Line  15
>> Column  19
>> Field  Coordinates
>> Line  15
>> Column  19
>> Field  Coordinates
>> Line  15
>> Column  19
>> Field  Coordinates
>> Fragment 0:0
> 
> 
> If I remove the third coordinates (11, 16), which are integers, it works like
> a charm.
> 
> Does that mean Drill doesn't support values of different data types in an
> array list?
> 
> Appreciate the help !
> 
> Thanks,
> Divya



Re: performance of a query executed on json vs paraquet

2018-05-15 Thread Padma Penumarthy
Is it the same json file you converted to parquet that you want to compare 
performance against ?

You can look at the query profiles (either the saved json files or from web UI) 
to see how much time each of them took to execute.
You can also see detailed stats per operator and compare.

Thanks
Padma


> On May 15, 2018, at 11:20 AM, Ashwini Guler  wrote:
> 
> Hi Team,
> 
> How do I check the performance of a query run on a Parquet file versus a
> query run on a JSON file in Apache Drill?
> -- 
> Regards,
> Ashwini



Re: Setting up drill to query AWS S3 behind a proxy

2018-03-12 Thread Padma Penumarthy
Not sure what exactly you mean by proxy settings.
But here is what you can do to access files on S3.
Enable the S3 storage plugin and update the connection string, access key, and
secret key in the config.
If it is able to connect fine, you should see s3.root when you do show
databases.
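
For a quick sanity check once the plugin is enabled, something like this should
work (the workspace and file path below are placeholders):

-- hedged example: verify the S3 plugin is visible and readable
SHOW DATABASES;
USE s3.root;
SHOW FILES;
SELECT * FROM s3.root.`path/to/sample.json` LIMIT 10;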

Thanks
Padma


> On Mar 12, 2018, at 12:43 PM, Edelman, Tyler  
> wrote:
> 
> Hello,
> 
> I am currently trying to set up drill locally to query a JSON file in 
> Amazon’s AWS S3. I have not been able to configure proxy settings for drill. 
> Could you send me a configuration example of this?
> 
> Thank you,
> Tyler Edelman
> 
> 



Re: [Drill 1.10.0/1.12.0] Query Started Taking Time + frequent one or more node lost connectivity error

2018-03-12 Thread Padma Penumarthy
There can be a lot of issues here.
A connection loss error can happen when ZooKeeper thinks that a node is dead
because it did not get a heartbeat from the node. That can be because the node
is busy or you have network problems. Did anything change in your network?
Is the data static or are you adding new data? Do you have metadata caching
enabled?
PARQUET_WRITER seems to indicate you are doing some kind of CTAS.
The block missing exception could possibly mean a problem with the name node or
bad disks on one of the nodes.

Thanks
Padma

> On Mar 12, 2018, at 1:27 AM, Anup Tiwari  wrote:
> 
> Hi All,
> For the last couple of days I have been stuck on a problem. I have a query
> which left joins 3 Drill tables (Parquet); it used to take around 15-20 mins
> every day, but for the last couple of days it is taking more than 45 mins.
> When I tried to drill down, I can see in the operator profile that 40% of the
> query time is going to PARQUET_WRITER and 28% to PARQUET_ROW_GROUP_SCAN. I am
> not sure if the stats were the same before this issue, as earlier it executed
> in 15-20 min max. Also, on top of this, we used to create a table, which is
> now showing the below error:
> SYSTEM ERROR: BlockMissingException: Could not obtain block:
> BP-1083556055-10.51.2.101-148327179:blk_1094763477_21022752
> Also, in the last few days I am getting a frequent "one or more node lost
> connectivity" error.
> I just upgraded to Drill 1.12.0 from 1.10.0, but the above issues are still
> there.
> Any help will be appreciated.
> Regards,
> Anup Tiwari



Re: Accessing underlying scheme of input

2018-03-01 Thread Padma Penumarthy
Not sure why it is not showing the fields. It does not work for me either. 
Does anyone know more ? Is this broken ? 
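
One thing that might be worth trying (I have not verified it): since the view is
defined as SELECT *, only a single dynamic star column of type ANY gets
recorded, so defining the view with explicit casts should let DESCRIBE report
concrete types. A rough sketch:

-- hedged example: record concrete column types in the view definition
USE dfs.tmp;
CREATE OR REPLACE VIEW emp_typed AS
SELECT CAST(employee_id AS INT) AS employee_id,
       CAST(full_name AS VARCHAR(100)) AS full_name,
       CAST(position_title AS VARCHAR(100)) AS position_title
FROM cp.`employee.json`;
DESCRIBE emp_typed;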

Thanks
Padma 

> On Mar 1, 2018, at 2:54 PM, Erol Akarsu <eaka...@gmail.com> wrote:
> 
> Somehow, after "user dfs.tmp", I was able to create view. But "describe"
> for view does not give much information. I was expecting "describe" command
> would give type  definitions of fields " employee_id  |full_name |
> first_name  | last_name  | position_id  |   position_title| store_id  |
> depart ". But it does give a very generic field type.
> 
> 
> 0: jdbc:drill:zk=local> create view mydonuts2 as SELECT * FROM
> cp.`employee.json` LIMIT 3;
> +-------+------------------------------------------------------------+
> |  ok   |                          summary                           |
> +-------+------------------------------------------------------------+
> | true  | View 'mydonuts2' created successfully in 'dfs.tmp' schema  |
> +-------+------------------------------------------------------------+
> 1 row selected (0.283 seconds)
> 0: jdbc:drill:zk=local> describe mydonuts2;
> +--------------+------------+--------------+
> | COLUMN_NAME  | DATA_TYPE  | IS_NULLABLE  |
> +--------------+------------+--------------+
> | *            | ANY        | YES          |
> +--------------+------------+--------------+
> 1 row selected (0.388 seconds)
> 0: jdbc:drill:zk=local>  SELECT * FROM cp.`employee.json` LIMIT 3;
> +--------------+------------------+-------------+------------+--------------+---------------------+-----------+---------+
> | employee_id  | full_name        | first_name  | last_name  | position_id  | position_title      | store_id  | depart  |
> +--------------+------------------+-------------+------------+--------------+---------------------+-----------+---------+
> | 1            | Sheri Nowmer     | Sheri       | Nowmer     | 1            | President           | 0         | 1       |
> | 2            | Derrick Whelply  | Derrick     | Whelply    | 2            | VP Country Manager  | 0         | 1       |
> | 4            | Michael Spence   | Michael     | Spence     | 2            | VP Country Manager  | 0         | 1       |
> +--------------+------------------+-------------+------------+--------------+---------------------+-----------+---------+
> 3 rows selected (0.579 seconds)
> 
> On Thu, Mar 1, 2018 at 3:18 PM, Erol Akarsu <eaka...@gmail.com> wrote:
> 
>> Padma,
>> 
>> I have not created any user. I just installed the system and run drill
>> with  "sqlline.bat -u "jdbc:drill:zk=local"
>> Therefore, what is shortest procedure to achieve what you have described
>> in previous email?
>> 
>> Thanks
>> 
>> Erol Akarsu
>> 
>> On Thu, Mar 1, 2018 at 3:00 PM, Padma Penumarthy <ppenumar...@mapr.com>
>> wrote:
>> 
>>> Check if you have permissions to root directory or not.
>>> You may have to specify the complete directory path (for which you have
>>> permissions for) in the create view command.
>>> 
>>> For example:
>>> 
>>> 0: jdbc:drill:zk=local> create view 
>>> dfs.root.`/Users/ppenumarthy/parquet/test-view`
>>> as select * from dfs.root.`/Users/ppenumarthy/parquet/0_0_0.parquet`;
>>> +---+---
>>> -+
>>> |  ok   |summary
>>>   |
>>> +---+---
>>> -+
>>> | true  | View '/Users/ppenumarthy/parquet/test-view' created
>>> successfully in 'dfs.root' schema  |
>>> +---+---
>>> -+
>>> 1 row selected (0.148 seconds)
>>> 0: jdbc:drill:zk=local>
>>> 
>>> 
>>> Thanks
>>> Padma
>>> 
>>> On Mar 1, 2018, at 11:37 AM, Erol Akarsu <eaka...@gmail.com> wrote:
>>> 
>>> Padma,
>>> 
>>> I have changed dfs storage plugin through web interface as below. But I am
>>> getting same error response.
>>> 
>>> {
>>> "type": "file",
>>> "enabled": true,
>>> "connection": "file:///",
>>> "config": null,
>>> "workspaces": {
>>>   "root": {
>>> "location": "/",
>>> "writable": true,

Re: Accessing underlying scheme of input

2018-03-01 Thread Padma Penumarthy
Check whether you have permissions to the root directory or not.
You may have to specify the complete directory path (one you have
permissions for) in the create view command.

For example:

0: jdbc:drill:zk=local> create view 
dfs.root.`/Users/ppenumarthy/parquet/test-view` as select * from 
dfs.root.`/Users/ppenumarthy/parquet/0_0_0.parquet`;
+-------+------------------------------------------------------------------------------------------+
|  ok   |                                         summary                                          |
+-------+------------------------------------------------------------------------------------------+
| true  | View '/Users/ppenumarthy/parquet/test-view' created successfully in 'dfs.root' schema    |
+-------+------------------------------------------------------------------------------------------+
1 row selected (0.148 seconds)
0: jdbc:drill:zk=local>


Thanks
Padma

On Mar 1, 2018, at 11:37 AM, Erol Akarsu <eaka...@gmail.com> wrote:

Padma,

I have changed dfs storage plugin through web interface as below. But I am
getting same error response.

{
 "type": "file",
 "enabled": true,
 "connection": "file:///",
 "config": null,
 "workspaces": {
   "root": {
 "location": "/",
 "writable": true,
 "defaultInputFormat": null,
 "allowAccessOutsideWorkspace": true
   },
   "tmp": {
 "location": "/tmp",
 "writable": true,
 "defaultInputFormat": null,
 "allowAccessOutsideWorkspace": true
   }
 },

On Thu, Mar 1, 2018 at 1:15 PM, Padma Penumarthy <ppenumar...@mapr.com> wrote:

Make "writable": true for the workspace (dfs.root) in the storage plugin
configuration.

Thanks
Padma


On Mar 1, 2018, at 10:10 AM, Erol Akarsu <eaka...@gmail.com> wrote:

Thanks Padma.

I am getting problem while creating view

0: jdbc:drill:zk=local> create view mydonuts as SELECT * FROM
cp.`employee.json` LIMIT 3;
Error: VALIDATION ERROR: Root schema is immutable. Creating or dropping
tables/views is not allowed in root schema.Select a schema using 'USE
schema' command.


[Error Id: 68a31047-5a4e-4768-8722-55648d9a80f6 on DESKTOP-8OANV3A:31010]
(state=,code=0)
0: jdbc:drill:zk=local>

On Thu, Mar 1, 2018 at 12:49 PM, Padma Penumarthy <ppenumar...@mapr.com> wrote:

Try creating a view and use describe.

https://drill.apache.org/docs/describe/

Thanks
Padma


On Mar 1, 2018, at 9:22 AM, Erol Akarsu <eaka...@gmail.com> wrote:

When Use limit 0 query,  I am getting only field names. I am looking for
json schema for input that will describe input type

0: jdbc:drill:> select * from `clicks/clicks.json` limit 0;

+---+---+---++-+
| trans_id  | date  | time  | user_info  | trans_info  |
+---+---+---++-+
+---+---+---++-+
No

On Thu, Mar 1, 2018 at 11:24 AM, Erol Akarsu <eaka...@gmail.com> wrote:

I am sorry Sorabh
Can you give an example? I am still learning Drill
Thanks

On Thu, Mar 1, 2018 at 11:11 AM Sorabh Hamirwasia <shamirwa...@mapr.com> wrote:

Hi Erol,

You can run limit 0 query from client to retrieve just the schema for
your input.


Thanks,
Sorabh


From: Erol Akarsu <eaka...@gmail.com>
Sent: Thursday, March 1, 2018 5:28:52 AM
To: user@drill.apache.org
Subject: Accessing underlying scheme of input

I know Apache drill is creating a json schema for input data file or hdfs
input before user query on it.
I like to know whether or not Apache drill has API that will help user to
obtain that  derived schema for say an json file or excel file or hive
input.
I appreciate your help

Erol Akarsu

Sent from Mail for Windows 10

--

Erol Akarsu




--

Erol Akarsu




--

Erol Akarsu




--

Erol Akarsu



Re: Accessing underlying scheme of input

2018-03-01 Thread Padma Penumarthy
Make "writable": true for the workspace (dfs.root) in the storage plugin 
configuration.

Thanks
Padma


On Mar 1, 2018, at 10:10 AM, Erol Akarsu <eaka...@gmail.com> wrote:

Thanks Padma.

I am getting problem while creating view

0: jdbc:drill:zk=local> create view mydonuts as SELECT * FROM
cp.`employee.json` LIMIT 3;
Error: VALIDATION ERROR: Root schema is immutable. Creating or dropping
tables/views is not allowed in root schema.Select a schema using 'USE
schema' command.


[Error Id: 68a31047-5a4e-4768-8722-55648d9a80f6 on DESKTOP-8OANV3A:31010]
(state=,code=0)
0: jdbc:drill:zk=local>

On Thu, Mar 1, 2018 at 12:49 PM, Padma Penumarthy <ppenumar...@mapr.com> wrote:

Try creating a view and use describe.

https://drill.apache.org/docs/describe/

Thanks
Padma


On Mar 1, 2018, at 9:22 AM, Erol Akarsu <eaka...@gmail.com> wrote:

When Use limit 0 query,  I am getting only field names. I am looking for
json schema for input that will describe input type

0: jdbc:drill:> select * from `clicks/clicks.json` limit 0;

+---+---+---++-+
| trans_id  | date  | time  | user_info  | trans_info  |
+---+---+---++-+
+---+---+---++-+
No

On Thu, Mar 1, 2018 at 11:24 AM, Erol Akarsu <eaka...@gmail.com> wrote:

I am sorry Sorabh
Can you give an example? I am still learning Drill
Thanks

On Thu, Mar 1, 2018 at 11:11 AM Sorabh Hamirwasia <shamirwa...@mapr.com> wrote:

Hi Erol,

You can run limit 0 query from client to retrieve just the schema for
your input.


Thanks,
Sorabh


From: Erol Akarsu <eaka...@gmail.com>
Sent: Thursday, March 1, 2018 5:28:52 AM
To: user@drill.apache.org
Subject: Accessing underlying scheme of input

I know Apache drill is creating a json schema for input data file or hdfs
input before user query on it.
I like to know whether or not Apache drill has API that will help user to
obtain that  derived schema for say an json file or excel file or hive
input.
I appreciate your help

Erol Akarsu

Sent from Mail for Windows 10

--

Erol Akarsu




--

Erol Akarsu




--

Erol Akarsu



Re: Accessing underlying scheme of input

2018-03-01 Thread Padma Penumarthy
Try creating a view and use describe.

https://drill.apache.org/docs/describe/

Thanks
Padma


On Mar 1, 2018, at 9:22 AM, Erol Akarsu 
> wrote:

When Use limit 0 query,  I am getting only field names. I am looking for
json schema for input that will describe input type

0: jdbc:drill:> select * from `clicks/clicks.json` limit 0;

+---+---+---++-+
| trans_id  | date  | time  | user_info  | trans_info  |
+---+---+---++-+
+---+---+---++-+
No

On Thu, Mar 1, 2018 at 11:24 AM, Erol Akarsu 
> wrote:

I am sorry Sorabh
Can you give an example? I am still learning Drill
Thanks

On Thu, Mar 1, 2018 at 11:11 AM Sorabh Hamirwasia 
>
wrote:

Hi Erol,

You can run limit 0 query from client to retrieve just the schema for
your input.


Thanks,
Sorabh


From: Erol Akarsu >
Sent: Thursday, March 1, 2018 5:28:52 AM
To: user@drill.apache.org
Subject: Accessing underlying scheme of input

I know Apache drill is creating a json schema for input data file or hdfs
input before user query on it.
I like to know whether or not Apache drill has API that will help user to
obtain that  derived schema for say an json file or excel file or hive
input.
I appreciate your help

Erol Akarsu

Sent from Mail for Windows 10

--

Erol Akarsu




--

Erol Akarsu



Re: S3 Connection Issues

2018-02-13 Thread Padma Penumarthy
Yes, I built it by changing the version in the pom file.
Try what Arjun suggested and see if it works.
If not, you can download the source, change the version, and build, or,
if you prefer, I can provide you with a private build that you can try.

Thanks
Padma


On Feb 13, 2018, at 1:46 AM, Anup Tiwari <anup.tiw...@games24x7.com> wrote:

Hi Padma,
As you have mentioned "Last time I tried, using Hadoop 2.8.1 worked for me",
have you built Drill with Hadoop 2.8.1? If yes, can you provide the steps?
I have downloaded the 1.11.0 tarball and replaced hadoop-aws-2.7.1.jar
with hadoop-aws-2.9.0.jar, but I am still not able to query the S3 bucket
successfully; queries stay in the starting state.
We are trying to query the "ap-south-1" region, which supports only the v4
signature.





On Thu, Oct 19, 2017 9:44 AM, Padma Penumarthy <ppenumar...@mapr.com> wrote:
Which AWS region are you trying to connect to ?

We have a problem connecting to regions which support only v4 signature

since the version of hadoop we include in Drill is old.

Last time I tried, using Hadoop 2.8.1 worked for me.



Thanks

Padma





On Oct 18, 2017, at 8:14 PM, Charles Givre <cgi...@gmail.com> wrote:



Hello all,

I’m trying to use Drill to query data in an S3 bucket and running into some
issues which I can’t seem to fix. I followed the various instructions online to
set up Drill with S3, and put my keys in both the conf-site.xml and in the
plugin config, but every time I attempt to do anything I get the following
errors:





jdbc:drill:zk=local> show databases;

Error: SYSTEM ERROR: AmazonS3Exception: Status Code: 403, AWS Service: Amazon
S3, AWS Request ID: 56D1999BD1E62DEB, AWS Error Code: null, AWS Error Message:
Forbidden





[Error Id: 65d0bb52-a923-4e98-8ab1-65678169140e on
charless-mbp-2.fios-router.home:31010] (state=,code=0)

0: jdbc:drill:zk=local> show databases;

Error: SYSTEM ERROR: AmazonS3Exception: Status Code: 403, AWS Service: Amazon
S3, AWS Request ID: 4D2CBA8D42A9ECA0, AWS Error Code: null, AWS Error Message:
Forbidden





[Error Id: 25a2d008-2f4d-4433-a809-b91ae063e61a on
charless-mbp-2.fios-router.home:31010] (state=,code=0)

0: jdbc:drill:zk=local> show files in s3.root;

Error: SYSTEM ERROR: AmazonS3Exception: Status Code: 403, AWS Service: Amazon
S3, AWS Request ID: 2C635944EDE591F0, AWS Error Code: null, AWS Error Message:
Forbidden





[Error Id: 02e136f5-68c0-4b47-9175-a9935bda5e1c on
charless-mbp-2.fios-router.home:31010] (state=,code=0)

0: jdbc:drill:zk=local> show schemas;

Error: SYSTEM ERROR: AmazonS3Exception: Status Code: 403, AWS Service: Amazon
S3, AWS Request ID: 646EB5B2EBCF7CD2, AWS Error Code: null, AWS Error Message:
Forbidden





[Error Id: 954aaffe-616a-4f40-9ba5-d4b7c04fe238 on
charless-mbp-2.fios-router.home:31010] (state=,code=0)



I have verified that the keys are correct but using the AWS CLI and downloaded
some of the files, but I’m kind of at a loss as to how to debug. Any
suggestions?

Thanks in advance,

— C







Regards,
Anup Tiwari

Sent with Mixmax



Re: Hangout Topics for 11/14/2017

2017-11-13 Thread Padma Penumarthy
Here are the topics so far:

Unit Testing - Tim
1.12 Release - Arina
Metadata Management - Padma

Thanks
Padma


On Nov 13, 2017, at 1:15 PM, Padma Penumarthy <ppenumar...@mapr.com> wrote:

Drill hangout tomorrow Nov 14th, at 10 AM PST.
Please send email or bring them up tomorrow, if you have topics to discuss.

Hangout link:
https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc

Thanks
Padma




Hangout Topics for 11/14/2017

2017-11-13 Thread Padma Penumarthy
Drill hangout tomorrow Nov 14th, at 10 AM PST.
Please send email or bring them up tomorrow, if you have topics to discuss.

Hangout link:
https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc

Thanks
Padma



Re: S3 Performance

2017-11-05 Thread Padma Penumarthy
Hi Uwe,

This is a lot of good information. We should document it in a JIRA.

BTW, I just checked and apparently Hadoop 2.8.2 was released recently, which
they claim is the first GA release.
I think we can attempt to move to Hadoop 2.8.2 after Drill 1.12 is released.
Yes, some unit tests were failing last time I tried 2.8.1, but I think we can
fix them.

Thanks
Padma


> On Nov 5, 2017, at 8:27 AM, Uwe L. Korn  wrote:
> 
> Hello Charles, 
> 
> I ran into the same performance issues some time ago and did make some
> discoveries:
> 
> * Drill is good at only pulling the byte ranges out of the file system
> it needs. Sadly, s3a in Hadoop 2.7 is translating a request to the byte
> range (x,y) into a HTTP request to S3 of the byte range
> (x,end-of-file). In the case of Parquet, this means that you will read
> for each column in each row group from the beginning of this column
> chunk to the end of the file. Overall this amounted for me for a
> traffic of 10-20x the size of the actual file in total.
> * Hadoop 2.8/3.0 actually introduces a new S3 experimental random
> access mode that really improves performance as this will only send
> requests of (x, y+readahead.range) to S3. You can activate it with
> fs.s3a.experimental.input.fadvise=random.
> * I played a bit with fs.s3a.readahead.range which is optimistic range
> that is included in the request but actually found that I could keep it
> at its default of 65536 bytes as Drill often requests all bytes it
> needs at once and thus reading ahead did not improve the situation.
> * This random access mode plays well with Parquet files but sadly
> slowed down the read of the metadata cache drastically as only requests
> of the size 65540 were done to S3. Therefore I had to add
> is.setReadahead(filesize); after
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L593
> to ensure that the metadata cache is read at once from S3.
> * Also
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L662
> seem to have been always true in my case, causing a refresh of the
> cache on every query. As I had quite a big dataset, this added a large
> constant to my query. This might be simply due to the fact that S3 does
> not have the concept of directories. I have not digged deeper into this
> but added as a dirty workaround that once the cache exists, it is never
> updated automatically.
> 
> Locally I have made my own Drill build based on the Hadoop 2.8 libraries
> but sadly some unit tests failed, at least for the S3 testing,
> everything seems to work. Work is still on the 1.11 release sources and
> some code has changed since then. I will have some time in the next
> days/weeks to look again at this and might open some PRs (don't expect
> me to be the one to open the Hadoop-Update PR, I'm a full-time Python
> dev, so this is a bit out of my comfort zone :D ).  At least for my
> basic tests, this resulted in a quite performant setup for me (embedded
> and in distributed mode).
> 
> Cheers
> Uwe
> 
> On Sun, Nov 5, 2017, at 02:29 AM, Charles Givre wrote:
>> Hello everyone, 
>> I’m experimenting with Drill on S3 and I’ve been pretty disappointed with
>> the performance.  I’m curious as to what kind of performance I can
>> expect?  Also what can be done to improve performance on S3.  My current
>> config is I am using Drill in embedded mode with a corporate S3 bucket. 
>> Thanks,
>> — C



Re: Time series storage with parquet

2017-10-31 Thread Padma Penumarthy
parquet-tools can be used only for inspecting Parquet files, not for creating
new Parquet files.  Yes, you can use CTAS to do this. You have to manually
remove the old files and move the new files in.
It does impact the metadata caching mechanism:
you need to regenerate the metadata cache.
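
A rough sketch of the merge step (all workspace names and paths below are
placeholders, adjust them to your layout):

-- hedged example: roll the daily Parquet files for one week into a single
-- weekly table with CTAS, then rebuild the metadata cache for the month
CREATE TABLE dfs.ts.`weekly_2017_10_w42` AS
SELECT * FROM dfs.ts.`2017/10/daily`;
-- after swapping the daily files out for the weekly file:
REFRESH TABLE METADATA dfs.ts.`2017/10`;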

Thanks
Padma


> On Oct 31, 2017, at 5:08 AM, Rahul Raj  
> wrote:
> 
> Hi,
> 
> I have few questions on modeling a time series use case with parquet and
> drill. I have seen the topic discussed at
> https://issues.apache.org/jira/browse/DRILL-3534.
> 
> My requirements are:
> 
> * Keep the parquet files partitioned by year and month
> * For the current month, the data needs to be further partitioned by Week
> and Day
> * End of the running week, 7 daily parquets will be merged to a single
> weekly file
> * Similarly, weekly files will to be merged to form a monthly file during
> month end
> 
> I will have a web application to generate the daily data and to ensure the
> batch runs/ atomic writes/locking etc.
> 
> What are the possible ways to merge parquet files? Another CTAS?
> 
> Is it possible to use parquet-tools(part of Parquet-MR) to merge multiple
> parquets(java jar ./parquet-tools-.jar  
> ) and then let drill query the results?. Will it impact the
> drill meta data caching mechanism?
> 
> Regards,
> Rahul
> 
> -- 
>  This email and any files transmitted with it are confidential and 
> intended solely for the use of the individual or entity to whom it is 
> addressed. If you are not the named addressee then you should not 
> disseminate, distribute or copy this e-mail. Please notify the sender 
> immediately and delete this e-mail from your system.



Re: S3 Connection Issues

2017-10-25 Thread Padma Penumarthy
Yes, I am also using drill in embedded mode. Do you have at least read access 
to the bucket ?
Otherwise, I do not think this will work.

Thanks
Padma


> On Oct 25, 2017, at 8:19 AM, Charles Givre <cgi...@gmail.com> wrote:
> 
> Hi Padma, 
> I have been using drill in embedded mode.  Would that make a difference?  
> Also, I’m wondering if Drill might be trying to use a blocked port for S3.  
> —C
> 
> 
>> On Oct 24, 2017, at 2:00 PM, Padma Penumarthy <ppenumar...@mapr.com> wrote:
>> 
>> Yes, I guess you need to have access to the bucket.
>> Not sure how it will work otherwise.
>> 
>> Thanks
>> Padma
>> 
>> 
>> On Oct 24, 2017, at 10:51 AM, Charles Givre 
>> <cgi...@gmail.com<mailto:cgi...@gmail.com>> wrote:
>> 
>> Hi Padma,
>> I’m wondering if the issue is that I only have access to a subfolder in the 
>> s3 bucket.  IE:
>> 
>> s3://bucket/folder1/folder2/folder3 
>> 
>> I only have access to folder3.  Might that be causing the issue?
>> —C
>> 
>> 
>> 
>> On Oct 24, 2017, at 13:49, Padma Penumarthy 
>> <ppenumar...@mapr.com<mailto:ppenumar...@mapr.com>> wrote:
>> 
>> Charles, can you try exactly what I did.
>> I did not do anything else other than enable the S3 plugin and change the 
>> plugin
>> configuration like this.
>> 
>> {
>> "type": "file",
>> "enabled": true,
>> "connection": "s3a://",
>> "config": {
>> "fs.s3a.access.key": “",
>> "fs.s3a.secret.key": “"
>> },
>> 
>> 
>> Thanks
>> Padma
>> 
>> 
>> On Oct 24, 2017, at 10:06 AM, Charles Givre 
>> <cgi...@gmail.com<mailto:cgi...@gmail.com> <mailto:cgi...@gmail.com>> wrote:
>> 
>> Hi everyone and thank you for your help.  I’m still not able to connect to 
>> S3.
>> 
>> 
>> Here is the error I’m getting:
>> 
>> 0: jdbc:drill:zk=local> use s3;
>> Error: RESOURCE ERROR: Failed to create schema tree.
>> 
>> 
>> [Error Id: 57c82d90-2166-4a37-94a0-1cfeb0cdc4b6 on 
>> charless-mbp-2.fios-router.home:31010] (state=,code=0)
>> java.sql.SQLException: RESOURCE ERROR: Failed to create schema tree.
>> 
>> 
>> [Error Id: 57c82d90-2166-4a37-94a0-1cfeb0cdc4b6 on 
>> charless-mbp-2.fios-router.home:31010]
>> at 
>> org.apache.drill.jdbc.impl.DrillCursor.nextRowInternally(DrillCursor.java:489)
>> at 
>> org.apache.drill.jdbc.impl.DrillCursor.loadInitialSchema(DrillCursor.java:561)
>> at 
>> org.apache.drill.jdbc.impl.DrillResultSetImpl.execute(DrillResultSetImpl.java:1895)
>> at 
>> org.apache.drill.jdbc.impl.DrillResultSetImpl.execute(DrillResultSetImpl.java:61)
>> at 
>> org.apache.calcite.avatica.AvaticaConnection$1.execute(AvaticaConnection.java:473)
>> at 
>> org.apache.drill.jdbc.impl.DrillMetaImpl.prepareAndExecute(DrillMetaImpl.java:1100)
>> at 
>> org.apache.calcite.avatica.AvaticaConnection.prepareAndExecuteInternal(AvaticaConnection.java:477)
>> at 
>> org.apache.drill.jdbc.impl.DrillConnectionImpl.prepareAndExecuteInternal(DrillConnectionImpl.java:181)
>> at 
>> org.apache.calcite.avatica.AvaticaStatement.executeInternal(AvaticaStatement.java:109)
>> at 
>> org.apache.calcite.avatica.AvaticaStatement.execute(AvaticaStatement.java:121)
>> at 
>> org.apache.drill.jdbc.impl.DrillStatementImpl.execute(DrillStatementImpl.java:101)
>> at sqlline.Commands.execute(Commands.java:841)
>> at sqlline.Commands.sql(Commands.java:751)
>> at sqlline.SqlLine.dispatch(SqlLine.java:746)
>> at sqlline.SqlLine.begin(SqlLine.java:621)
>> at sqlline.SqlLine.start(SqlLine.java:375)
>> at sqlline.SqlLine.main(SqlLine.java:268)
>> Caused by: org.apache.drill.common.exceptions.UserRemoteException: RESOURCE 
>> ERROR: Failed to create schema tree.
>> 
>> 
>> [Error Id: 57c82d90-2166-4a37-94a0-1cfeb0cdc4b6 on 
>> charless-mbp-2.fios-router.home:31010]
>> at 
>> org.apache.drill.exec.rpc.user.QueryResultHandler.resultArrived(QueryResultHandler.java:123)
>> at org.apache.drill.exec.rpc.user.UserClient.handle(UserClient.java:368)
>> at org.apache.drill.exec.rpc.user.UserClient.handle(UserClient.java:90)
>> at org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:274)
>> at org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:244)
>> at 
>> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:89)
>> at 
>> io

Re: S3 Connection Issues

2017-10-24 Thread Padma Penumarthy
Yes, I guess you need to have access to the bucket.
Not sure how it will work otherwise.

Thanks
Padma


On Oct 24, 2017, at 10:51 AM, Charles Givre 
<cgi...@gmail.com<mailto:cgi...@gmail.com>> wrote:

Hi Padma,
I’m wondering if the issue is that I only have access to a subfolder in the s3 
bucket.  IE:

s3://bucket/folder1/folder2/folder3 

I only have access to folder3.  Might that be causing the issue?
—C



On Oct 24, 2017, at 13:49, Padma Penumarthy 
<ppenumar...@mapr.com<mailto:ppenumar...@mapr.com>> wrote:

Charles, can you try exactly what I did.
I did not do anything else other than enable the S3 plugin and change the plugin
configuration like this.

{
"type": "file",
"enabled": true,
"connection": "s3a://",
"config": {
  "fs.s3a.access.key": “",
  "fs.s3a.secret.key": “"
 },


Thanks
Padma


On Oct 24, 2017, at 10:06 AM, Charles Givre 
<cgi...@gmail.com<mailto:cgi...@gmail.com> <mailto:cgi...@gmail.com>> wrote:

Hi everyone and thank you for your help.  I’m still not able to connect to S3.


Here is the error I’m getting:

0: jdbc:drill:zk=local> use s3;
Error: RESOURCE ERROR: Failed to create schema tree.


[Error Id: 57c82d90-2166-4a37-94a0-1cfeb0cdc4b6 on 
charless-mbp-2.fios-router.home:31010] (state=,code=0)
java.sql.SQLException: RESOURCE ERROR: Failed to create schema tree.


[Error Id: 57c82d90-2166-4a37-94a0-1cfeb0cdc4b6 on 
charless-mbp-2.fios-router.home:31010]
at 
org.apache.drill.jdbc.impl.DrillCursor.nextRowInternally(DrillCursor.java:489)
at 
org.apache.drill.jdbc.impl.DrillCursor.loadInitialSchema(DrillCursor.java:561)
at 
org.apache.drill.jdbc.impl.DrillResultSetImpl.execute(DrillResultSetImpl.java:1895)
at 
org.apache.drill.jdbc.impl.DrillResultSetImpl.execute(DrillResultSetImpl.java:61)
at 
org.apache.calcite.avatica.AvaticaConnection$1.execute(AvaticaConnection.java:473)
at 
org.apache.drill.jdbc.impl.DrillMetaImpl.prepareAndExecute(DrillMetaImpl.java:1100)
at 
org.apache.calcite.avatica.AvaticaConnection.prepareAndExecuteInternal(AvaticaConnection.java:477)
at 
org.apache.drill.jdbc.impl.DrillConnectionImpl.prepareAndExecuteInternal(DrillConnectionImpl.java:181)
at 
org.apache.calcite.avatica.AvaticaStatement.executeInternal(AvaticaStatement.java:109)
at 
org.apache.calcite.avatica.AvaticaStatement.execute(AvaticaStatement.java:121)
at 
org.apache.drill.jdbc.impl.DrillStatementImpl.execute(DrillStatementImpl.java:101)
at sqlline.Commands.execute(Commands.java:841)
at sqlline.Commands.sql(Commands.java:751)
at sqlline.SqlLine.dispatch(SqlLine.java:746)
at sqlline.SqlLine.begin(SqlLine.java:621)
at sqlline.SqlLine.start(SqlLine.java:375)
at sqlline.SqlLine.main(SqlLine.java:268)
Caused by: org.apache.drill.common.exceptions.UserRemoteException: RESOURCE 
ERROR: Failed to create schema tree.


[Error Id: 57c82d90-2166-4a37-94a0-1cfeb0cdc4b6 on 
charless-mbp-2.fios-router.home:31010]
at 
org.apache.drill.exec.rpc.user.QueryResultHandler.resultArrived(QueryResultHandler.java:123)
at org.apache.drill.exec.rpc.user.UserClient.handle(UserClient.java:368)
at org.apache.drill.exec.rpc.user.UserClient.handle(UserClient.java:90)
at org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:274)
at org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:244)
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:89)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:254)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:242)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
at 
io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
at 
io.netty.channel.DefaultChannelPipeline.fireChan

Re: S3 Connection Issues

2017-10-24 Thread Padma Penumarthy
HandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:847)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>   at java.lang.Thread.run(Thread.java:745)
> 0: jdbc:drill:zk=local> 
> 
> 
> Here is my conf-site.xml file:
> 
> <property>
>   <name>fs.s3.awsAccessKeyId</name>
>   <value>XXX</value>
> </property>
> <property>
>   <name>fs.s3.awsSecretAccessKey</name>
>   <value>XXX</value>
> </property>
> <property>
>   <name>fs.s3n.awsAccessKeyId</name>
>   <value>XXX</value>
> </property>
> <property>
>   <name>fs.s3n.awsSecretAccessKey</name>
>   <value>XXX</value>
> </property>
> <property>
>   <name>fs.s3a.awsAccessKeyId</name>
>   <value>XXX</value>
> </property>
> <property>
>   <name>fs.s3a.awsSecretAccessKey</name>
>   <value>XXX</value>
> </property>
> 
> And my config info:
> 
> {
>  "type": "file",
>  "enabled": true,
>  "connection": "s3://",
>  "config": null,
>  "workspaces": {
>"root": {
>  "location": "/",
>  "writable": false,
>  "defaultInputFormat": null
>}
>  },
> 
> I did copy jets3t-0.9.4.jar to the /jars/3rdparty path.  Any debugging 
> suggestions?
> —C 
> 
> 
>> On Oct 20, 2017, at 15:55, Arjun kr <arjun...@outlook.com> wrote:
>> 
>> Hi Charles,
>> 
>> 
>> I'm not aware of any such settings. As Padma mentioned in previous mail, It 
>> works fine for me by following instructions in 
>> https://drill.apache.org/docs/s3-storage-plugin/ .
>> 
>> 
>> Thanks,
>> 
>> 
>> Arjun
>> 
>> 
>> 
>> From: Charles Givre <cgi...@gmail.com>
>> Sent: Friday, October 20, 2017 11:48 PM
>> To: user@drill.apache.org
>> Subject: Re: S3 Connection Issues
>> 
>> Hi Arjun,
>> Thanks for your help.  Are there settings in S3 that would prevent Drill 
>> from connecting?  I’ll try hdfs shell, but I am able to connect with the CLI 
>> tool.   My hunch is that there is a permission not set correctly on S3 or 
>> I’m missing some config variable in Drill.
>> — C
>> 
>> 
>>> On Oct 20, 2017, at 14:12, Arjun kr <arjun...@outlook.com> wrote:
>>> 
>>> Hi  Charles,
>>> 
>>> 
>>> Any chance you can test s3 connectivity with other tools like hdfs shell or 
>>> hive in case you haven't tried already (and these tools available)? This 
>>> may help to identify if it is Drill specific issue.
>>> 
>>> 
>>> For connecting via hdfs , you may try below command.
>>> 
>>> 
>>> hadoop fs -Dfs.s3a.access.key="" -Dfs.s3a.secret.key="Y" -ls 
>>> s3a:///
>>> 
>>> 
>>> Enable DEBUG logging if needed.
>>> 
>>> 
>>> export HADOOP_ROOT_LOGGER=hadoop.root.logger=DEBUG,console
>>> 
>>> 
>>> Thanks,
>>> 
>>> 
>>> Arjun
>>> 
>>> 
>>> 
>>> From: Padma Penumarthy <ppenumar...@mapr.com>
>>> Sent: Friday, October 20, 2017 3:00 AM
>>> To: user@drill.apache.org
>>> Subject: Re: S3 Connection Issues
>>> 
>>> Hi Charles,
>>> 
>>> I tried us-west-2 and it worked fine for me with drill built from latest 
>>> source.
>>> I did not do anything special.
>>> Just enabled the S3 plugin and updated the plugin configuration like this.
>>> 
>>> {
>>> "type": "file",
>>> "enabled": true,
>>> "connection": "s3a://",
>>> "config": {
>>>  "fs.s3a.access.key": “",
>>>  "fs.s3a.secret.key": “"
>>> },
>>> 
>>> I am able to do show databases and also can query the parquet files I 
>>> uploaded to the bucket.
>>> 
>>> 0: jdbc:drill:zk=local> show da

Re: S3 with mixed files

2017-10-19 Thread Padma Penumarthy
From your error log, it seems like you may be specifying the table incorrectly.
Instead of 'ibios3.root.tracking/tracking.log', can you try 
ibios3.root.`tracking/tracking.log`

i.e. for example, select * from ibios3.root.`tracking/tracking.log`

Thanks
Padma


> On Oct 18, 2017, at 7:15 PM, Daniel McQuillen  
> wrote:
> 
> Hi,
> 
> Attempting to use Apache Drill to parse Open edX tracking log files I have
> stored on S3.
> 
> I've successfully set up an S3 connection and I can see my different
> directories in the target S3 bucket when I type `show files;` in embedded
> drill. Hooray!
> 
> However, I can't seem to do a query. I keep getting a "not found" error
> 
> SEVERE: org.apache.calcite.runtime.CalciteContextException: From line 1,
> column 15 to line 1, column 20: Table 'ibios3.root.tracking/tracking.log'
> not found
> 
> The "tracking" subdirectory has a most recent `tracking.log` file as well
> as a bunch of  gzipped older files, e.g. `tracking-log-20170518-1234.gz`
> ... could this be confusing Drill? I've tried querying an individual file
> (tracking.log) as well as the directory itself, but not luck.
> 
> Thanks for any thoughts!
> 
> 
> - Daniel



Re: S3 Connection Issues

2017-10-19 Thread Padma Penumarthy
Hi Charles,

I tried us-west-2 and it worked fine for me with drill built from latest source.
I did not do anything special.
Just enabled the S3 plugin and updated the plugin configuration like this.

{
  "type": "file",
  "enabled": true,
  "connection": "s3a://",
  "config": {
"fs.s3a.access.key": “",
"fs.s3a.secret.key": “"
  },

I am able to do show databases and also can query the parquet files I uploaded 
to the bucket.

0: jdbc:drill:zk=local> show databases;
+-+
| SCHEMA_NAME |
+-+
| INFORMATION_SCHEMA  |
| cp.default  |
| dfs.default |
| dfs.root|
| dfs.tmp |
| s3.default  |
| s3.root |
| sys |
+-+
8 rows selected (2.892 seconds)


Thanks
Padma

On Oct 18, 2017, at 9:18 PM, Charles Givre 
<cgi...@gmail.com<mailto:cgi...@gmail.com>> wrote:

Hi Padma,
The bucket is is us-west-2.  I also discovered that some of the variable names 
in the documentation on the main Drill site are incorrect.  Do I need to 
specify the region in the configuration somewhere?

As an update, after discovering that the variable names are incorrect and that 
I didn’t have Jets3t installed properly, I’m now getting the following error:

jdbc:drill:zk=local> show databases;
Error: RESOURCE ERROR: Failed to create schema tree.


[Error Id: e6012aa2-c775-46b9-b3ee-0af7d0b0871d on 
charless-mbp-2.fios-router.home:31010]

 (org.apache.hadoop.fs.s3.S3Exception) org.jets3t.service.S3ServiceException: 
Service Error Message. -- ResponseCode: 403, ResponseStatus: Forbidden, XML 
Error Message: SignatureDoesNotMatch: The request 
signature we calculated does not match the signature you provided. Check your 
key and signing method.
   org.apache.hadoop.fs.s3.Jets3tFileSystemStore.get():175
   org.apache.hadoop.fs.s3.Jets3tFileSystemStore.retrieveINode():221

Thanks,
— C


On Oct 19, 2017, at 00:14, Padma Penumarthy 
<ppenumar...@mapr.com<mailto:ppenumar...@mapr.com>> wrote:

Which AWS region are you trying to connect to ?
We have a  problem connecting to regions which support only v4 signature
since the version of hadoop we include in Drill is old.
Last time I tried, using Hadoop 2.8.1 worked for me.

Thanks
Padma


On Oct 18, 2017, at 8:14 PM, Charles Givre 
<cgi...@gmail.com<mailto:cgi...@gmail.com>> wrote:

Hello all,
I’m trying to use Drill to query data in an S3 bucket and running into some 
issues which I can’t seem to fix.  I followed the various instructions online 
to set up Drill with S3, and put my keys in both the conf-site.xml and in the 
plugin config, but every time I attempt to do anything I get the following 
errors:


jdbc:drill:zk=local> show databases;
Error: SYSTEM ERROR: AmazonS3Exception: Status Code: 403, AWS Service: Amazon 
S3, AWS Request ID: 56D1999BD1E62DEB, AWS Error Code: null, AWS Error Message: 
Forbidden


[Error Id: 65d0bb52-a923-4e98-8ab1-65678169140e on 
charless-mbp-2.fios-router.home:31010] (state=,code=0)
0: jdbc:drill:zk=local> show databases;
Error: SYSTEM ERROR: AmazonS3Exception: Status Code: 403, AWS Service: Amazon 
S3, AWS Request ID: 4D2CBA8D42A9ECA0, AWS Error Code: null, AWS Error Message: 
Forbidden


[Error Id: 25a2d008-2f4d-4433-a809-b91ae063e61a on 
charless-mbp-2.fios-router.home:31010] (state=,code=0)
0: jdbc:drill:zk=local> show files in s3.root;
Error: SYSTEM ERROR: AmazonS3Exception: Status Code: 403, AWS Service: Amazon 
S3, AWS Request ID: 2C635944EDE591F0, AWS Error Code: null, AWS Error Message: 
Forbidden


[Error Id: 02e136f5-68c0-4b47-9175-a9935bda5e1c on 
charless-mbp-2.fios-router.home:31010] (state=,code=0)
0: jdbc:drill:zk=local> show schemas;
Error: SYSTEM ERROR: AmazonS3Exception: Status Code: 403, AWS Service: Amazon 
S3, AWS Request ID: 646EB5B2EBCF7CD2, AWS Error Code: null, AWS Error Message: 
Forbidden


[Error Id: 954aaffe-616a-4f40-9ba5-d4b7c04fe238 on 
charless-mbp-2.fios-router.home:31010] (state=,code=0)

I have verified that the keys are correct but using the AWS CLI and downloaded 
some of the files, but I’m kind of at a loss as to how to debug.  Any 
suggestions?
Thanks in advance,
— C





Re: S3 Connection Issues

2017-10-18 Thread Padma Penumarthy
Which AWS region are you trying to connect to ? 
We have a  problem connecting to regions which support only v4 signature
since the version of hadoop we include in Drill is old. 
Last time I tried, using Hadoop 2.8.1 worked for me.

Thanks
Padma


> On Oct 18, 2017, at 8:14 PM, Charles Givre  wrote:
> 
> Hello all, 
> I’m trying to use Drill to query data in an S3 bucket and running into some 
> issues which I can’t seem to fix.  I followed the various instructions online 
> to set up Drill with S3, and put my keys in both the conf-site.xml and in the 
> plugin config, but every time I attempt to do anything I get the following 
> errors:
> 
> 
> jdbc:drill:zk=local> show databases;
> Error: SYSTEM ERROR: AmazonS3Exception: Status Code: 403, AWS Service: Amazon 
> S3, AWS Request ID: 56D1999BD1E62DEB, AWS Error Code: null, AWS Error 
> Message: Forbidden
> 
> 
> [Error Id: 65d0bb52-a923-4e98-8ab1-65678169140e on 
> charless-mbp-2.fios-router.home:31010] (state=,code=0)
> 0: jdbc:drill:zk=local> show databases;
> Error: SYSTEM ERROR: AmazonS3Exception: Status Code: 403, AWS Service: Amazon 
> S3, AWS Request ID: 4D2CBA8D42A9ECA0, AWS Error Code: null, AWS Error 
> Message: Forbidden
> 
> 
> [Error Id: 25a2d008-2f4d-4433-a809-b91ae063e61a on 
> charless-mbp-2.fios-router.home:31010] (state=,code=0)
> 0: jdbc:drill:zk=local> show files in s3.root;
> Error: SYSTEM ERROR: AmazonS3Exception: Status Code: 403, AWS Service: Amazon 
> S3, AWS Request ID: 2C635944EDE591F0, AWS Error Code: null, AWS Error 
> Message: Forbidden
> 
> 
> [Error Id: 02e136f5-68c0-4b47-9175-a9935bda5e1c on 
> charless-mbp-2.fios-router.home:31010] (state=,code=0)
> 0: jdbc:drill:zk=local> show schemas;
> Error: SYSTEM ERROR: AmazonS3Exception: Status Code: 403, AWS Service: Amazon 
> S3, AWS Request ID: 646EB5B2EBCF7CD2, AWS Error Code: null, AWS Error 
> Message: Forbidden
> 
> 
> [Error Id: 954aaffe-616a-4f40-9ba5-d4b7c04fe238 on 
> charless-mbp-2.fios-router.home:31010] (state=,code=0)
> 
> I have verified that the keys are correct but using the AWS CLI and 
> downloaded some of the files, but I’m kind of at a loss as to how to debug.  
> Any suggestions?
> Thanks in advance,
> — C



Re: Exception when querying parquet data

2017-10-09 Thread Padma Penumarthy
Which cloud service is this?
It is not able to read the parquet metadata. Did you run refresh table metadata to
generate the parquet metadata?
Can you manually check if there is a parquet metadata file 
(.drill.parquet_metadata)
in the directory you used in the query, i.e. `data25Goct6/websales`?

Thanks
Padma
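
For reference, a minimal example of the metadata refresh referred to above,
using the table path from the failing query (and assuming the same session
schema as that query):

REFRESH TABLE METADATA `data25Goct6/websales`;
-- on success, a .drill.parquet_metadata file should appear in that directory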


On Oct 9, 2017, at 5:50 AM, PROJJWAL SAHA 
> wrote:

Hello all,

I am getting the below exception when querying parquet data stored in a
storage cloud service. What does this exception point to?
The query on the same parquet files works when they are stored in
Alluxio, which means the data is fine.
I am using Drill 1.11.

Any help is appreciated !

Regards,
Projjwal

2017-10-09 08:11:10,221 [262498a1-4fc1-608e-a7bc-ab2c6ddc09c9:foreman]
INFO  o.a.drill.exec.work.foreman.Foreman - Query text for query id
262498a1-4fc1-608e-a7bc-ab2c6ddc09c9: select count(*) from
`data25Goct6/websales`
2017-10-09 08:11:38,117 [262498a1-4fc1-608e-a7bc-ab2c6ddc09c9:foreman]
INFO  o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses()
took 0 ms, numFiles: 1
2017-10-09 08:11:58,362 [262498a1-4fc1-608e-a7bc-ab2c6ddc09c9:foreman]
INFO  o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses()
took 0 ms, numFiles: 1
2017-10-09 08:15:28,459 [262498a1-4fc1-608e-a7bc-ab2c6ddc09c9:foreman]
INFO  o.a.d.exec.store.parquet.Metadata - Took 105962 ms to get file
statuses
2017-10-09 08:16:00,651 [262498a1-4fc1-608e-a7bc-ab2c6ddc09c9:foreman]
ERROR o.a.d.exec.store.parquet.Metadata - Waited for 27187ms, but
tasks for 'Fetch parquet metadata' are not complete. Total runnable
size 29, parallelism 16.
2017-10-09 08:16:00,652 [262498a1-4fc1-608e-a7bc-ab2c6ddc09c9:foreman]
INFO  o.a.d.exec.store.parquet.Metadata - User Error Occurred: Waited
for 27187ms, but tasks for 'Fetch parquet metadata' are not complete.
Total runnable size 29, parallelism
16.org.apache.drill.common.exceptions.UserException: RESOURCE ERROR:
Waited for 27187ms, but tasks for 'Fetch parquet metadata' are not
complete. Total runnable size 29, parallelism 16.


[Error Id: d9b6ee72-2e81-49ae-846c-61a14931b7ab ]
at 
org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:550)
~[drill-common-1.11.0.jar:1.11.0]
at org.apache.drill.exec.store.TimedRunnable.run(TimedRunnable.java:151)
[drill-java-exec-1.11.0.jar:1.11.0]
at 
org.apache.drill.exec.store.parquet.Metadata.getParquetFileMetadata_v3(Metadata.java:293)
[drill-java-exec-1.11.0.jar:1.11.0]
at 
org.apache.drill.exec.store.parquet.Metadata.getParquetTableMetadata(Metadata.java:270)
[drill-java-exec-1.11.0.jar:1.11.0]
at 
org.apache.drill.exec.store.parquet.Metadata.getParquetTableMetadata(Metadata.java:255)
[drill-java-exec-1.11.0.jar:1.11.0]
at 
org.apache.drill.exec.store.parquet.Metadata.getParquetTableMetadata(Metadata.java:117)
[drill-java-exec-1.11.0.jar:1.11.0]
at 
org.apache.drill.exec.store.parquet.ParquetGroupScan.init(ParquetGroupScan.java:730)
[drill-java-exec-1.11.0.jar:1.11.0]
at 
org.apache.drill.exec.store.parquet.ParquetGroupScan.(ParquetGroupScan.java:226)
[drill-java-exec-1.11.0.jar:1.11.0]
at 
org.apache.drill.exec.store.parquet.ParquetGroupScan.(ParquetGroupScan.java:186)
[drill-java-exec-1.11.0.jar:1.11.0]
at 
org.apache.drill.exec.store.parquet.ParquetFormatPlugin.getGroupScan(ParquetFormatPlugin.java:170)
[drill-java-exec-1.11.0.jar:1.11.0]
at 
org.apache.drill.exec.store.parquet.ParquetFormatPlugin.getGroupScan(ParquetFormatPlugin.java:66)
[drill-java-exec-1.11.0.jar:1.11.0]
at 
org.apache.drill.exec.store.dfs.FileSystemPlugin.getPhysicalScan(FileSystemPlugin.java:144)
[drill-java-exec-1.11.0.jar:1.11.0]
at 
org.apache.drill.exec.store.AbstractStoragePlugin.getPhysicalScan(AbstractStoragePlugin.java:100)
[drill-java-exec-1.11.0.jar:1.11.0]
at 
org.apache.drill.exec.planner.logical.DrillTable.getGroupScan(DrillTable.java:85)
[drill-java-exec-1.11.0.jar:1.11.0]
at 
org.apache.drill.exec.planner.logical.DrillPushProjIntoScan.onMatch(DrillPushProjIntoScan.java:63)
[drill-java-exec-1.11.0.jar:1.11.0]
at 
org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:228)
[calcite-core-1.4.0-drill-r21.jar:1.4.0-drill-r21]
at 
org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:811)
[calcite-core-1.4.0-drill-r21.jar:1.4.0-drill-r21]
at org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:310)
[calcite-core-1.4.0-drill-r21.jar:1.4.0-drill-r21]
at 
org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.transform(DefaultSqlHandler.java:401)
[drill-java-exec-1.11.0.jar:1.11.0]
at 
org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.transform(DefaultSqlHandler.java:343)
[drill-java-exec-1.11.0.jar:1.11.0]
at 
org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToRawDrel(DefaultSqlHandler.java:242)
[drill-java-exec-1.11.0.jar:1.11.0]
at 
org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToDrel(DefaultSqlHandler.java:292)

Re: Drill queries not getting executed-son profile for queries not created

2017-10-09 Thread Padma Penumarthy
If there is a problem connecting to any of the storage plugins that are 
enabled, irrespective of
whether that plugin is used in the query or not, the query can hang.
Disable any plugins you are not using and remove unnecessary workspaces.
You can enable/disable different storage plugins to narrow down the problem.

Thanks
Padma


> On Oct 9, 2017, at 1:05 AM, Tushar Pathare  wrote:
> 
> Hello Team,
> What are the pre-reqs needed to be performed to 
> ensure the queries are getting fired.
> Strangely the Json profiles for queries are not getting created.
> 
> And the query log stops at this message
> 
> 
> drillbits:
> Address | User Port | Control Port | Data Port | Version |
> scflexnode02 | 31010 | 31011 | 31012 | 1.11.0 |
> 
> 2017-10-09 10:57:22,867 [main] INFO  o.apache.drill.exec.server.Drillbit - 
> Startup completed (9619 ms).
> 2017-10-09 10:57:22,869 [main] DEBUG o.apache.drill.exec.server.Drillbit - 
> Started new Drillbit.
> 2017-10-09 11:00:05,464 [2624d37a-4364-1dd5-8d2d-043f4b364b9b:foreman] INFO  
> o.a.drill.exec.work.foreman.Foreman - Query text for query id 
> 2624d37a-4364-1dd5-8d2d-043f4b364b9b: SELECT A.SSDM_ID, A.SOURCE_SYSTEM, 
> A.SOURCE_SYSTEM_ID FROM tc.SAMPLE_MGMT.TBL_CODE_SYSTEM_MAPPING A INNER JOIN 
> tc.SAMPLE_MGMT.TBL_SAMPLE_SECURITY B ON A.SSDM_ID = B.SSDM_ID WHERE B.ACTIVE 
> = 1 AND B.USER_ID ='test'
> 2017-10-09 11:00:05,481 [2624d37a-4364-1dd5-8d2d-043f4b364b9b:foreman] DEBUG 
> o.a.d.e.s.h.HBaseStoragePluginConfig - Initializing HBase StoragePlugin 
> configuration with zookeeper quorum 'localhost', port '2181'.
> 2017-10-09 11:00:40,690 [qtp1218188770-76] DEBUG 
> o.a.d.e.s.h.HBaseStoragePluginConfig - Initializing HBase StoragePlugin 
> configuration with zookeeper quorum 'localhost', port '2181'.
> 
> The query log doesn’t proceed beyong this and query is just running.
> I checked all the connection to the databases and they are ok.
> 
> 
> 
> Tushar B Pathare MBA IT,BE IT
> Bigdata & GPFS
> Software Development & Databases
> Scientific Computing
> Bioinformatics Division
> Research
> 
> "What ever the mind of man can conceive and believe, drill can query"
> 
> Sidra Medical and Research Centre
> Sidra OPC Building
> Sidra Medical & Research Center
> PO Box 26999
> Al Luqta Street
> Education City North Campus
> ​Qatar Foundation, Doha, Qatar
> Office 4003  ext 37443 | M +974 74793547
> tpath...@sidra.org | 
> www.sidra.org
> 
> 



Re: Issue in queying file present in swift compliant object storage

2017-09-14 Thread Padma Penumarthy
Can you query the top-level directory, i.e. /datalake?

Thanks,
Padma
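
For example, something along these lines (assuming the default workspace of
the oscs plugin; the path comes from the original message):

SELECT * FROM oscs.`/datalake` LIMIT 1;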

> On Sep 13, 2017, at 11:43 PM, Chetan Kothari  
> wrote:
> 
> I have registered a plug-in to connect to a container of Oracle Storage Cloud 
> Service using the Swift library.
> 
> I am able to query files present at the top level in the container of Oracle 
> Storage Cloud Service.
> 
> 
> 
> For example -  oscs.`select * from test_tsv` ( test_tsv is file present as 
> object in the container I configured in the storage plug-in ) 
> 
> 
> 
> But I am not able to query a tsv/parquet file with a deep hierarchy present in 
> the container configured in the storage plug-in. 
> 
> for example, when I query, /datalake/replicator/test/requisition_fact/,  it 
> always gives me following error 
> 
> 
> 
> Sep 13, 2017 11:39:00 PM org.apache.calcite.runtime.CalciteException 
> 
> SEVERE: org.apache.calcite.runtime.CalciteContextException: 
> 
> From line 1, column 15 to line 1, column 18: Table 
> 'oscs./datalake/replicator/test/test_requisition_fact' not found
> 
> Error: VALIDATION ERROR: From line 1, column 15 to line 1, column 18: Table 
> 'oscs./datalake/replicator/test/requisition_fact' not found
> 
> 
> 
> SQL Query null
> 
> 
> 
> [Error Id: ad55afb5-ea9f-4a9d-95f7-c5a08f712c73 on 
> slc05gmy.us.oracle.com:31010] (state=,code=0)
> 
> 
> 
> Any input on this will be useful.
> 
> 
> 
> Regards
> 
> Chetan
> 
> 



Re: Querying MapR-DB JSON Tables not returning results when specifying columns or CF's

2017-09-11 Thread Padma Penumarthy
Do you think this is a regression ?  Can you try with Drill 1.11 ?

Thanks,
Padma


> On Sep 11, 2017, at 10:21 AM, Andries Engelbrecht  
> wrote:
> 
> Created a MapR-DB JSON table, but not able to query data specifying column or 
> CF’s.
> 
> When doing a select * the data is returned.
> 
> i.e.
> 
> 0: jdbc:drill:> select * from dfs.maprdb.`/sdc/nycbike` b limit 1;
> ++---+---++-+-+-+---++---+-+---+-+--+-+-+--+--+---++-+---+
> |_id |  age  |  arc  | avg_speed_mph  | bikeid  | 
> birth year  | end station id  | end station latitude  | end station longitude 
>  | end station name  | gender  | start station id  | start station latitude  
> | start station longitude  | start station name  | start_date  |  
> starttime   |   stoptime   | tripduration  |   tripid 
>   |  usertype   |  station  |
> ++---+---++-+-+-+---++---+-+---+-+--+-+-+--+--+---++-+---+
> | 2017-04-01 00:00:58-25454  | 51.0  | 0.39  | 7.2| 25454   | 
> 1966.0  | 430 | 40.7014851| -73.98656928  
>  | York St & Jay St  | M   | 217   | 40.70277159 
> | -73.99383605 | Old Fulton St   | 2017-04-01  | 2017-04-01 
> 00:00:58  | 2017-04-01 00:04:14  | 195   | 2017-04-01 00:00:58-25454  
> | Subscriber  | {"end station id":"430"}  |
> ++---+---++-+-+-+---++---+-+---+-+--+-+-+--+--+---++-+---+
> 1 row selected (0.191 seconds)
> 
> 
> However trying to specify a column or CF name nothing is returned.
> 
> Specify a column name
> 
> 0: jdbc:drill:> select bikeid from dfs.maprdb.`/sdc/nycbike` b limit 10;
> +--+
> |  |
> +--+
> +--+
> No rows selected (0.067 seconds)
> 
> 0: jdbc:drill:> select b.bikeid from dfs.maprdb.`/sdc/nycbike` b limit 1;
> +--+
> |  |
> +--+
> +--+
> No rows selected (0.062 seconds)
> 
> 
> Specify a CF name the same result.
> 
> 0: jdbc:drill:> select b.station from dfs.maprdb.`/sdc/nycbike` b limit 1;
> +--+
> |  |
> +--+
> +--+
> No rows selected (0.063 seconds)
> 
> 
> Drill 1.10 and the user has full read/write/traverse permissions on the table.
> 
> 
> 
> 
> Thanks
> 
> Andries



Re: Workaround for drill queries during node failure

2017-09-11 Thread Padma Penumarthy
Did you mean to say "we could not execute any queries"?

We need more details about the configuration you have.
When you say the data is available on other nodes, is it because you
have replication configured (assuming it is DFS)?

What exactly are you trying, and what error do you see when you try to
execute the query?

Thanks,
Padma


On Sep 11, 2017, at 9:40 AM, Kshitija Shinde 
> wrote:

Hi,

We have installed drill in distributed mode. While testing drillbit we have
observed that if one of node is done then we could execute any queries
against the drill even if data is available on other nodes.



Is there any workaround for this?



Thanks,

Kshitija



Re: Best way to partition the data

2017-09-01 Thread Padma Penumarthy
Have you tried building the metadata cache file using the "refresh table metadata" 
command?
That will help reduce the planning time. Is most of the time spent in planning 
or execution?

Pruning is done at the rowgroup level, i.e. at the file level (we create one file 
per rowgroup).
We do not support pruning at the page level.
If it created 50K files, it means your cardinality is high. You might want to
consider putting some directory hierarchy in place, for example a directory
for each unique value of column 1 and a file for each unique value of column 2 
underneath (see the sketch after this message for one way to lay this out).
If partitioning is done correctly, then depending upon the filters we should 
not read more rowgroups than needed.

Thanks,
Padma
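
As a rough illustration of the layout suggested above (the table, column, and
workspace names below are hypothetical), one CTAS per distinct value of the
first column keeps each value in its own directory, with the second column
driving the files underneath:

-- run once per distinct value of col1
CREATE TABLE dfs.tmp.`events/col1=A` PARTITION BY (col2) AS
  SELECT * FROM dfs.root.`events` WHERE col1 = 'A';

-- rebuild the metadata cache over the new layout
REFRESH TABLE METADATA dfs.tmp.`events`;

-- a filter on both keys can then prune on the directory (dir0) and on the
-- col2 row groups within it
SELECT COUNT(*) FROM dfs.tmp.`events` WHERE dir0 = 'col1=A' AND col2 = 42;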



On Sep 1, 2017, at 6:54 AM, Damien Profeta 
> wrote:

Hello,

I have a dataset that I always query on 2 columns that don't have a big 
cardinality. So to benefit from pruning, I tried to partition the files on these 
keys, but I end up with 50k different small files (30 MB each), and queries on 
them spend most of the time in the planning phase, decoding the metadata file, 
resolving the absolute paths…

Looking at the parquet file structure, I saw that there are statistics at 
page level and chunk level. So I tried to generate parquet files where a page 
is dedicated to one value of the 2 partition columns. By using the statistics, 
Drill could be able to drop the page/chunk.
But it seems Drill is not making any use of the statistics in the parquet files 
because, whatever query I run, I don't see any change in the number of pages 
loaded.

Do you confirm my conclusion? What would be the best way to organize the data 
so that Drill doesn't read the data that can be pruned easily?

Thanks
Damien



Re: Drill Profile page takes too much time to load

2017-08-29 Thread Padma Penumarthy
If these channel closed exceptions happen when you try to list profiles
using the web UI, yes, that could be related. 
One option is to change the configuration to use a new directory for saving 
profiles.
You can delete the old profiles (if you don't need them) or save them. 

Thanks,
Padma


> On Aug 29, 2017, at 4:05 PM, Rahul Raj <rahul@option3consulting.com> 
> wrote:
> 
> Thanks for the reply. I will try deleting the files manually.
> 
> I have a web application that pools connections to Drill. Once this happens
> I can see channel closed exceptions and the entire pool goes stale. Could this
> be due to a long GC pause causing a ZK heartbeat miss?
> 
> Regards,
> Rahul
> 
> On Tue, Aug 29, 2017 at 10:42 PM, Padma Penumarthy <ppenumar...@mapr.com>
> wrote:
> 
>> yes, we save each query profile as file in a single directory.
>> If there are large number of files in the directory, it can cause web UI
>> to hang or slow down.
>> That is because we try to list all the files in the directory when we want
>> to view the profiles from web UI and
>> that is an expensive operation.
>> We have some open JIRAs for this issue (DRILL-2861, DRILL-2362)
>> which we plan to address in the future.
>> For now, you have to delete these files manually if it is an issue for you.
>> 
>> Thanks,
>> Padma
>> 
>>> On Aug 29, 2017, at 7:46 AM, Rahul Raj <rahul@option3consulting.com>
>> wrote:
>>> 
>>> Hi,
>>> 
>>> The drill profile list page(<>:8047/profiles) takes few minutes to
>> load
>>> in one of the installation.
>>> 
>>> There was a considerable amount of processor(20%) and memory(15-20%)
>> usage
>>> during this time. Immediately after displaying the results, values return
>>> to normal.
>>> 
>>> Could this be because of the values accumulated in profile storage? Is it
>>> possible to purge some data?
>>> 
>>> Regards,
>>> Rahul
>>> 
>> 
>> 
> 



Re: Querying Streaming Data using Drill

2017-08-29 Thread Padma Penumarthy
That's great. Thanks for the update. Looking forward to the presentation.

Thanks,
Padma


> On Aug 29, 2017, at 11:48 AM, AnilKumar B <akumarb2...@gmail.com> wrote:
> 
> Hi Padma & Chetan,
> 
> Just wanted to update on https://issues.apache.org/jira/browse/DRILL-4779
> 
> We have developed and tested kafka integration and currently it's working
> for JSON messages. And currently we are working on test cases and Avro
> support.
> 
> We are planning to present this on Sept 18th Drill Developer's day.
> 
> Due to multiple reasons, this feature delayed from long time, but we are
> almost there.
> 
> Repo:
> https://github.com/akumarb2010/incubator-drill/tree/master/contrib/storage-kafka
> 
> 
> 
> 
> Thanks & Regards,
> B Anil Kumar.
> 
> On Tue, Aug 29, 2017 at 10:26 AM, Chetan Kothari <chetan.koth...@oracle.com>
> wrote:
> 
>> Thanks Padma for quick response.
>> 
>> 
>> 
>> This will be a very critical feature to support in Drill, as users will look
>> for a single SQL engine which supports querying both batch and streaming data.
>> 
>> 
>> 
>> Any inputs on when support for querying streaming data will be supported?
>> 
>> 
>> 
>> Regards
>> 
>> Chetan
>> 
>> 
>> 
>> -Original Message-
>> From: Padma Penumarthy [mailto:ppenumar...@mapr.com]
>> Sent: Tuesday, August 29, 2017 10:53 PM
>> To: user@drill.apache.org
>> Subject: Re: Querying Streaming Data using Drill
>> 
>> 
>> 
>> Currently, we do not have support for these storage plugins.
>> 
>> I see an open JIRA for Kafka, not sure how much progress was made (as last
>> update was a while back).
>> 
>> 
>> 
>> https://issues.apache.org/jira/browse/DRILL-4779
>> 
>> 
>> 
>> Thanks,
>> 
>> Padma
>> 
>> 
>> 
>> 
>> 
>> On Aug 29, 2017, at 10:14 AM, Chetan Kothari <chetan.koth...@oracle.com> wrote:
>> 
>> 
>> 
>> Is there any support for querying streaming data using Drill?
>> 
>> 
>> 
>> Presto provides out-of-box Kafka and Amazon Kinesis Connectors for
>> querying streaming data.
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> Regards
>> 
>> 
>> 
>> Chetan
>> 
>> 
>> 
>> 
>> 



Re: Querying Streaming Data using Drill

2017-08-29 Thread Padma Penumarthy
Currently, we do not have support for these storage plugins.
I see an open JIRA for Kafka, not sure how much progress was made (as last 
update was a while back).

https://issues.apache.org/jira/browse/DRILL-4779

Thanks,
Padma


On Aug 29, 2017, at 10:14 AM, Chetan Kothari 
> wrote:

Is there any support for querying streaming data using Drill?

Presto provides out-of-box Kafka and Amazon Kinesis Connectors for querying 
streaming data.



Regards

Chetan



Re: Drill Profile page takes too much time to load

2017-08-29 Thread Padma Penumarthy
Yes, we save each query profile as a file in a single directory.
If there are a large number of files in the directory, it can cause the web UI to 
hang or slow down.
That is because we list all the files in the directory when we want to 
view the profiles from the web UI, and
that is an expensive operation. 
We have some open JIRAs for this issue (DRILL-2861, DRILL-2362) 
which we plan to address in the future.
For now, you have to delete these files manually if it is an issue for you.

Thanks,
Padma

> On Aug 29, 2017, at 7:46 AM, Rahul Raj  
> wrote:
> 
> Hi,
> 
> The Drill profile list page (<>:8047/profiles) takes a few minutes to load
> in one of the installation.
> 
> There was a considerable amount of processor(20%) and memory(15-20%) usage
> during this time. Immediately after displaying the results, values return
> to normal.
> 
> Could this be because of the values accumulated in profile storage? Is it
> possible to purge some data?
> 
> Regards,
> Rahul
> 
> -- 
>  This email and any files transmitted with it are confidential and 
> intended solely for the use of the individual or entity to whom it is 
> addressed. If you are not the named addressee then you should not 
> disseminate, distribute or copy this e-mail. Please notify the sender 
> immediately and delete this e-mail from your system.



Re: Apache Drill unable to read files from HDFS (Resource error: Failed to create schema tree)

2017-08-23 Thread Padma Penumarthy
For HDFS, your storage plugin configuration should be something like this:

{
  "type": "file",
  "enabled": true,
  "connection": "hdfs://:”,   // IP address and port number 
of name node metadata service
  "config": null,
  "workspaces": {
"root": {
  "location": "/",
  "writable": true,
  "defaultInputFormat": null
},
"tmp": {
  "location": "/tmp",
  "writable": true,
  "defaultInputFormat": null
}
  },

Also, try hadoop dfs -ls command to see if you can list the files.

Thanks,
Padma
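
With the connection pointing at the name node, the query from the original
message should then resolve through the plugin's root workspace, for example:

SELECT * FROM hdfs.root.`/names/city.parquet` LIMIT 2;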


On Aug 23, 2017, at 12:18 PM, Lee, David 
> wrote:

The HDFS storage plugin connection should be set to your HDFS name node URL.

-Original Message-
From: Zubair, Muhammad [mailto:muhammad.zub...@rbc.com.INVALID]
Sent: Wednesday, August 23, 2017 11:33 AM
To: user@drill.apache.org
Subject: Apache Drill unable to read files from HDFS (Resource error: Failed to 
create schema tree)

Hello,
After setting up drill on one of the edge nodes of our HDFS cluster, I am 
unable to read any hdfs files. I can query data from local files (as long as 
they are in a folder that has 777 permissions) but querying data from hdfs 
fails with the following error:
Error: RESOURCE ERROR: Failed to create schema tree.
[Error Id: d9f7908c-6c3b-49c0-a11e-71c004d27f46 on server-name:31010] 
(state=,code=0)
Query:
0: jdbc:drill:zk=local> select * from hdfs.`/names/city.parquet` limit 2;
Querying from a local file works fine:
0: jdbc:drill:zk=local> select * from dfs.`/tmp/city.parquet` limit 2;
My HDFS settings are similar to the DFS settings, except for the connection URL 
being the server address instead of file:///. I can't find anything online 
regarding this error for Drill.



Re: Query Optimization

2017-08-21 Thread Padma Penumarthy
That is definitely not the design strategy. Also, I don't think what you are 
seeing is the same as DRILL-3846. The difference between with and without 
metadata caching is a factor of 2-4x in DRILL-3846, whereas what you see is 
orders of magnitude different.

You should file a JIRA and include details that will help us reproduce the 
problem.
Please add as much information as possible.
A sample dataset, how you are creating the table (i.e. partition info), 
logs, query profiles will be very helpful.

Thanks,
Padma


> On Aug 20, 2017, at 7:03 PM, Divya Gehlot <divya.htco...@gmail.com> wrote:
> 
> Hi ,
> Yes, as Rahul mentioned I am running into a bug:
> https://issues.apache.org/jira/browse/DRILL-3846
> 
> As asked, usedMetadataFile is true once I run the metadata cache query.
> Any tentative fix or workaround for the bug?
> 
> Now my question is: if the metadata cache is enabled, does Drill read all the
> files instead of only the intended ones?
> Is that Drill's design strategy?
> 
> Thanks,
> divya
> 
> On 18 August 2017 at 12:13, Padma Penumarthy <ppenumar...@mapr.com> wrote:
> 
>> It is supposed to work like you expected. May be you are running into a
>> bug.
>> Why is it reading all files after metadata refresh ? That is difficult to
>> answer without
>> looking at the logs and query profile. If you look at the query profile,
>> you can may
>> be check what usedMetadataFile flag says for scan.
>> Also, I am thinking if you created so many files, your metadata
>> cache file could be big. May be you can manually sanity
>> check if it looks ok (look for .drill.parquet.metadata file in the root
>> directory) and not
>> corrupted ?
>> 
>> Thanks,
>> Padma
>> 
>> 
>> On Aug 17, 2017, at 8:10 PM, Khurram Faraaz <kfar...@mapr.com<mailto:kfara
>> a...@mapr.com>> wrote:
>> 
>> Please share your SQL query and the query plan.
>> 
>> To get the query plan, execute EXPLAIN PLAN FOR ;
>> 
>> 
>> Thanks,
>> 
>> Khurram
>> 
>> 
>> From: Divya Gehlot <divya.htco...@gmail.com<mailto:divya.htco...@gmail.com
>>>> 
>> Sent: Friday, August 18, 2017 7:15:18 AM
>> To: user@drill.apache.org<mailto:user@drill.apache.org>
>> Subject: Re: Query Optimization
>> 
>> Hi ,
>> Yes its the same query its just the ran the metadata refresh command .
>> My understanding is metadata refresh command saves reading the metadata.
>> How about column values ... Why is it reading all the files after metedata
>> refresh ?
>> Partition helps to retrieve data faster .
>> Like in hive how it happens when you mention the partition column in where
>> condition
>> it just goes and read and improves the query performace .
>> In my query also I where conidtion has  partioning column it should go and
>> read those partitioned files right ?
>> Why is it taking more time ?
>> Does the Drill works in different way compare to hive ?
>> 
>> 
>> Thanks,
>> Divya
>> 
>> On 18 August 2017 at 07:37, Padma Penumarthy <ppenumar...@mapr.com> ppenumar...@mapr.com>> wrote:
>> 
>> It might read all those files if some new data gets added after running
>> refresh metadata cache.
>> If everything is same before and after metadata refresh i.e. no
>> new data added and query is exactly the same, then it should not do that.
>> Also, check if you can partition in  a way that will not create so many
>> files in the
>> first place.
>> 
>> Thanks,
>> Padma
>> 
>> 
>> On Aug 16, 2017, at 10:54 PM, Divya Gehlot <divya.htco...@gmail.com<
>> mailto:divya.htco...@gmail.com>>
>> wrote:
>> 
>> Hi,
>> Another observation is
>> My query had where conditions based on the partition values
>> 
>> Total number of parquet files in directory  - 102290
>> Before Metadata refresh - Its reading only 4 files
>> After metadata refresh - its reading 102290 files
>> 
>> 
>> This is how the refresh metadata works I mean it scans each and every
>> files
>> and get the results ?
>> 
>> I dont  have access to logs now .
>> 
>> Thanks,
>> Divya
>> 
>> On 17 August 2017 at 13:48, Divya Gehlot <divya.htco...@gmail.com> divya.htco...@gmail.com>>
>> wrote:
>> 
>> Hi,
>> Another observation is
>> My query had where conditions based on the partition values
>> Before Metadata refresh - Its reading only 4 files
>> After metadata refresh - its reading 102290 files
>> 
>

Re: Query Optimization

2017-08-17 Thread Padma Penumarthy
It is supposed to work like you expected. Maybe you are running into a bug.
Why is it reading all files after the metadata refresh? That is difficult to 
answer without
looking at the logs and query profile. If you look at the query profile, you 
can check what the usedMetadataFile flag says for the scan.
Also, since you created so many files, your metadata
cache file could be big. Maybe you can manually sanity
check that it looks ok (look for the .drill.parquet.metadata file in the root 
directory) and is not
corrupted?

Thanks,
Padma


On Aug 17, 2017, at 8:10 PM, Khurram Faraaz 
<kfar...@mapr.com<mailto:kfar...@mapr.com>> wrote:

Please share your SQL query and the query plan.

To get the query plan, execute EXPLAIN PLAN FOR ;


Thanks,

Khurram


From: Divya Gehlot <divya.htco...@gmail.com<mailto:divya.htco...@gmail.com>>
Sent: Friday, August 18, 2017 7:15:18 AM
To: user@drill.apache.org<mailto:user@drill.apache.org>
Subject: Re: Query Optimization

Hi ,
Yes its the same query its just the ran the metadata refresh command .
My understanding is metadata refresh command saves reading the metadata.
How about column values ... Why is it reading all the files after metedata
refresh ?
Partition helps to retrieve data faster .
Like in hive how it happens when you mention the partition column in where
condition
it just goes and read and improves the query performace .
In my query also I where conidtion has  partioning column it should go and
read those partitioned files right ?
Why is it taking more time ?
Does the Drill works in different way compare to hive ?


Thanks,
Divya

On 18 August 2017 at 07:37, Padma Penumarthy 
<ppenumar...@mapr.com<mailto:ppenumar...@mapr.com>> wrote:

It might read all those files if some new data gets added after running
refresh metadata cache.
If everything is same before and after metadata refresh i.e. no
new data added and query is exactly the same, then it should not do that.
Also, check if you can partition in  a way that will not create so many
files in the
first place.

Thanks,
Padma


On Aug 16, 2017, at 10:54 PM, Divya Gehlot 
<divya.htco...@gmail.com<mailto:divya.htco...@gmail.com>>
wrote:

Hi,
Another observation is
My query had where conditions based on the partition values

Total number of parquet files in directory  - 102290
Before Metadata refresh - Its reading only 4 files
After metadata refresh - its reading 102290 files


This is how the refresh metadata works I mean it scans each and every
files
and get the results ?

I dont  have access to logs now .

Thanks,
Divya

On 17 August 2017 at 13:48, Divya Gehlot 
<divya.htco...@gmail.com<mailto:divya.htco...@gmail.com>>
wrote:

Hi,
Another observation is
My query had where conditions based on the partition values
Before Metadata refresh - Its reading only 4 files
After metadata refresh - its reading 102290 files

Thanks,
Divya

On 17 August 2017 at 13:03, Padma Penumarthy 
<ppenumar...@mapr.com<mailto:ppenumar...@mapr.com>>
wrote:

Does your query have partition filter ?
Execution time is increased most likely because partition pruning is
not
happening.
Did you get a chance to look at the logs ?  That might give some clues.

Thanks,
Padma


On Aug 16, 2017, at 9:32 PM, Divya Gehlot 
<divya.htco...@gmail.com<mailto:divya.htco...@gmail.com>>
wrote:

Hi,
Even I am surprised .
I am running Drill version 1.10  on MapR enterprise version.
*Query *- Selecting all the columns on partitioned parquet table

I observed few things from Query statistics :

Value       Before Refresh Metadata   After Refresh Metadata
Fragments   1                         13
DURATION    01 min 0.233 sec          18 min 0.744 sec
PLANNING    59.818 sec                33.087 sec
QUEUED      Not Available             Not Available
EXECUTION   0.415 sec                 17 min 27.657 sec

The planning time is reduced by approx 60%, but the execution time
increased drastically.
I would like to understand why the execution time increases after the
metadata refresh.


Appreciate the help.

Thanks,
divya


On 17 August 2017 at 11:54, Padma Penumarthy 
<ppenumar...@mapr.com<mailto:ppenumar...@mapr.com>>
wrote:

Refresh table metadata should  help reduce query planning time.
It is odd that it went up after you did refresh table metadata.
Did you check the logs to see what is happening ? You might have to
turn on some debugs if needed.
BTW, what version of Drill are you running ?

Thanks,
Padma


On Aug 16, 2017, at 8:15 PM, Divya Gehlot 
<divya.htco...@gmail.com<mailto:divya.htco...@gmail.com>>
wrote:

Hi,
I have data in parquet file format .
when I run the query the data and see the execution plan I could see
following
statistics

TOTAL FRAGMENTS: 1
DURATION: 01 min 0.233 sec
PLANNING: 59.818 sec
QUEUED: Not Available
EXECUTION: 0.415 sec



As its a paquet file format I tried enabling refresh meta data
and run below command
REFRESH TABLE METADATA  ;
then run the same query again on

Re: Query Optimization

2017-08-17 Thread Padma Penumarthy
It might read all those files if some new data gets added after running
refresh metadata cache. 
If everything is same before and after metadata refresh i.e. no 
new data added and query is exactly the same, then it should not do that.
Also, check if you can partition in  a way that will not create so many files 
in the
first place.

Thanks,
Padma


> On Aug 16, 2017, at 10:54 PM, Divya Gehlot <divya.htco...@gmail.com> wrote:
> 
> Hi,
> Another observation is
> My query had where conditions based on the partition values
> 
> Total number of parquet files in directory  - 102290
>> Before Metadata refresh - Its reading only 4 files
>> After metadata refresh - its reading 102290 files
> 
> 
> This is how the refresh metadata works I mean it scans each and every files
> and get the results ?
> 
> I dont  have access to logs now .
> 
> Thanks,
> Divya
> 
> On 17 August 2017 at 13:48, Divya Gehlot <divya.htco...@gmail.com> wrote:
> 
>> Hi,
>> Another observation is
>> My query had where conditions based on the partition values
>> Before Metadata refresh - Its reading only 4 files
>> After metadata refresh - its reading 102290 files
>> 
>> Thanks,
>> Divya
>> 
>> On 17 August 2017 at 13:03, Padma Penumarthy <ppenumar...@mapr.com> wrote:
>> 
>>> Does your query have partition filter ?
>>> Execution time is increased most likely because partition pruning is not
>>> happening.
>>> Did you get a chance to look at the logs ?  That might give some clues.
>>> 
>>> Thanks,
>>> Padma
>>> 
>>> 
>>>> On Aug 16, 2017, at 9:32 PM, Divya Gehlot <divya.htco...@gmail.com>
>>> wrote:
>>>> 
>>>> Hi,
>>>> Even I am surprised .
>>>> I am running Drill version 1.10  on MapR enterprise version.
>>>> *Query *- Selecting all the columns on partitioned parquet table
>>>> 
>>>> I observed few things from Query statistics :
>>>> 
>>>> Value       Before Refresh Metadata   After Refresh Metadata
>>>> Fragments   1                         13
>>>> DURATION    01 min 0.233 sec          18 min 0.744 sec
>>>> PLANNING    59.818 sec                33.087 sec
>>>> QUEUED      Not Available             Not Available
>>>> EXECUTION   0.415 sec                 17 min 27.657 sec
>>>> 
>>>> The planning time is reduced by approx 60%, but the execution time
>>>> increased drastically.
>>>> I would like to understand why the execution time increases after the
>>>> metadata refresh.
>>>> 
>>>> 
>>>> Appreciate the help.
>>>> 
>>>> Thanks,
>>>> divya
>>>> 
>>>> 
>>>> On 17 August 2017 at 11:54, Padma Penumarthy <ppenumar...@mapr.com>
>>> wrote:
>>>> 
>>>>> Refresh table metadata should  help reduce query planning time.
>>>>> It is odd that it went up after you did refresh table metadata.
>>>>> Did you check the logs to see what is happening ? You might have to
>>>>> turn on some debugs if needed.
>>>>> BTW, what version of Drill are you running ?
>>>>> 
>>>>> Thanks,
>>>>> Padma
>>>>> 
>>>>> 
>>>>>> On Aug 16, 2017, at 8:15 PM, Divya Gehlot <divya.htco...@gmail.com>
>>>>> wrote:
>>>>>> 
>>>>>> Hi,
>>>>>> I have data in parquet file format.
>>>>>> When I run the query on the data and look at the execution plan, I see the
>>>>>> following
>>>>>> statistics:
>>>>>> 
>>>>>>> TOTAL FRAGMENTS: 1
>>>>>>>> DURATION: 01 min 0.233 sec
>>>>>>>> PLANNING: 59.818 sec
>>>>>>>> QUEUED: Not Available
>>>>>>>> EXECUTION: 0.415 sec
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> As it is a parquet file format, I tried refreshing the metadata
>>>>>> and ran the command below:
>>>>>> REFRESH TABLE METADATA  ;
>>>>>> then ran the same query again on the same table, same data (no changes
>>>>>> in
>>>>>> data), and found the statistics shown below:
>>>>>> 
>>>>>> TOTAL FRAGMENTS: 13
>>>>>>>> DURATION: 14 min 14.604 sec
>>>>>>>> PLANNING: 33.087 sec
>>>>>>>> QUEUED: Not Available
>>>>>>>> EXECUTION: Not Available
>>>>>>> 
>>>>>>> 
>>>>>> The query is still running.
>>>>>> 
>>>>>> Can somebody help me understand why the query is taking so long once I
>>>>>> issue
>>>>>> the refresh metadata command?
>>>>>> 
>>>>>> Appreciate the help!
>>>>>> 
>>>>>> Thanks,
>>>>>> Divya
>>>>> 
>>>>> 
>>> 
>>> 
>> 



Re: Query Optimization

2017-08-16 Thread Padma Penumarthy
Does your query have partition filter ? 
Execution time is increased most likely because partition pruning is not 
happening.
Did you get a chance to look at the logs ?  That might give some clues.

Thanks,
Padma


> On Aug 16, 2017, at 9:32 PM, Divya Gehlot <divya.htco...@gmail.com> wrote:
> 
> Hi,
> Even I am surprised .
> I am running Drill version 1.10  on MapR enterprise version.
> *Query *- Selecting all the columns on partitioned parquet table
> 
> I observed a few things from the query statistics:
> 
> Value      | Before Refresh Metadata | After Refresh Metadata
> Fragments  | 1                       | 13
> DURATION   | 01 min 0.233 sec        | 18 min 0.744 sec
> PLANNING   | 59.818 sec              | 33.087 sec
> QUEUED     | Not Available           | Not Available
> EXECUTION  | 0.415 sec               | 17 min 27.657 sec
> 
> The planning time is reduced by approx 60%, but the execution time
> increased drastically.
> I would like to understand why the execution time increases after the
> metadata refresh.
> 
> 
> Appreciate the help.
> 
> Thanks,
> divya
> 
> 
> On 17 August 2017 at 11:54, Padma Penumarthy <ppenumar...@mapr.com> wrote:
> 
>> Refresh table metadata should  help reduce query planning time.
>> It is odd that it went up after you did refresh table metadata.
>> Did you check the logs to see what is happening ? You might have to
>> turn on some debugs if needed.
>> BTW, what version of Drill are you running ?
>> 
>> Thanks,
>> Padma
>> 
>> 
>>> On Aug 16, 2017, at 8:15 PM, Divya Gehlot <divya.htco...@gmail.com>
>> wrote:
>>> 
>>> Hi,
>>> I have data in parquet file format.
>>> When I run the query on the data and look at the execution plan, I see the
>>> following
>>> statistics:
>>> 
>>>> TOTAL FRAGMENTS: 1
>>>>> DURATION: 01 min 0.233 sec
>>>>> PLANNING: 59.818 sec
>>>>> QUEUED: Not Available
>>>>> EXECUTION: 0.415 sec
>>>> 
>>>> 
>>> 
>>> As it is a parquet file format, I tried refreshing the metadata
>>> and ran the command below:
>>> REFRESH TABLE METADATA  ;
>>> then ran the same query again on the same table, same data (no changes in
>>> data), and found the statistics shown below:
>>> 
>>> TOTAL FRAGMENTS: 13
>>>>> DURATION: 14 min 14.604 sec
>>>>> PLANNING: 33.087 sec
>>>>> QUEUED: Not Available
>>>>> EXECUTION: Not Available
>>>> 
>>>> 
>>> The query is still running.
>>> 
>>> Can somebody help me understand why the query is taking so long once I
>> issue
>>> the refresh metadata command?
>>> 
>>> Appreciate the help!
>>> 
>>> Thanks,
>>> Divya
>> 
>> 



Re: Query Optimization

2017-08-16 Thread Padma Penumarthy
Refresh table metadata should  help reduce query planning time.
It is odd that it went up after you did refresh table metadata.
Did you check the logs to see what is happening ? You might have to
turn on some debugs if needed.
BTW, what version of Drill are you running ?

Thanks,
Padma


> On Aug 16, 2017, at 8:15 PM, Divya Gehlot  wrote:
> 
> Hi,
> I have data in parquet file format.
> When I run the query on the data and look at the execution plan, I see the
> following
> statistics:
> 
>> TOTAL FRAGMENTS: 1
>>> DURATION: 01 min 0.233 sec
>>> PLANNING: 59.818 sec
>>> QUEUED: Not Available
>>> EXECUTION: 0.415 sec
>> 
>> 
> 
> As it is a parquet file format, I tried refreshing the metadata
> and ran the command below:
> REFRESH TABLE METADATA  ;
> then ran the same query again on the same table, same data (no changes in
> data), and found the statistics shown below:
> 
> TOTAL FRAGMENTS: 13
>>> DURATION: 14 min 14.604 sec
>>> PLANNING: 33.087 sec
>>> QUEUED: Not Available
>>> EXECUTION: Not Available
>> 
>> 
> The query is still running.
> 
> Can somebody help me understand why the query is taking so long once I issue
> the refresh metadata command?
> 
> Appreciate the help!
> 
> Thanks,
> Divya



Re: Writing to the connected data source?

2017-08-12 Thread Padma Penumarthy
Workspaces are supported only for file-system-based storage plugins (DFS, S3).
You can create tables in workspaces by configuring the workspaces as writable.
A table is created as file(s) of type parquet, json, psv, csv or tsv, based
on the configuration option "store.format".
For Mongo, creating tables is not supported.
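As a rough sketch (the workspace, plugin and table names are hypothetical), writing data out through Drill therefore means reading from the source and doing a CTAS into a writable file-system workspace:

  -- sketch only: dfs.tmp must be configured as writable in the dfs plugin
  ALTER SESSION SET `store.format` = 'parquet';
  CREATE TABLE dfs.tmp.`customers_copy` AS
  SELECT * FROM mongo.mydb.`customers`;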

Thanks,
Padma


> On Aug 11, 2017, at 8:43 PM, Akshat Jiwan Sharma  
> wrote:
> 
> Hello Everyone,
> 
> My use case is that I'd like to be able to use Drill as a proxy for my data 
> sources, not only for querying the data but also for writing the data.
> Is it possible to write data to the connected data store using apache drill 
> rest api? Or perhaps with a custom function 
> ?
> 
> As I understand, there is a CTAS
> function that allows us to
> create new tables, but there are a few restrictions:
> 
> >"You cannot create tables using storage plugins, such as Hive and HBase, 
> >that do not specify a workspace."
> 
> Does this mean that there are some storage plugins that do specify a
> workspace? I looked at the documentation of the mongodb storage plugin and wasn't
> sure whether it supports workspaces. How can I determine if a storage plugin is
> writable or not?
> 
> Thanks,
> Akshat



Re: Drill on secure HDFS

2017-08-12 Thread Padma Penumarthy
Did you look at this ?
https://drill.apache.org/docs/configuring-kerberos-authentication/

Also, this JIRA has some details.
https://issues.apache.org/jira/browse/DRILL-3584
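As a rough, untested sketch (the host name is a placeholder, and the exact set of properties needed depends on your Hadoop setup; the links above are authoritative), the dfs plugin can pass Hadoop security properties through its "config" block, and each drillbit also needs valid Kerberos credentials (ticket or keytab) on its node:

  {
    "type": "file",
    "enabled": true,
    "connection": "hdfs://<namenode-host>:8020/",
    "config": {
      "hadoop.security.authentication": "kerberos"
    },
    "workspaces": {
      "root": { "location": "/", "writable": false, "defaultInputFormat": null }
    }
  }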

Thanks,
Padma


On Aug 12, 2017, at 5:04 AM, Ascot Moss 
> wrote:

Does Drill's File Storage Plugin (DFS Plugin) support Hadoop-Kerberos? I
cannot find any link about this.

https://drill.apache.org/docs/file-system-storage-plugin/
This document does not provide much detail about whether Drill supports
Hadoop-Kerberos or not.

On Sat, Aug 12, 2017 at 4:09 PM, Ascot Moss  wrote:

Hi,

I have HDFS with Kerberos enabled, how to configure drill's dfs plugin in
order to enable Kerberos authentication?

regards




Re: unable to run drill sql in shell script

2017-08-02 Thread Padma Penumarthy
Did you look at this link ?

https://drill.apache.org/docs/starting-drill-on-windows/


It says, to start drill on windows, do this:
sqlline.bat -u "jdbc:drill:zk=local"


Try that and see.
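A minimal Git Bash sketch of that (the install path is hypothetical; adjust to your layout):

  #!/bin/bash
  # assumes Drill is unpacked at this location
  DRILL_BIN="/c/ApacheDrill/apache-drill-1.11.0/bin"
  "$DRILL_BIN/sqlline.bat" -u "jdbc:drill:zk=local"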

Thanks,
Padma


From: Divya Gehlot 
Sent: Wednesday, August 2, 2017 8:50 PM
To: user@drill.apache.org
Subject: unable to run drill sql in shell script

Hi,
I have installed Drill on Windows in embedded mode.
I have Git Bash installed and I am trying to start sqlline through a
shell script.
Below is the script which I am using:

> SQLLINE="/c/ApacheDrill/apache-drill-1.11.0.tar/apache-drill-1.11.0/bin/drill-localhost
> -u \"jdbc:drill:zk=localhost:2181\"";
>  $SQLLINE


When I run above shell script it gives me below error :

> Running query from bash
> Error: Could not find or load main class sqlline.SqlLine
> Java HotSpot(TM) 64-Bit Server VM warning: ignoring option
> MaxPermSize=512M; support was removed in 8.0


Options which I have tried are:
1. sqlline
2. drill-localhost
3. drill-embedded

All give the same error as above.

Appreciate the help.


Thanks,
Divya


Re: System Error PhysicalOperatorSetupException: Can not parallelize fragment

2017-08-02 Thread Padma Penumarthy

Actually, the way parallelization works is documented in this JIRA,

https://issues.apache.org/jira/browse/DRILL-4706


Essentially, if we have screen and scan in the same fragment, affinity for the
fragment will be hard. So, that is not the problem here. The problem seems to be in
how the groupScan is implemented. Based on the logs, both endpoints are marked as
mandatory (is that needed?) and at the same time, max parallelization seems to
have been configured as 1 (getMaxParallelizationWidth() in the groupScan).


Thanks,
Padma


From: Jinfeng Ni 
Sent: Wednesday, August 2, 2017 11:30 AM
To: user
Subject: Re: System Error PhysicalOperatorSetupException: Can not parallelize 
fragment

Are you querying a new storage plugin (I saw the stack trace shows
"indexR"), and that storage plugin is using
HardAffinityFragmentParallelizer? As far as I know, the hard assignment is
only used for distributed system tables, or some special operator (Screen)
in Drill codebase. The majority cases use soft one.  Hard assignment means
to assign at least one minor fragment to each available drill endpoint.
For instance, screen has to assign to the foreman node, while distributed
system tables requires at least one minor fragment for each drillbit.

In your case, seems you have two drillbits, while the fragment # is only 1.
That fails the requirement for hard assignment.  Either you choose to use
soft assignment, or make # of drillbits <= # of fragments.


On Wed, Aug 2, 2017 at 1:17 AM, 何建军  wrote:

> Hi, nice to meet you.
> Thanks for reading this email.
> When using Drill to query, I hit an error that confuses me a lot, and I don't
> know how to solve it. I hope you could check and help me. Thanks a lot.
>
>
> To find any useful message, I modified some classes to print the error.
> Here is my modification:
> drill-java-exec-1.9.0.jar
> org.apache.drill.exec.planner.fragment.HardAffinityFragmentParalleliz
> er.java
> line 80
> from
> checkOrThrow(endpointPool.size() <= width, logger,
> "Number of mandatory endpoints ({}) that require an assignment is
> more than the allowed fragment max " +
> "width ({}).", endpointPool.size(), pInfo.getMaxWidth());
> to
> checkOrThrow(endpointPool.size() <= width, logger,
> "Number of mandatory endpoints ({}) that require an assignment
> is more than the allowed fragment max " +
> "width ({}), fragment: {}, endpointPool: {},
> pInfo.getEndpointAffinityMap(): {}.",
> endpointPool.size(),
> pInfo.getMaxWidth(),
> fragmentWrapper.getNode(),
> org.apache.commons.lang3.StringUtils.join(
> endpointPool.values(), ","),
> org.apache.commons.lang3.StringUtils.join(pInfo.
> getEndpointAffinityMap().values(), ","));
>
>
> Here is the message of the error log:
>
>
>
>
> [Error Id: 258f549e-d618-4dcb-994a-3e3908346033 on slave1:31010]
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR:
> PhysicalOperatorSetupException: Can not parallelize fragment.
>
>
>
>
> [Error Id: 258f549e-d618-4dcb-994a-3e3908346033 on slave1:31010]
> at org.apache.drill.common.exceptions.UserException$
> Builder.build(UserException.java:543) ~[drill-common-1.9.0.jar:1.9.0]
> at 
> org.apache.drill.exec.work.foreman.Foreman$ForemanResult.close(Foreman.java:825)
> [drill-java-exec-1.9.0.jar:1.9.0]
> at org.apache.drill.exec.work.foreman.Foreman.moveToState(Foreman.java:935)
> [drill-java-exec-1.9.0.jar:1.9.0]
> at org.apache.drill.exec.work.foreman.Foreman.run(Foreman.java:281)
> [drill-java-exec-1.9.0.jar:1.9.0]
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [na:1.8.0_121]
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [na:1.8.0_121]
> at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
> Caused by: org.apache.drill.exec.work.foreman.ForemanException:
> Unexpected exception during fragment initialization: Can not parallelize
> fragment.
> at org.apache.drill.exec.work.foreman.Foreman.run(Foreman.java:282)
> [drill-java-exec-1.9.0.jar:1.9.0]
> ... 3 common frames omitted
> Caused by: org.apache.drill.exec.physical.PhysicalOperatorSetupException:
> Can not parallelize fragment.
> at org.apache.drill.exec.planner.fragment.HardAffinityFragmentParalleliz
> er.checkOrThrow(HardAffinityFragmentParallelizer.java:162)
> ~[drill-java-exec-1.9.0.jar:1.9.0]
> at org.apache.drill.exec.planner.fragment.HardAffinityFragmentParalleliz
> er.parallelizeFragment(HardAffinityFragmentParallelizer.java:89)
> ~[drill-java-exec-1.9.0.jar:1.9.0]
> at org.apache.drill.exec.planner.fragment.SimpleParallelizer.
> parallelizeFragment(SimpleParallelizer.java:253)
> ~[drill-java-exec-1.9.0.jar:1.9.0]
> at org.apache.drill.exec.planner.fragment.SimpleParallelizer.
> getFragmentsHelper(SimpleParallelizer.java:167)
> ~[drill-java-exec-1.9.0.jar:1.9.0]
> at 
> 

Re: Drill performance tuning parquet

2017-07-31 Thread Padma Penumarthy
There is one more thing you need to consider, memory.  In general, adding more 
CPUs is better than adding nodes as long as you don't hit other bottlenecks.  
Once that happens, you might consider adding nodes.


Thanks,

Padma


From: Dan Holmes 
Sent: Monday, July 31, 2017 5:23 AM
To: user@drill.apache.org
Subject: RE: Drill performance tuning parquet

When is it right to add nodes vs adding CPUs?  Since my installation is on AWS, 
adding CPUs is relatively easy.  When does it make sense to add nodes instead 
of CPUs?

Dan Holmes | Revenue Analytics, Inc.
Direct: 770.859.1255
www.revenueanalytics.com




-Original Message-
From: Kunal Khatua [mailto:kkha...@mapr.com]
Sent: Friday, July 28, 2017 12:51 PM
To: user@drill.apache.org
Subject: RE: Drill performance tuning parquet

I also forgot to mention... within the drill-override.conf, is a parameter 
you'd need to set to constrain the async parquet reader's scan pool size. If 
you have just 4 cores, the pool's 4 threads will compete with your other 
fragments for CPU. Of course, all of this depends on what the metrics in query 
profile reveal.

-Original Message-
From: Jinfeng Ni [mailto:j...@apache.org]
Sent: Friday, July 28, 2017 7:41 AM
To: user 
Subject: Re: Drill performance tuning parquet

The number you posted seems to show that the query elapse time is highly 
impacted by the number of scan minor fragments (scan parallelization degree).

In Drill, scan parallelization degree is capped at minimum of # of parquet row 
groups, or 70% of cpu cores. In your original configuration, since you only 
have 3 file each with one row group, Drill will have only up to 3 scan minor 
fragments ( you can confirm that by looking at the query profile).
With decreased blocksize, you have more parquet files, and hence higher scan 
parallelization degree, and better performance. In the case of 4 cores, the
scan parallelization degree is capped at 4 * 70%, rounded down to 2, which probably explains why
reducing the blocksize does not help.

The 900MB total parquet  file size is relatively small. If you want to turn 
Drill for such small dataset, you probably need smaller parquet file size.
In the case of 4 cores, you may consider bump up the following parameter.

`planner.width.max_per_node`
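For reference, a sketch of checking and bumping it for a session (the value 8 is just an example; pick it based on your core count):

  SELECT * FROM sys.options WHERE name = 'planner.width.max_per_node';
  ALTER SESSION SET `planner.width.max_per_node` = 8;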





On Fri, Jul 28, 2017 at 7:03 AM, Dan Holmes 
wrote:

> Thank you for the tips.  I have used 4 different block sizes.   It appears
> to scale linearly and anything less than the 512 blocksize was of
> similar performance.  I rounded the numbers to whole seconds.  The
> data is local to the EC2 instance; I did not put the data on EFS.  I
> used the same data files.  After I created it the first time I put the
> data on s3 and copied it to the others.
>
> If there are other configurations that someone is interested in, I
> would be willing to try them out.  I have something to gain in that too.
>
> Here's the data for the interested.
>
> vCPU x Blocksize
> 64  128 256 512
> m3.xlarge - 16  6   6   5   12
> c3.2xlarge - 8  11  11  11  20
> c4.4xlarge - 4  20  20  20  20
>
>
> Dan Holmes | Revenue Analytics, Inc.
> Direct: 770.859.1255
> www.revenueanalytics.com
>
> -Original Message-
> From: Kunal Khatua [mailto:kkha...@mapr.com]
> Sent: Friday, July 28, 2017 2:38 AM
> To: user@drill.apache.org
> Subject: RE: Drill performance tuning parquet
>
> Look at the query profile's (in the UI) "operator  profiles - overview"
> section. The % Query Time is a good indicator of which operator
> consumes the most CPU. Changing the planner.width.max_per_node
> actually will affect this (positively or negatively, depending on the load).
>
> Within the same Operator Profile, also look at Average and Max Wait times.
> See if the numbers are unusually high.
>
> Scrolling further down, (since you are working with parquet) the
> Parquet RowGroup Scan operator can be expanded to show the minor
> fragment (worker/leaf fragments) metrics. Since you have 3 files, you
> probably will see only 3 entries... since each fragment will scan 1
> row group in the parquet file. (I'm making the assumption that u have
> only 1 rowgroup per file). Just at the end of that table, you'll see
> "OperatorMetrics". This gives you the time (in nanosec) and other
> metrics it takes for these fragments to be handed data by the pool of Async 
> Parquet Reader threads.
>
> Most 

Re: System Error PhysicalOperatorSetupException: Can not parallelize fragment

2017-07-31 Thread Padma Penumarthy
Looking at the log you provided, seems like you are working on indexr plugin ? 
You probably have a groupScan implemented for the plugin.


What this error is saying is:
you want to have minor fragments on the 2 nodes you have (slave1 and slave2)
because they are marked as mandatory; however, you configured your overall max
parallelization as only 1.


You should check how you are configuring/calculating endpoint affinity for your 
groupScan. Both endpoints are marked as mandatory. Also, for slave2, the values 
do not look correct. Typically, affinity value should be <= 1.


EndpointAffinity [endpoint=address: "slave1" user_port: 31010 control_port: 
31011 data_port: 31012, affinity=1.0, mandatory=true, 
maxWidth=1],EndpointAffinity [endpoint=address: "slave2" user_port: 31010 
control_port: 31011 data_port: 31012, affinity=3616256.0, mandatory=true, 
maxWidth=2147483647]


Overall max parallelization is obtained from the getMaxParallelizationWidth() you
implemented for your groupScan. It seems like you are returning 1 for that, which
contradicts how you are configuring the endpoint affinities.


You can look at other groupScan implementations (Parquet or HBase) to see how 
this should be done.
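A very rough illustration of the idea (this is not actual indexr code; the field name below is hypothetical, only getMaxParallelizationWidth() is the real GroupScan method):

  // Sketch: report a width that matches how many minor fragments the scan
  // can really be split into (e.g. one per segment/region), instead of a
  // hard-coded 1 that contradicts the endpoint affinities.
  @Override
  public int getMaxParallelizationWidth() {
    return segmentsToScan.size();  // hypothetical list of scan work units
  }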



Thanks,

Padma


From: 何建军 
Sent: Sunday, July 30, 2017 6:33:04 PM
To: user@drill.apache.org
Subject: System Error PhysicalOperatorSetupException: Can not parallelize 
fragment

Hi, nice to meet you.
Thanks for reading this email.
When using Drill to query, I hit an error that confuses me a lot, and I don't know
how to solve it. I hope you could check and help me. Thanks a lot.


To find any useful message, I modified some classes to print the error.
Here is my modification:
drill-java-exec-1.9.0.jar
org.apache.drill.exec.planner.fragment.HardAffinityFragmentParallelizer.java
line 80
from
checkOrThrow(endpointPool.size() <= width, logger,
"Number of mandatory endpoints ({}) that require an assignment is more 
than the allowed fragment max " +
"width ({}).", endpointPool.size(), pInfo.getMaxWidth());
to
checkOrThrow(endpointPool.size() <= width, logger,
"Number of mandatory endpoints ({}) that require an assignment is 
more than the allowed fragment max " +
"width ({}), fragment: {}, endpointPool: {}, 
pInfo.getEndpointAffinityMap(): {}.",
endpointPool.size(),
pInfo.getMaxWidth(),
fragmentWrapper.getNode(),
org.apache.commons.lang3.StringUtils.join( endpointPool.values(), 
","),

org.apache.commons.lang3.StringUtils.join(pInfo.getEndpointAffinityMap().values(),
 ","));


Here is the message of the error log:




[Error Id: 258f549e-d618-4dcb-994a-3e3908346033 on slave1:31010]
org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
PhysicalOperatorSetupException: Can not parallelize fragment.




[Error Id: 258f549e-d618-4dcb-994a-3e3908346033 on slave1:31010]
at 
org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:543)
 ~[drill-common-1.9.0.jar:1.9.0]
at 
org.apache.drill.exec.work.foreman.Foreman$ForemanResult.close(Foreman.java:825)
 [drill-java-exec-1.9.0.jar:1.9.0]
at org.apache.drill.exec.work.foreman.Foreman.moveToState(Foreman.java:935) 
[drill-java-exec-1.9.0.jar:1.9.0]
at org.apache.drill.exec.work.foreman.Foreman.run(Foreman.java:281) 
[drill-java-exec-1.9.0.jar:1.9.0]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
[na:1.8.0_121]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
[na:1.8.0_121]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
Caused by: org.apache.drill.exec.work.foreman.ForemanException: Unexpected 
exception during fragment initialization: Can not parallelize fragment.
at org.apache.drill.exec.work.foreman.Foreman.run(Foreman.java:282) 
[drill-java-exec-1.9.0.jar:1.9.0]
... 3 common frames omitted
Caused by: org.apache.drill.exec.physical.PhysicalOperatorSetupException: Can 
not parallelize fragment.
at 
org.apache.drill.exec.planner.fragment.HardAffinityFragmentParallelizer.checkOrThrow(HardAffinityFragmentParallelizer.java:162)
 ~[drill-java-exec-1.9.0.jar:1.9.0]
at 
org.apache.drill.exec.planner.fragment.HardAffinityFragmentParallelizer.parallelizeFragment(HardAffinityFragmentParallelizer.java:89)
 ~[drill-java-exec-1.9.0.jar:1.9.0]
at 
org.apache.drill.exec.planner.fragment.SimpleParallelizer.parallelizeFragment(SimpleParallelizer.java:253)
 ~[drill-java-exec-1.9.0.jar:1.9.0]
at 
org.apache.drill.exec.planner.fragment.SimpleParallelizer.getFragmentsHelper(SimpleParallelizer.java:167)
 ~[drill-java-exec-1.9.0.jar:1.9.0]
at 
org.apache.drill.exec.planner.fragment.SimpleParallelizer.getFragments(SimpleParallelizer.java:126)
 ~[drill-java-exec-1.9.0.jar:1.9.0]
at 
org.apache.drill.exec.work.foreman.Foreman.getQueryWorkUnit(Foreman.java:596) 
[drill-java-exec-1.9.0.jar:1.9.0]
at 

Re: [HANGOUT] Topics for 7/25/17

2017-07-24 Thread Padma Penumarthy
I have a topic to discuss. A lot of folks on the user mailing list raised
the issue of not being able to access all S3 regions using Drill.
We need Hadoop version 2.8 or higher to be able to connect to
regions which support only the Version 4 signature.
I tried with 2.8.1, which just got released, and it works, i.e. I am able to
connect to both old and new regions (by specifying the endpoint in the config).
There are some failures in unit tests, which can be fixed.
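For reference, the endpoint I am referring to is the s3a property in the S3 plugin's "config" block, something like (the region below is just an example of a V4-signature-only region):

  "config": {
    "fs.s3a.endpoint": "s3.eu-central-1.amazonaws.com"
  }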

Fixing S3 connectivity issues is important.
However, the Hadoop release notes for 2.8.1 (and 2.8.0 as well) say the
following:
"Please note that 2.8.x release line continues to be not yet ready for
production use".

So, should we or not move to 2.8.1 ?

Thanks,
Padma


On Jul 24, 2017, at 9:46 AM, Arina Yelchiyeva 
> wrote:

Hi all,

We'll have the hangout tomorrow at the usual time [1]. Any topics to be
discussed?

[1] https://drill.apache.org/community-resources/

Kind regards
Arina



Re: Increasing store.parquet.block-size

2017-06-14 Thread Padma Penumarthy
Sure. I will check and try to fix them as well.

Thanks,
Padma

> On Jun 14, 2017, at 3:12 AM, Khurram Faraaz <kfar...@mapr.com> wrote:
> 
> Thanks Padma. There are some more related failures reported in DRILL-2478, do 
> you think we should fix them too, if it is an easy fix.
> 
> 
> Regards,
> 
> Khurram
> 
> ________
> From: Padma Penumarthy <ppenumar...@mapr.com>
> Sent: Wednesday, June 14, 2017 11:43:16 AM
> To: user@drill.apache.org
> Subject: Re: Increasing store.parquet.block-size
> 
> I think you meant MB (not GB) below.
> HDFS allows creation of very large files (theoretically, there is no limit).
> I am wondering why a >2GB file is a problem. Maybe it is a blockSize >2GB that
> is not recommended.
> 
> Anyway, we should not let the user set an arbitrary value and then throw
> an error later.
> I opened a PR to fix this.
> https://github.com/apache/drill/pull/852
> 
> Thanks,
> Padma
> 
> 
> On Jun 9, 2017, at 11:36 AM, Kunal Khatua 
> <kkha...@mapr.com<mailto:kkha...@mapr.com>> wrote:
> 
> The ideal size depends on what engine is consuming the parquet files (Drill, 
> i'm guessing) and the storage layer. For HDFS, which is usually 
> 128-256GB, we recommend to bump it to about 512GB (with the underlying HDFS 
> blocksize to match that).
> 
> 
> You'll probably need to experiment a little with different blocks sizes 
> stored on S3 to see which works the best.
> 
> <http://www.mapr.com/>
> 
> 
> From: Shuporno Choudhury 
> <shuporno.choudh...@manthan.com<mailto:shuporno.choudh...@manthan.com>>
> Sent: Friday, June 9, 2017 11:23:37 AM
> To: user@drill.apache.org<mailto:user@drill.apache.org>
> Subject: Re: Increasing store.parquet.block-size
> 
> Thanks for the information Kunal.
> After the conversion, the file size scales down to half if I use gzip
> compression.
> For a 10 GB gzipped csv source file, it becomes 5GB (2+2+1) parquet file
> (using gzip compression).
> So, if I have to make multiple parquet files, what block size would be
> optimal, if I have to read the file later?
> 
> On 09-Jun-2017 11:28 PM, "Kunal Khatua" 
> <kkha...@mapr.com<mailto:kkha...@mapr.com>> wrote:
> 
> 
> If you're storing this in S3... you might want to selectively read the
> files as well.
> 
> 
> I'm only speculating, but if you want to download the data, downloading as
> a queue of files might be more reliable than one massive file. Similarly,
> within AWS, it *might* be faster to have an EC2 instance access a couple of
> large Parquet files versus one massive Parquet file.
> 
> 
> Remember that when you create a large block size, Drill tries to write
> everything within a single row group for each. So there is no chance of
> parallelization of the read (i.e. reading parts in parallel). The defaults
> should work well for S3 as well, and with the compression (e.g. Snappy),
> you should get a reasonably smaller file size.
> 
> 
> With the current default settings... have you seen what Parquet file sizes
> you get with Drill when converting your 10GB CSV source files?
> 
> 
> 
> From: Shuporno Choudhury 
> <shuporno.choudh...@manthan.com<mailto:shuporno.choudh...@manthan.com>>
> Sent: Friday, June 9, 2017 10:50:06 AM
> To: user@drill.apache.org<mailto:user@drill.apache.org>
> Subject: Re: Increasing store.parquet.block-size
> 
> Thanks Kunal for your insight.
> I am actually converting some .csv files and storing them in parquet format
> in s3, not in HDFS.
> The size of the individual .csv source files can be quite huge (around
> 10GB).
> So, is there a way to overcome this and create one parquet file or do I
> have to go ahead with multiple parquet files?
> 
> On 09-Jun-2017 11:04 PM, "Kunal Khatua" 
> <kkha...@mapr.com<mailto:kkha...@mapr.com>> wrote:
> 
> Shuporno
> 
> 
> There are some interesting problems when using Parquet files > 2GB on
> HDFS.
> 
> 
> If I'm not mistaken, the HDFS APIs that allow you to read offsets (oddly
> enough) returns an int value. Large Parquet blocksize also means you'll
> end
> up having the file span across multiple HDFS blocks, and that would make
> reading of rowgroups inefficient.
> 
> 
> Is there a reason you want to create such a large parquet file?
> 
> 
> ~ Kunal
> 
> 
> From: Vitalii Diravka 
> <vitalii.dira...@gmail.com<mailto:vitalii.dira...@gmail.com>>
> Sent: Friday, June 9, 2017 4:49:02 AM
> To: user@drill.apache.org<mailto:user@drill.apac

Re: Increasing store.parquet.block-size

2017-06-14 Thread Padma Penumarthy
I think you meant MB (not GB) below.
HDFS allows creation of very large files (theoretically, there is no limit).
I am wondering why a >2GB file is a problem. Maybe it is a blockSize >2GB that is
not recommended.

Anyway, we should not let the user set an arbitrary value and then throw an
error later.
I opened a PR to fix this.
https://github.com/apache/drill/pull/852

Thanks,
Padma


On Jun 9, 2017, at 11:36 AM, Kunal Khatua 
> wrote:

The ideal size depends on what engine is consuming the parquet files (Drill, 
i'm guessing) and the storage layer. For HDFS, which is usually 128-256GB, 
we recommend to bump it to about 512GB (with the underlying HDFS blocksize to 
match that).


You'll probably need to experiment a little with different blocks sizes stored 
on S3 to see which works the best.




From: Shuporno Choudhury 
>
Sent: Friday, June 9, 2017 11:23:37 AM
To: user@drill.apache.org
Subject: Re: Increasing store.parquet.block-size

Thanks for the information Kunal.
After the conversion, the file size scales down to half if I use gzip
compression.
For a 10 GB gzipped csv source file, it becomes 5GB (2+2+1) parquet file
(using gzip compression).
So, if I have to make multiple parquet files, what block size would be
optimal, if I have to read the file later?

On 09-Jun-2017 11:28 PM, "Kunal Khatua" 
> wrote:


If you're storing this in S3... you might want to selectively read the
files as well.


I'm only speculating, but if you want to download the data, downloading as
a queue of files might be more reliable than one massive file. Similarly,
within AWS, it *might* be faster to have an EC2 instance access a couple of
large Parquet files versus one massive Parquet file.


Remember that when you create a large block size, Drill tries to write
everything within a single row group for each. So there is no chance of
parallelization of the read (i.e. reading parts in parallel). The defaults
should work well for S3 as well, and with the compression (e.g. Snappy),
you should get a reasonably smaller file size.


With the current default settings... have you seen what Parquet file sizes
you get with Drill when converting your 10GB CSV source files?



From: Shuporno Choudhury 
>
Sent: Friday, June 9, 2017 10:50:06 AM
To: user@drill.apache.org
Subject: Re: Increasing store.parquet.block-size

Thanks Kunal for your insight.
I am actually converting some .csv files and storing them in parquet format
in s3, not in HDFS.
The size of the individual .csv source files can be quite huge (around
10GB).
So, is there a way to overcome this and create one parquet file or do I
have to go ahead with multiple parquet files?

On 09-Jun-2017 11:04 PM, "Kunal Khatua" 
> wrote:

Shuporno


There are some interesting problems when using Parquet files > 2GB on
HDFS.


If I'm not mistaken, the HDFS APIs that allow you to read offsets (oddly
enough) returns an int value. Large Parquet blocksize also means you'll
end
up having the file span across multiple HDFS blocks, and that would make
reading of rowgroups inefficient.


Is there a reason you want to create such a large parquet file?


~ Kunal


From: Vitalii Diravka 
>
Sent: Friday, June 9, 2017 4:49:02 AM
To: user@drill.apache.org
Subject: Re: Increasing store.parquet.block-size

Khurram,

DRILL-2478 is a good place holder for the LongValidator issue, it really
works wrong.

But other issue connected to impossibility to use long values for parquet
block-size.
This issue can be independent task or a sub-task of updating Drill
project
to a latest parquet library.

Kind regards
Vitalii

On Fri, Jun 9, 2017 at 10:25 AM, Khurram Faraaz 
>
wrote:

 1.  DRILL-2478 is
Open for this issue.
 2.  I have added more details into the comments.

Thanks,
Khurram


From: Shuporno Choudhury 
>
Sent: Friday, June 9, 2017 12:48:41 PM
To: user@drill.apache.org
Subject: Increasing store.parquet.block-size

The max value that can be assigned to *store.parquet.block-size *is
*2147483647*, as the value kind of this configuration parameter is
LONG.
This basically translates to 2GB of block size.
How do I increase it to 3/4/5 GB ?
Trying to set this parameter to a higher value using the following
command
actually succeeds :
   ALTER SYSTEM SET 

Re: [HANGOUT] Topics for 6/12/17

2017-06-13 Thread Padma Penumarthy
Thank you all for attending the hangout today.

Here are the meeting minutes:

Muhammed asked a question about how the client chooses which drillbit to connect to.
The client gets the information about available drillbits from zookeeper and
just does round robin to select the node for a query. That drillbit becomes the
foreman for the query.

Paul discussed in detail the memory fragmentation problem we have in drill and 
how we
are planning to address that. It was very insightful. Thanks Paul.
You can refer to DRILL-5211<https://issues.apache.org/jira/browse/DRILL-5211> 
for more details. It has links to background information and design docs.
Please ask questions and provide comments in the JIRA.

For the next meeting, we can choose another topic like this and go in detail.

Thanks,
Padma


On Jun 12, 2017, at 10:21 AM, Padma Penumarthy 
<ppenumar...@mapr.com<mailto:ppenumar...@mapr.com>> wrote:


Drill hangout will be tomorrow, 10 AM PST.

In the last hangout, we talked about discussing one of the ongoing Drill 
projects in detail.
Please let me know who wants to volunteer to discuss the topic they are working 
on -
memory fragmentation, spill to disk for hash agg, external sort and schema 
change.

Also, please let me know if you have any topics you want to discuss by 
responding to this email.
We will also ask for topics at the beginning of the hangout.

Thanks,
Padma



[HANGOUT] Topics for 6/12/17

2017-06-12 Thread Padma Penumarthy

Drill hangout will be tomorrow, 10 AM PST.

In the last hangout, we talked about discussing one of the ongoing Drill 
projects in detail.
Please let me know who wants to volunteer to discuss the topic they are working 
on - 
 memory fragmentation, spill to disk for hash agg, external sort and schema 
change.

Also, please let me know if you have any topics you want to discuss by 
responding to this email. 
We will also ask for topics at the beginning of the hangout.

Thanks,
Padma

Re: Partitioning for parquet

2017-05-31 Thread Padma Penumarthy
Are you running same query on both tables ? What is the filter condition ?
Since they are partitioned differently, same filter may prune the files 
differently.
If possible, can you share query profiles ?
You can check query profiles to see how many rows are being read from disk
in both cases.

Thanks,
Padma


> On May 31, 2017, at 6:15 PM, Raz Baluchi  wrote:
> 
> As an experiment, I created an event file will 100 million entries spanning
> 25 years. I then created tables both ways, one partitioned by year and
> month and the other by date. The first table created 410 parquet files and
> the second 11837.
> 
> Querying the first table is consistently faster by a factor of 2x to 10x,
> 
> Is this because drill is not very efficient at querying a large number of
> small(ish) parquet files?
> 
> On Wed, May 31, 2017 at 6:42 PM, rahul challapalli <
> challapallira...@gmail.com> wrote:
> 
>> If most of your queries use date column in the filter condition, I would
>> partition the data on the date column. Then you can simply say
>> 
>> select * from events where `date` between '2016-11-11' and '2017-01-23';
>> 
>> - Rahul
>> 
>> On Wed, May 31, 2017 at 3:22 PM, Raz Baluchi 
>> wrote:
>> 
>>> So, if I understand you correctly, I would have to include the 'yr' and
>>> 'mnth' columns in addition to the 'date' column in the query?
>>> 
>>> e.g.
>>> 
>>> select * from events where yr in (2016, 2017)  and mnth in (11,12,1) and
>>> date between '2016-11-11' and '2017-01-23';
>>> 
>>> Is that correct?
>>> 
>>> On Wed, May 31, 2017 at 4:49 PM, rahul challapalli <
>>> challapallira...@gmail.com> wrote:
>>> 
 How to partition data is dependent on how you want to access your data.
>>> If
 you can foresee that most of the queries use year and month, then
>>> go-ahead
 and partition the data on those 2 columns. You can do that like below
 
 create table partitioned_data partition by (yr, mnth) as select
 extract(year from `date`) yr, extract(month from `date`) mnth, `date`,
  from mydata;
 
 For partitioning to have any benefit, your queries should have filters
>> on
 month and year columns.
 
 - Rahul
 
 On Wed, May 31, 2017 at 1:28 PM, Raz Baluchi 
 wrote:
 
> Hi all,
> 
> Trying to understand parquet partitioning works.
> 
> What is the recommended partitioning scheme for event data that will
>> be
> queried primarily by date. I assume that partitioning by year and
>> month
> would be optimal?
> 
> Lets say I have data that looks like:
> 
> application,status,date,message
> kafka,down,2017-03023 04:53,zookeeper is not available
> 
> 
> Would I have to create new columns for year and month?
> 
> e.g.
> application,status,date,message,year,month
> kafka,down,2017-03023 04:53,zookeeper is not available,2017,03
> 
> and then perform a CTAS using the year and month columns as the
 'partition
> by'?
> 
> Thanks
> 
 
>>> 
>> 



Re: creating tables in S3

2017-05-15 Thread Padma Penumarthy
I am wondering if the location information in the plugin configuration should be

  "location": "/drill-tmp" (instead of "location": "drill-tmp")


Thanks,
Padma


> On May 15, 2017, at 1:34 PM, Charles Givre  wrote:
> 
> Hi Michael,
> A few questions:
> 1.  Does the origin query work?
> 
> SELECT COLUMNS[0] x, COLUMNS[1] y FROM s3.`path/to/my.tbl`
> 
> One thing that jumps out at me is that I think "columns" has to be lower
> case.
> 
> 2. Did you set up the .tbl extension to read pipe separated files?  That
> also could be causing problems if not.
> 
> -- C
> 
> 
> On Mon, May 15, 2017 at 4:20 PM, Knapp, Michael <
> michael.kn...@capitalone.com> wrote:
> 
>> Hi,
>> 
>> So I have a directory full of pipe separated value files.  I was hoping to
>> convert these to parquet using Drill’s CTAS command. I tried this:
>> 
>> create table s3.tmp.`my_table` (x, y) as SELECT COLUMNS[0] x, COLUMNS[1] y
>> FROM s3.`path/to/my.tbl`
>> 
>> after a little time, I get this error:
>> 
>> org.apache.drill.common.exceptions.UserRemoteException: VALIDATION ERROR:
>> Schema [s3.tmp] is not valid with respect to either root schema or current
>> default schema. Current default schema: No default schema selected
>> 
>> I am able to query data from that S3 file.
>> 
>> This is my S3 plugin configuration:
>> 
>> {
>>  "type": "file",
>>  "enabled": true,
>>  "connection": "s3a://my_bucket",
>>  "config": null,
>>  "workspaces": {
>>"root": {
>>  "location": "/",
>>  "writable": false,
>>  "defaultInputFormat": null
>>},
>>"tmp": {
>>  "location": "drill-tmp",
>>  "writable": true,
>>  "defaultInputFormat": null
>>}
>>  },
>> 
>> 
>> I have created the directory “drill-tmp” in my bucket, and it is empty.
>> 
>> The file is a pipe separated value, so it does not have a schema.  Does
>> anybody know what I’m doing wrong or how to get this to work?
>> 
>> Michael Knapp
>> 
>> 
>> The information contained in this e-mail is confidential and/or
>> proprietary to Capital One and/or its affiliates and may only be used
>> solely in performance of work or services for Capital One. The information
>> transmitted herewith is intended only for use by the individual or entity
>> to which it is addressed. If the reader of this message is not the intended
>> recipient, you are hereby notified that any review, retransmission,
>> dissemination, distribution, copying or other use of, or taking of any
>> action in reliance upon this information is strictly prohibited. If you
>> have received this communication in error, please contact the sender and
>> delete the material from your computer.
>> 



Re: In-memory cache in Drill

2017-05-11 Thread Padma Penumarthy
I am not sure about data. But caching metadata, which is much smaller than
the actual data,
in memory will help with cutting down on planning time.
For example, for Parquet, metadata cache files are read from disk every time a query
is run.
Also, checking modification times (which is itself an expensive file system
operation)
on every query to make sure the metadata cache is not stale is an overhead.

Thanks,
Padma


> On May 10, 2017, at 1:26 PM, Paul Rogers  wrote:
> 
> Hi Michael,
> 
> Caching can help — depending on what is cached. The Hive plugin caches schema 
> information to avoid hitting Hive for each query that needs the schema.
> 
> If your data is small, then you can cache. Maybe you have a file that maps 
> county codes to countries: you’d have 200+ entries that are used over and 
> over. But, if your data is small, and comes from a file, then it is likely 
> that your OS or file system already caches the data for you. It still needs 
> to be copied from a file into vectors, but that is a low cost for a small 
> table.
> 
> Caching data in heap memory is fairly safe (if the data is small): the memory 
> will be reclaimed eventually via the normal Java mechanisms. You’d have to 
> work out a cache invalidation strategy, and come up with a modified storage 
> plugin that is cache-aware — not a trivial task.
> 
> Caching data in direct memory, across queries, is a whole new area of 
> exploration. You would have to learn how Drill’s reference counting works — a 
> bit of a project that would have to have huge benefit to justify the costs.
> 
> If data is large, then it is doubtful that caching will help. If the caching 
> simply prevents rerunning, say, a JDBC query, then the suggested temp table 
> route would be fine. Or, outsource the work from Drill and have a periodic 
> batch job that does an ETL from the original system to a file in your file 
> system, then let the file system caching work for you.
> 
> Further, Drill tries to optimize this case: Drill will determine if it is 
> faster to read the file once, on the node with the data, and ship the data to 
> all the nodes that need it, or whether it is faster to do a remote read on 
> all nodes (which makes more sense for small files.) Your caching strategy 
> would have to be aware of what the planner is doing to avoid working at 
> cross-purposes.
> 
> What is the use case you are trying to optimize?
> 
> Thanks,
> 
> - Paul
> 
> 
>> On May 10, 2017, at 9:56 AM, Kunal Khatua  wrote:
>> 
>> Not really :)
>> 
>> 
>> You get into the problem of having to deal with cache management. Once you 
>> start using memory to serve a cache for holding a table in-memory, you are 
>> sacrificing the memory resource for doing the actual computation. Also, 
>> Drill actually tries to work with Direct Memory and not heap. To work around 
>> this, you would then have to introduce a swapping policy, so as to reclaim 
>> the memory.
>> 
>> 
>> If you were to use Heap for storing the table in memory, then Drill will 
>> need to copy the data into DirectMemory to do useful work. So now you have 
>> about 2x the memory being used for the data!
>> 
>> 
>> If you are using HDFS (or MapR-FS), these filesystems themselves implement a 
>> cache management, so we are already leveraging (to a limited extent) the 
>> benefits of an in-memory cache.
>> 
>> 
>> 
>> 
>> From: Michael Shtelma 
>> Sent: Wednesday, May 10, 2017 9:44:50 AM
>> To: user@drill.apache.org
>> Subject: Re: In-memory cache in Drill
>> 
>> yes, for sure this is also the viable approach... but it would be far
>> better to be able to have the data also in memory..
>> Does it make sense to have something like an in-memory storage plugin?
>> In this case it can be also used as a storage for the temporary
>> tables.
>> Sincerely,
>> Michael Shtelma
>> 
>> 
>> On Wed, May 10, 2017 at 6:30 PM, Kunal Khatua  wrote:
>>> Drill does not cache data in memory because it introduces the risk of 
>>> dealing with stale data when working with data at a large scale.
>>> 
>>> 
>>> If you want to avoid hitting the actual storage repeatedly, one option is 
>>> to use the 'create temp table ' feature 
>>> (https://drill.apache.org/docs/create-temporary-table-as-cttas/). This 
>>> allows you to land the data to a local (or distributed) F, and use that 
>>> data storage instead. These tables are alive only for the lifetime of the 
>>> session (connection your client/SQLLine) makes to the Drill cluster.
>>> 
>>> 
>>> There is a second benefit of using this approach. You can translate the 
>>> original data source into a format that is highly suitable to what you are 
>>> doing with the data. For e.g., you could pull in data from an RDBMS or a 
>>> JSON store and write the temp table in parquet for performing analytics on.
>>> 
>>> 
>>> ~ Kunal
>>> 
>>> 
>>> From: Michael Shtelma 

Re: Parquet, Arrow, and Drill Roadmap

2017-05-02 Thread Padma Penumarthy
One thing I want to add is that use_new_reader uses the reader from the parquet-mr library,
whereas
the default one is Drill's native reader, which is supposed to be better
performance-wise.
But it does not support complex types, and we automatically switch to the
reader from the parquet library
when we have to read complex types.
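For reference, the switch between the two readers is the store.parquet.use_new_reader session option (a sketch):

  -- parquet-mr based reader:
  ALTER SESSION SET `store.parquet.use_new_reader` = true;
  -- back to the default (Drill's native reader):
  ALTER SESSION SET `store.parquet.use_new_reader` = false;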

Thanks,
Padma


On May 2, 2017, at 11:09 AM, Jinfeng Ni 
> wrote:


- What the two readers are (is one a special drill thing, is the other  a
standard reader from the parquet project?)
- What is the eventual goal here... to be able to use and switch between
both? To provide the option? To have code parity with another project?

Both readers were for reading parquet data into Drill's value vector.
The default one (when store.parquet.use_new_reader is false) was
faster (based on measurements done by people worked on the two
readers), but it could not support complex type like map/array.  The
new reader would be used by Drill either if you change the option to
true, or when the parquet data you are querying contain complex type
(even with the default option being false). Therefore, both readers
might be used by Drill code.

There was a Parquet hackathon some time ago, which aimed to make
people in different projects using parquet work together to
standardize a vectorized reader. I did not keep track of that effort.
People with better knowledge of that may share their inputs.


- Do either of the readers work with Arrow?

For now, neither works with Arrow, since Drill has not integrated with
Arrow yet. See DRILL-4455 for the latest discussion
(https://issues.apache.org/jira/browse/DRILL-4455).  I would expect
Drill's parquet reader will work with Arrow, once the integration is
done.



Re: [Drill 1.10.0] : Memory was leaked by query

2017-04-18 Thread Padma Penumarthy
Seems like you are running into  
DRILL-5435.
Try  turning off async parquet reader and see if that helps.
alter session set `store.parquet.reader.pagereader.async`=false;

Thanks,
Padma


On Apr 18, 2017, at 6:14 AM, Anup Tiwari 
> wrote:

Hi Team,

Please find following information :

*Cluster configuration :*
Number of Nodes : 5
Cores/Node : 8
RAM : 32

*Variable values :*
planner.width.max_per_node = 5
planner.width.max_per_query = 30
planner.memory.max_query_memory_per_node = 4294967296

I am getting the following error on a simple select statement; it occurs 6
times out of 10. Let me know if I am missing anything:

*Query :*
select udf_channel,uid from dfs.tmp.tt1 where (event = 'ajax' and ajaxurl
like '%/j_check%' and ajaxResponse like '%success%true%') limit 5;

*Error :*

ERROR o.a.d.e.w.fragment.FragmentExecutor - SYSTEM ERROR:
IllegalStateException: Memory was leaked by query. Memory leaked: (1048576)
Allocator(op:1:24:6:ParquetRowGroupScan)
100/1048576/27140096/100 (res/actual/peak/limit)


Fragment 1:24

[Error Id: a54cc1bf-794a-4143-bd82-0dd5fa3c8f52 on
prod-hadoop-101.bom-prod.aws.games24x7.com:31010]
org.apache.drill.common.exceptions.UserException: SYSTEM ERROR:
IllegalStateException: Memory was leaked by query. Memory leaked: (1048576)
Allocator(op:1:24:6:ParquetRowGroupScan)
100/1048576/27140096/100 (res/actual/peak/limit)


Fragment 1:24

[Error Id: a54cc1bf-794a-4143-bd82-0dd5fa3c8f52 on
prod-hadoop-101.bom-prod.aws.games24x7.com:31010]
   at
org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:544)
~[drill-common-1.10.0.jar:1.10.0]
   at
org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:293)
[drill-java-exec-1.10.0.jar:1.10.0]
   at
org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:160)
[drill-java-exec-1.10.0.jar:1.10.0]
   at
org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:262)
[drill-java-exec-1.10.0.jar:1.10.0]
   at
org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
[drill-common-1.10.0.jar:1.10.0]
   at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[na:1.8.0_72]
   at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[na:1.8.0_72]
   at java.lang.Thread.run(Thread.java:745) [na:1.8.0_72]
Caused by: java.lang.IllegalStateException: Memory was leaked by query.
Memory leaked: (1048576)
Allocator(op:1:24:6:ParquetRowGroupScan)
100/1048576/27140096/100 (res/actual/peak/limit)

   at
org.apache.drill.exec.memory.BaseAllocator.close(BaseAllocator.java:502)
~[drill-memory-base-1.10.0.jar:1.10.0]
   at
org.apache.drill.exec.ops.OperatorContextImpl.close(OperatorContextImpl.java:149)
~[drill-java-exec-1.10.0.jar:1.10.0]
   at
org.apache.drill.exec.ops.FragmentContext.suppressingClose(FragmentContext.java:422)
~[drill-java-exec-1.10.0.jar:1.10.0]
   at
org.apache.drill.exec.ops.FragmentContext.close(FragmentContext.java:411)
~[drill-java-exec-1.10.0.jar:1.10.0]
   at
org.apache.drill.exec.work.fragment.FragmentExecutor.closeOutResources(FragmentExecutor.java:318)
[drill-java-exec-1.10.0.jar:1.10.0]
   at
org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:155)
[drill-java-exec-1.10.0.jar:1.10.0]
   ... 5 common frames omitted
2017-04-18 18:21:54,172 [BitServer-4] INFO
o.a.d.e.w.fragment.FragmentExecutor -
2709f415-c08a-13b9-9f05-fcf9008c484f:1:21: State change requested RUNNING
--> CANCELLATION_REQUESTED
2017-04-18 18:21:54,172 [BitServer-4] INFO
o.a.d.e.w.f.FragmentStatusReporter -
2709f415-c08a-13b9-9f05-fcf9008c484f:1:21: State to report:
CANCELLATION_REQUESTED
2017-04-18 18:21:54,173 [BitServer-4] WARN
o.a.d.e.w.b.ControlMessageHandler - Dropping request to cancel fragment.
2709f415-c08a-13b9-9f05-fcf9008c484f:1:24 does not exist.
2017-04-18 18:21:54,229 [2709f415-c08a-13b9-9f05-fcf9008c484f:frag:1:21]
INFO  o.a.d.e.w.fragment.FragmentExecutor -
2709f415-c08a-13b9-9f05-fcf9008c484f:1:21: State change requested
CANCELLATION_REQUESTED --> FAILED
2017-04-18 18:21:54,229 [2709f415-c08a-13b9-9f05-fcf9008c484f:frag:1:21]
INFO  o.a.d.e.w.fragment.FragmentExecutor -
2709f415-c08a-13b9-9f05-fcf9008c484f:1:21: State change requested FAILED
--> FAILED
2017-04-18 18:21:54,229 [2709f415-c08a-13b9-9f05-fcf9008c484f:frag:1:21]
INFO  o.a.d.e.w.fragment.FragmentExecutor -
2709f415-c08a-13b9-9f05-fcf9008c484f:1:21: State change requested FAILED
--> FINISHED
2017-04-18 18:21:54,230 [2709f415-c08a-13b9-9f05-fcf9008c484f:frag:1:21]
ERROR o.a.d.e.w.fragment.FragmentExecutor - SYSTEM ERROR:
IllegalStateException: Memory was leaked by query. Memory leaked: (1048576)
Allocator(op:1:21:6:ParquetRowGroupScan)

Re: NPE When Selecting from MapR-DB Table

2017-04-06 Thread Padma Penumarthy
Can you send the query profile ? How is your data distributed i.e. 
how big is the table and how many regions and avg row count per region ? 

This problem can happen when we don’t have a minor fragment (for scanning) 
scheduled on
a node which is hosting one or more hbase regions.  That can happen if we do 
not have
enough work to do (based on total rowCount and slice target) to schedule 
fragments on all nodes. 
One thing you can try is to lower the slice target so that we create fragments on all
nodes.
Depending upon your configuration, making it close to the average rowCount per
region might
be the ideal thing to do.
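As a sketch (the option name and the value are assumptions; set it close to your average rows per region):

  ALTER SESSION SET `planner.slice_target` = 100000;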

Thanks,
Padma


> On Apr 6, 2017, at 12:03 PM, John Omernik  wrote:
> 
> By the way, are there any ways to manually prod the data to make it so the
> queries work again? It seems like an off by one type issue, can I add
> something to my data make it work?
> 
> On Thu, Apr 6, 2017 at 2:03 PM, John Omernik  wrote:
> 
>> Oh nice, 1.10 from MapR has the fix? Great... Looking forward to that!
>> Thanks!
>> 
>> On Thu, Apr 6, 2017 at 1:59 PM, Abhishek Girish 
>> wrote:
>> 
>>> I'm guessing the next release of Apache Drill could be a few months away.
>>> MapR Drill 1.10.0 release (which does contain the fix for DRILL-5395)
>>> should be out shortly, within the next week or so.
>>> 
>>> On Thu, Apr 6, 2017 at 11:50 AM, John Omernik  wrote:
>>> 
 Nope no other issues. I was waiting on 1.10 to be available from MapR,
>>> do
 we know the release date for 1.11?
 
 On Thu, Apr 6, 2017 at 1:31 PM, Abhishek Girish 
 wrote:
 
> Are there any other issues apart from the one being discussed? Builds
 from
> Apache should work with MapR-DB tables (when built with mapr profile).
 Let
> us know if you are having any trouble.
> 
> The fix for DRILL-5395 should be available this week, afaik. You could
 also
> build off Padma's branch if you need it urgently.
> 
> On Thu, Apr 6, 2017 at 11:25 AM, John Omernik 
>>> wrote:
> 
>> Is there any work around except wait? This is unfortunate...  I
>>> wonder
> if I
>> could beg a MapR build off the MapR team if I offered them
 beer/cookies.
> (
>> I have been unable to get MapR Tables to work with Builds off the
 Apache
>> main line)
>> 
>> 
>> 
>> 
>> 
>> On Thu, Apr 6, 2017 at 1:06 PM, Abhishek Girish > wrote:
>> 
>>> Could be related to DRILL-5395
>>> . Once
>>> committed,
> the
>>> fix
>>> should be available in Apache master.
>>> 
>>> On Thu, Apr 6, 2017 at 10:56 AM, John Omernik 
> wrote:
>>> 
 Hello all, I am using Drill 1.8 and MapR 5.2. I just finished a
 large
>>> load
 of data into a mapr table. I was able to confirm that the table
> returns
 data from the c api for hbase, so no issue there, however, when
>>> I
>> select
 from the table in Drill, either from the table directly, or
>>> from a
>> view I
 created, then I get the NPE as listed below. Any advice on how
>>> to
 troubleshoot further would be appreciated!
 
 
 0: jdbc:drill:zk:zeta2.brewingintel.com:5181,> select * from
> maprpcaps
 limit 1;
 
 Error: SYSTEM ERROR: NullPointerException
 
 
 
 [Error Id: 6abaeadb-6e1b-4dce-9d1f-54b99a40becb on
 zeta4.brewingintel.com:20005]
 
 
  (org.apache.drill.exec.work.foreman.ForemanException)
>>> Unexpected
 exception during fragment initialization: null
 
org.apache.drill.exec.work.foreman.Foreman.run():281
 
java.util.concurrent.ThreadPoolExecutor.runWorker():1142
 
java.util.concurrent.ThreadPoolExecutor$Worker.run():617
 
java.lang.Thread.run():745
 
  Caused By (java.lang.NullPointerException) null
 
 
 org.apache.drill.exec.store.mapr.db.MapRDBGroupScan.
>>> applyAssignments():205
 
 
 org.apache.drill.exec.planner.fragment.Wrapper$
 AssignEndpointsToScanAndStore.visitGroupScan():116
 
 
 org.apache.drill.exec.planner.fragment.Wrapper$
 AssignEndpointsToScanAndStore.visitGroupScan():103
 
org.apache.drill.exec.physical.base.
> AbstractGroupScan.accept():63
 
 
 org.apache.drill.exec.physical.base.AbstractPhysicalVisitor.
 visitChildren():138
 
 
 org.apache.drill.exec.planner.fragment.Wrapper$
 AssignEndpointsToScanAndStore.visitOp():134
 
 
 org.apache.drill.exec.planner.fragment.Wrapper$
 

Re: Wrong property value

2017-03-27 Thread Padma Penumarthy
Yes, you are right. We need to update the documentation with 
the correct option name.  Thanks for bringing it up.
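i.e., with the name as it actually exists, something like:

  ALTER SESSION SET `store.parquet.reader.int96_as_timestamp` = true;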

Thanks,
Padma


> On Mar 27, 2017, at 1:50 AM, Muhammad Gelbana  wrote:
> 
> According to this page
> , Drill can
> implicitly interpret the INT96 timestamp data type in Parquet files after
> setting the *store.parquet.int96_as_timestamp* option to *true*.
> 
> I believe the option name should be
> *store.parquet.reader.int96_as_timestamp*
> 
> Or did I miss something ?
> 
> *-*
> *Muhammad Gelbana*
> http://www.linkedin.com/in/mgelbana



Re: Explain Plan for Parquet data is taking a lot of timre

2017-03-05 Thread Padma Penumarthy
> The query response time is found as below:
> 2 node cluster - 13min
> 5 node cluster - 19min
> 
> I was expecting 5 node cluster to be faster, but the results say otherwise.
> In the query profile, as expected, 5 node cluster has more minor fragments, 
> but still the scan time is higher. Attached the json profile for both.
> Is this in anyway related to the max batches/max records for row group scan?
> 
> Any suggestions on how we can get better response time in the 5 node cluster 
> is appreciated.
> 
> Regards
> Jeena
> 
> -Original Message-
> From: Jeena Vinod
> Sent: Sunday, February 26, 2017 2:22 AM
> To: user@drill.apache.org
> Subject: RE: Explain Plan for Parquet data is taking a lot of time
> 
> Please find attached the full JSON profile.
> 
> Regards
> Jeena
> 
> -Original Message-
> From: Padma Penumarthy [mailto:ppenumar...@mapr.com]
> Sent: Saturday, February 25, 2017 3:31 AM
> To: user@drill.apache.org
> Subject: Re: Explain Plan for Parquet data is taking a lot of time
> 
> Yes, please do send the JSON profile.
> 
> Thanks,
> Padma
> 
>> On Feb 24, 2017, at 1:56 PM, Jeena Vinod <jeena.vi...@oracle.com> wrote:
>> 
>> Thanks for the suggestions.
>> 
>> I did run REFRESH TABLE METADATA command on this path before firing select 
>> query.
>> 
>> In Drill 1.9, there is an improvement in performance. I have 1.9 setup on a 
>> 2 node 16GB cluster and here select * with limit 100 is taking less time 
>> than 1.8, though the number of rows in ParquetGroupScan remains unchanged. 
>> Select query is taking around 8 minutes and explain plan took around 7 
>> minutes. Also in the Web console profile, the query stays in the STARTING 
>> status for almost 7 minutes.
>> 
>> Query Plan for 1.9:
>> 00-00  Screen : rowType = RecordType(ANY *): rowcount = 100.0, cumulative 
>> cost = {32810.0 rows, 33110.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 
>> 1721
>> 00-01  Project(*=[$0]) : rowType = RecordType(ANY *): rowcount = 100.0, 
>> cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io, 0.0 network, 0.0 
>> memory}, id = 1720
>> 00-02  SelectionVectorRemover : rowType = (DrillRecordRow[*]): 
>> rowcount = 100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io, 0.0 
>> network, 0.0 memory}, id = 1719
>> 00-03  Limit(fetch=[100]) : rowType = (DrillRecordRow[*]): rowcount 
>> = 100.0, cumulative cost = {32700.0 rows, 33000.0 cpu, 0.0 io, 0.0 network, 
>> 0.0 memory}, id = 1718
>> 00-04  Scan(groupscan=[ParquetGroupScan 
>> [entries=[ReadEntryWithPath 
>> [path=/testdata/part-r-0-097f7399-7bfb-4e93-b883-3348655fc658.parquet]], 
>> selectionRoot=/testdata, numFiles=1, usedMetadataFile=true, 
>> cacheFileRoot=/testdata, columns=[`*`]]]) : rowType = (DrillRecordRow[*]): 
>> rowcount = 32600.0, cumulative cost = {32600.0 rows, 32600.0 cpu, 0.0 io, 
>> 0.0 network, 0.0 memory}, id = 1717
>> 
>> And from the query profile, it looks like the most time is spent in 
>> PARQUET_ROW_GROUP_SCAN. I can attach the full JSON profile if it helps.
>> 
>> Can there be further improvement in performance with 1.9?
>> 
>> Regards
>> Jeena
>> 
>> 
>> -Original Message-
>> From: Padma Penumarthy [mailto:ppenumar...@mapr.com]
>> Sent: Friday, February 24, 2017 11:22 PM
>> To: user@drill.apache.org
>> Subject: Re: Explain Plan for Parquet data is taking a lot of time
>> 
>> Yes, limit is pushed down to parquet reader in 1.9. But, that will not help 
>> with planning time.
>> It is definitely worth trying with 1.9 though.
>> 
>> Thanks,
>> Padma
>> 
>> 
>>> On Feb 24, 2017, at 7:26 AM, Andries Engelbrecht <aengelbre...@mapr.com> 
>>> wrote:
>>> 
>>> Looks like the metadata cache is being used  "usedMetadataFile=true, ". But 
>>> to be sure did you perform a REFRESH TABLE METADATA  on the 
>>> parquet data?
>>> 
>>> 
>>> However it looks like it is reading a full batch " rowcount = 32600.0, 
>>> cumulative cost = {32600.0 rows, 32600.0"
>>> 
>>> 
>>> Didn't the limit operator get pushed down to the parquet reader in 1.9?
>>> 
>>> Perhaps try 1.9 and see if in the ParquetGroupScan the number of rows gets 
>>> reduced to 100.
>>> 
>>> 
>>> Can you look in the query profile where time is spent, and also how long it 
>>> takes before the query starts to run in the WebUI profile.
>>> 
>>> 
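
To see whether the limit actually reaches the scan on a given build, the plan itself is the quickest check, as Andries suggests above. A minimal sketch, assuming a dfs workspace named root that points at the /testdata directory from this thread:

  -- In 1.9 the LIMIT may be pushed down to the Parquet reader; check whether
  -- the rowcount reported on the ParquetGroupScan drops towards 100 or stays
  -- at the full estimate (32600 in the plans above).
  EXPLAIN PLAN FOR
  SELECT *
  FROM dfs.root.`testdata`
  LIMIT 100;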

Re: Explain Plan for Parquet data is taking a lot of time

2017-02-24 Thread Padma Penumarthy
Yes, limit is pushed down to parquet reader in 1.9. But, that will not help 
with planning time. 
It is definitely worth trying with 1.9 though.

Thanks,
Padma


> On Feb 24, 2017, at 7:26 AM, Andries Engelbrecht  
> wrote:
> 
> Looks like the metadata cache is being used  "usedMetadataFile=true, ". But 
> to be sure did you perform a REFRESH TABLE METADATA  on the 
> parquet data?
> 
> 
> However it looks like it is reading a full batch " rowcount = 32600.0, 
> cumulative cost = {32600.0 rows, 32600.0"
> 
> 
> Didn't the limit operator get pushed down to the parquet reader in 1.9?
> 
> Perhaps try 1.9 and see if in the ParquetGroupScan the number of rows gets 
> reduced to 100.
> 
> 
> Can you look in the query profile where time is spent, and also how long it takes 
> before the query starts to run in the WebUI profile.
> 
> 
> Best Regards
> 
> 
> Andries Engelbrecht
> 
> 
> Senior Solutions Architect
> 
> MapR Alliances and Channels Engineering
> 
> 
> aengelbre...@mapr.com
> 
> 
> 
> 
> From: Jinfeng Ni 
> Sent: Thursday, February 23, 2017 4:53:34 PM
> To: user
> Subject: Re: Explain Plan for Parquet data is taking a lot of time
> 
> The reason the plan shows only a single parquet file is that
> "LIMIT 100" is applied and filters out the rest of them.
> 
> Agreed that parquet metadata caching might help reduce planning time,
> when there are a large number of parquet files.
> 
> On Thu, Feb 23, 2017 at 4:44 PM, rahul challapalli
>  wrote:
>> You said there are 2144 parquet files but the plan suggests that you only
>> have a single parquet file. In any case it's a long time to plan the query.
>> Did you try the metadata caching feature [1]?
>> 
>> Also how many rowgroups and columns are present in the parquet file?
>> 
>> [1] https://drill.apache.org/docs/optimizing-parquet-metadata-reading/
>> 
>> - Rahul
>> 
>> On Thu, Feb 23, 2017 at 4:24 PM, Jeena Vinod  wrote:
>> 
>>> Hi,
>>> 
>>> 
>>> 
>>> Drill is taking 23 minutes for a simple select * query with limit 100 on
>>> 1GB uncompressed parquet data. EXPLAIN PLAN for this query is also taking
>>> that long(~23 minutes).
>>> 
>>> Query: select * from .root.`testdata` limit 100;
>>> 
>>> Query  Plan:
>>> 
>>> 00-00  Screen : rowType = RecordType(ANY *): rowcount = 100.0,
>>> cumulative cost = {32810.0 rows, 33110.0 cpu, 0.0 io, 0.0 network, 0.0
>>> memory}, id = 1429
>>> 
>>> 00-01  Project(*=[$0]) : rowType = RecordType(ANY *): rowcount =
>>> 100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io, 0.0 network,
>>> 0.0 memory}, id = 1428
>>> 
>>> 00-02  SelectionVectorRemover : rowType = (DrillRecordRow[*]):
>>> rowcount = 100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io, 0.0
>>> network, 0.0 memory}, id = 1427
>>> 
>>> 00-03  Limit(fetch=[100]) : rowType = (DrillRecordRow[*]):
>>> rowcount = 100.0, cumulative cost = {32700.0 rows, 33000.0 cpu, 0.0 io, 0.0
>>> network, 0.0 memory}, id = 1426
>>> 
>>> 00-04  Scan(groupscan=[ParquetGroupScan
>>> [entries=[ReadEntryWithPath [path=/testdata/part-r-0-
>>> 097f7399-7bfb-4e93-b883-3348655fc658.parquet]], selectionRoot=/testdata,
>>> numFiles=1, usedMetadataFile=true, cacheFileRoot=/testdata,
>>> columns=[`*`]]]) : rowType = (DrillRecordRow[*]): rowcount = 32600.0,
>>> cumulative cost = {32600.0 rows, 32600.0 cpu, 0.0 io, 0.0 network, 0.0
>>> memory}, id = 1425
>>> 
>>> 
>>> 
>>> I am using Drill 1.8 and it is set up on a 5 node 32GB cluster, and the data is
>>> in Oracle Storage Cloud Service. When I run the same query on a 1GB TSV file
>>> in this location it takes only 38 seconds.
>>> 
>>> Also testdata contains around 2144 .parquet files each around 500KB.
>>> 
>>> 
>>> 
>>> Is there any additional configuration required for parquet?
>>> 
>>> Kindly suggest how to improve the response time here.
>>> 
>>> 
>>> 
>>> Regards
>>> Jeena
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
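
As a concrete form of the metadata caching suggestion in [1] above, a minimal sketch (the dfs.root workspace name is an assumption; use whichever plugin and workspace point at /testdata). Note that the plan above already reports usedMetadataFile=true, so in this particular case the cache was already in place:

  -- Build (or rebuild) the Parquet metadata cache for the directory so that
  -- planning does not have to open the footer of each of the ~2144 files.
  REFRESH TABLE METADATA dfs.root.`testdata`;

  -- Re-check the plan afterwards; usedMetadataFile=true on the ParquetGroupScan
  -- indicates the cache file is being picked up.
  EXPLAIN PLAN FOR SELECT * FROM dfs.root.`testdata` LIMIT 100;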



Re: Drill query planning taking a LONG time

2017-02-17 Thread Padma Penumarthy
We have a JIRA for this issue; hopefully it will be fixed in the next release.
https://issues.apache.org/jira/browse/DRILL-5089

Thanks,
Padma


On Feb 17, 2017, at 1:50 PM, David Kincaid  wrote:

My apologies for not following up sooner. Earlier this week our DevOps
engineer was looking into this problem as well and discovered the root
cause of our issue. We developed a custom storage provider that utilizes S3
as the pstore. We thought this was just storing configuration information
(esp. storage plugin config), but we discovered that it was spending a lot
of time reading files in a /temp/drill subdirectory of our S3 bucket. We
removed the custom plugin and things are running much better now.

I have one of our engineers working on this now to see where we went wrong.
My question for the list now is if you know what exactly it is doing. We
really want to be able to store the storage plugin config on S3 so that it
is persisted between restarts of the EMR cluster that we are running Drill
on. If you have any suggestions or advice, it would be much appreciated.

I really appreciate all the time and patience you all showed helping us
troubleshoot this issue. I'm glad in the end that it really was something
on our end and not something more mysterious happening in Drill itself.

Thanks,

Dave

On Wed, Feb 15, 2017 at 12:37 PM, Jinfeng Ni  wrote:

Can you help try one more thing if you can?

Run jstack on the foreman Drillbit process while the query is doing
the query planning. Capture the jstack every one second or couple of
seconds consecutively for some time, by appending the jstack output
into one log file. Take a look at the stack trace for the foreman
thread, in the form of "275b623b-bb15-8bd8-fd29-f9a571a7534e:foreman"
(the first part is the query ID). If the foreman thread is stuck in one
method call, it may show up in the log repeatedly. In this way we
may have a better idea what's the cause of the problem.

Based on the tests you tried, the combination of the query / parquet
files probably hit a bug in the code that we are not aware of
currently. Without the parquet files to reproduce with, it's hard to
debug the issue and find a possible fix.



On Wed, Feb 15, 2017 at 8:35 AM, David Kincaid  wrote:
I ran that EXPLAIN that you suggested against the entire 100 file table and
it takes about 3 seconds. I will try to get a defect written up in the next
few days.

- Dave

On Tue, Feb 14, 2017 at 9:06 PM, Jinfeng Ni  wrote:

From the two tests you did, I'm inclined to think there might be some
special things in your parquet files. How do you generate these
parquet files? Do they contain normal data type (int/float/varchar),
or complex type (array/map)?

In our environment, we also have hundreds of parquet files, each with
size ~ hundreds of MBs. A typical query (several tables joined) would
take a couple of seconds in planning.

One more test if you can help run.

EXPLAIN PLAN FOR
SELECT someCol1, someCol2
FROM dfs.`parquet/transaction/OneSingleFile.parquet`;

The above query is simple enough that planner should not spend long
time in enumerating different choices. If it still takes a long time for
query planning, the more likely cause might be in the parquet files you
used.



On Tue, Feb 14, 2017 at 1:06 PM, David Kincaid  wrote:
I will write up a defect. The first test you suggested below - running the
query on just one of our Parquet files produces the same result (10-12
minute planning time). However, the second test - using
cp.`tpch/nation.parquet` - results in a planning time of only about a
minute. So, I'm not sure how to interpret that. What does that mean to you
all?

- Dave

On Tue, Feb 14, 2017 at 12:37 PM, Jinfeng Ni  wrote:

Normally, slow query planning could be caused by:

1. Some planner rule hit a bug when processing certain operators in
the query, for instance join operator, distinct aggregate.  The query
I tried on a small file seems to rule out this possibility.
2. The parquet metadata access time. According to the log, this does
not seem to be the issue.
3. Something we are not aware of.

To help get some clue, can you help do the following:
1. run the query over one single parquet file, instead of 100
parquet files? You can change to using
dfs.`parquet/transaction/OneSingleFile.parquet`. I'm wondering if the
planning time is proportional to the # of parquet files.

2. What if you try your query by replacing
dfs.`parquet/transaction/OneSingleFile.parquet` with
cp.`tpch/nation.parquet`, which is a small tpch parquet file (you need to
re-enable the storage plugin 'cp')? Running EXPLAIN should be fine. This
will tell us if the problem is caused by the parquet source, or the
query itself.

Yes, please create a 
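
Putting Jinfeng's two suggested tests side by side as runnable statements (the dfs path and the someCol1/someCol2 column names are the placeholders used in this thread, not real objects):

  -- Test 1: plan against a single file of the problem dataset; if planning is
  -- still slow here, the parquet files themselves are the likely cause.
  EXPLAIN PLAN FOR
  SELECT someCol1, someCol2
  FROM dfs.`parquet/transaction/OneSingleFile.parquet`;

  -- Test 2: plan a query against the small sample file on Drill's classpath
  -- (the 'cp' storage plugin must be enabled); if this is fast while Test 1
  -- is slow, the problem more likely lies with the parquet source than with
  -- the query itself.
  EXPLAIN PLAN FOR
  SELECT *
  FROM cp.`tpch/nation.parquet`;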

Re: Slow query on parquet imported from SQL Server while the external SQL server is down.

2016-12-01 Thread Padma Penumarthy
Yes, for every query, we build the schema tree by trying to initialize
all storage plugins and the workspaces in them, regardless of schema configuration 
and/or applicability to the data being queried. Go ahead and file a JIRA.
We are looking into fixing this.

Thanks,
Padma
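
A rough way to observe the schema-tree construction described above, independent of the parquet data, is a metadata query that forces Drill to enumerate every enabled plugin's schemas. This is a sketch based on the behaviour discussed in this thread, not a documented diagnostic:

  -- Lists every schema Drill registers; if an enabled plugin's backend (for
  -- example the SQL Server behind the jdbc plugin) is unreachable, this tends
  -- to stall in the same way the parquet query does.
  SELECT SCHEMA_NAME, TYPE
  FROM INFORMATION_SCHEMA.SCHEMATA;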


> On Dec 1, 2016, at 8:48 AM, Abhishek Girish  wrote:
> 
> AFAIK, should apply to all queries, irrespective of the source of the data
> or the plugins involved within the query. So when this issue occurs, I
> would expect any query to take long to execute.
> 
> On Thu, Dec 1, 2016 at 5:47 AM John Omernik  wrote:
> 
>> @Abhishek,
>> 
>> Do you think the issue is related to any storage plugin that is enabled and
>> not available as it applies to all queries?  I guess if it's an issue where
>> all queries are slow because the foreman is waiting to initialize ALL
>> storage plugins, regardless of their applicability to the queried data,
>> then that is a more general issue (that should still be resolved, does the
>> foreman need to initialize all plugins before querying specific data?)
>> However, I am still concerned that the query on the CTAS parquet data is
>> specifically slower because of its source.  @Rahul could you test a
>> different Parquet table, NOT loaded from the SQL server, to see if
>> enabling or disabling the JDBC storage plugin (with the server unavailable)
>> has any impact?  Basically, I want to ensure that data that is created as a
>> Parquet table via CTAS is 100% free of any links to the source data. This
>> is EXTREMELY important.
>> 
>> John
>> 
>> 
>> 
>> On Thu, Dec 1, 2016 at 12:46 AM, Abhishek Girish <
>> abhishek.gir...@gmail.com>
>> wrote:
>> 
>>> Thanks for the update, Rahul!
>>> 
>>> On Wed, Nov 30, 2016 at 9:45 PM Rahul Raj <rahul@option3consulting.com> wrote:
>>> 
 Abhishek,
 
 Your observation is correct, we just verified that:
 
   1. The queries run as expected (faster) with the JDBC plugin disabled.
   2. Queries run as expected when the plugin's datasource is running.
   3. With the datasource down, queries run very slowly, waiting for the
   connection to fail.
 
 Rahul
 
 On Thu, Dec 1, 2016 at 10:07 AM, Abhishek Girish <
 abhishek.gir...@gmail.com>
 wrote:
 
> @John,
> 
> I agree that this should work. While I am not certain, I don't think the
> issue is specific to a particular plugin, but rather the way that, in a
> query's lifecycle, the foreman attempts to initialize every enabled storage
> plugin before proceeding to execute the query. So when a particular plugin
> isn't configured correctly or the underlying datasource is not up, this
> could drastically slow down the query execution time.
> 
> I'll look up to see if we have a JIRA for this already - if not, I will
> file one.
> 
> On Wed, Nov 30, 2016 at 8:12 AM, John Omernik 
>>> wrote:
> 
>> So just my opinion in reading this thread (sorry for swooping in and
>> opining).
>> 
>> If a CTAS is done from any data source into Parquet files, there should
>> be NO dependency on the original data source to query the resultant
>> Parquet files. As a Drill user and as a Drill admin, this breaks the
>> concept of least surprise. If I take data from one source and create
>> Parquet files in a distributed file system, it should just work. If there
>> are "issues" with JDBC plugins or the HBase/Hive plugins in a similar
>> manner, these need to be hunted down by a large group of villagers with
>> pitchforks and torches. I just can't see how this could be acceptable at
>> any level. The whole idea of Parquet files is that they are
>> self-describing, schema-included files, thus a read of a directory of
>> Parquet files should have NO dependencies on anything but the parquet
>> files... even the Parquet "additions" (such as the METADATA Cache) should
>> be a fail-open thing... if it exists, great, use it, speed things up, but
>> if it doesn't, read the parquet files as normal (which I believe is how
>> it operates)
>> 
>> John
>> 
>> On Wed, Nov 30, 2016 at 12:12 AM, Abhishek Girish <abhishek.gir...@gmail.com> wrote:
>> 
>>> Can you attempt to disable the jdbc plugin (configured with SQLServer)
>>> and try the query (on parquet) when SQL Server is offline?
>>> 
>>> I've seen a similar issue previously when the HBase / Hive plugin was
>>> enabled but either the plugin configuration was wrong or the underlying
>>> data source was down.
>>> 
>>> On Fri, Nov 25, 2016 at 3:21 AM, Rahul Raj <rahul@option3consulting.com> wrote:
>>> 
 I have created a parquet file using CTAS from a MS SQL Server. The query

Re: Apache Drill Error

2016-09-21 Thread Padma Penumarthy
Sorry, I meant Try (not Trying)

> On Sep 21, 2016, at 10:08 AM, Padma Penumarthy <ppenumar...@maprtech.com> 
> wrote:
> 
> Trying commenting out the following line in 
> distribution/target/apache-drill-1.9.0-SNAPSHOT/apache-drill-1.9.0-SNAPSHOT/bin/drill-config.sh
> 
> export JAVA_HOME
> 
> Thanks,
> Padma
> 
> 
>> On Sep 20, 2016, at 7:17 PM, Rajasimman Selvaraj <rajasimm...@gmail.com 
>> <mailto:rajasimm...@gmail.com>> wrote:
>> 
>> Hi ,
>>   I am getting the following error while starting the drill in Mac OS for 
>> the first time.
>> Unable to locate an executable at 
>> "/System/Library/Frameworks/JavaVM.framework/Versions/A/bin/java" (-1)
>> 
>> I have the java JDK installed in my machine as listed below.
>> RajaSimmans-MBP:apache-drill-1.8.0 Hadoop$ java -version
>> java version "1.8.0_101"
>> Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
>> Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)
>> RajaSimmans-MBP:apache-drill-1.8.0 Hadoop$ 
>> 
>> Please help me to get the drill start and running in my Mac OS.
>> 
>> Regards,
>> S.RajaSimman.
> 



Re: Apache Drill Error

2016-09-21 Thread Padma Penumarthy
Trying commenting out the following line in 
distribution/target/apache-drill-1.9.0-SNAPSHOT/apache-drill-1.9.0-SNAPSHOT/bin/drill-config.sh

export JAVA_HOME

Thanks,
Padma


> On Sep 20, 2016, at 7:17 PM, Rajasimman Selvaraj  
> wrote:
> 
> Hi ,
>   I am getting the following error while starting the drill in Mac OS for the 
> first time.
> Unable to locate an executable at 
> "/System/Library/Frameworks/JavaVM.framework/Versions/A/bin/java" (-1)
> 
> I have the java JDK installed in my machine as listed below.
> RajaSimmans-MBP:apache-drill-1.8.0 Hadoop$ java -version
> java version "1.8.0_101"
> Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)
> RajaSimmans-MBP:apache-drill-1.8.0 Hadoop$ 
> 
> Please help me to get the drill start and running in my Mac OS.
> 
> Regards,
> S.RajaSimman.