Re: Help with Apache Drill - S3 compatible storage connectivity

2018-06-26 Thread Parth Chandra
Drill uses HDFS to access S3, so if you have configured the EMC system to
be usable by Hadoop, it will be usable by Drill.
Here's the documentation for an S3 compatible EMC system (
https://www.emc.com/collateral/TechnicalDocument/docu86295.pdf); chapters
6-11 are relevant.  I'm not sure if this is the same system you have, but
your system should have similar documentation.
You will probably have to use a different protocol identifier in the URL to
access the storage system.
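
As an illustration only (the endpoint, credentials, and bucket below are
placeholders, and your EMC system may need additional properties), the s3a
settings in core-site.xml for path-style access would look something like:

  <configuration>
    <property>
      <name>fs.s3a.endpoint</name>
      <value>http://ecs.example.com:9020</value>
    </property>
    <property>
      <name>fs.s3a.access.key</name>
      <value>YOUR_ACCESS_KEY</value>
    </property>
    <property>
      <name>fs.s3a.secret.key</name>
      <value>YOUR_SECRET_KEY</value>
    </property>
    <!-- use path-style addressing instead of virtual-hosted-style URLs -->
    <property>
      <name>fs.s3a.path.style.access</name>
      <value>true</value>
    </property>
  </configuration>

The storage plugin's "connection" would then point at the bucket using the
s3a scheme, e.g. "s3a://your-bucket".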




On Tue, Jun 26, 2018 at 11:30 AM, dummy id  wrote:

> Can I get an update on this, please?
>
> On Fri, Jun 15, 2018 at 11:36 AM, dummy id  wrote:
>
> > Team,
> >
> > I am not sure who can help me out with this, so I am just adding both of
> > the help communities. I have followed your documentation on setting up
> > Drill, and I am able to query files locally (meaning classpath and dfs),
> > but I am not able to do it with S3. I am not using Amazon S3; I am using
> > S3-compatible storage from Dell EMC. I need your help in setting up the
> > storage plugin file so that it uses path-style addressing instead of the
> > normal URL method. Could you please share an example core-site.xml file
> > as well as an example storage plugin file that uses the path-style
> > addressing method to connect to S3?
> >
> > I have tried using fs.s3a.path.style.access with value true in both the
> > core-site and storage plugin files, but the path-style addressing is
> > still not being read by Drill, and it again tries to connect using the
> > URL method it uses to connect to Amazon S3. Kindly help.
> >
> > Just an FYI, I have followed the steps from the "Drill in 10 Minutes"
> > documentation to install and connect to my S3-compatible storage.
> >
> > Awaiting your reply.
> >
>


Re: Hangout tomorrow (get your tickets now)

2018-06-12 Thread Parth Chandra
Short summary of the call -

Attendees:
  Abhishek, Aman, Arina, Oleksander, Gautam, Samiksha, Timothy, Vitalii, Parth,
DC, Sorabh, Jyothsna, Chun, Robert, Hau, Boaz, Pritesh,  Kunal

Arina - case insensitive storage plugin and workspace names
  General agreement on case insensitive names. Let us implement and get
feedback from users. Backward compatibility to be considered. Arina pointed
out that names differing only in case are already broken, so this needs to
be fixed. Discussion is ongoing on the mailing list.

Samiksha
  Problem with Hive configuration. SHOW TABLES shows no tables. Arina
suggested that it could be because of running in embedded mode. Pritesh
offered to connect Samiksha with someone who can help.

Timothy
  Presentation on the resource management proposal. Tim had already
presented this offline to some members of the dev team. Overview of current
system and the tuning parameters.
  Aman asked about buffered operators exceeding their memory limits. Tim
clarified that setLenient allows operators to exceed memory limits.
  New proposal was discussed. Questions around planning and queueing of
queries. Aman suggested a two level assignment: an initial assignment based
on user and a second level based on resource availability constraints.
  Timothy to continue with a second presentation on hangout to allow more
feedback and discussion.

  Presentation can be found here:
https://docs.google.com/presentation/d/1QuB0hHyX4yupcmtbfYl7WWOQVdWdnst-1QP1oIrkZv0/edit?usp=sharing


On Tue, Jun 12, 2018 at 5:02 AM, Arina Yelchiyeva <
arina.yelchiy...@gmail.com> wrote:

> I'd like to discuss case insensitive storage plugin and workspaces names
> (sent the email to the dev & user mailing lists with the details).
>
> Kind regards,
> Arina
>
> On Tue, Jun 12, 2018 at 4:07 AM Timothy Farkas  wrote:
>
> > Hi,
> >
> > I'd like to give the presentation for the resource management proposal.
> >
> > Thanks,
> > Tim
> > 
> > From: Parth Chandra 
> > Sent: Monday, June 11, 2018 5:02:51 PM
> > To: dev; user@drill.apache.org
> > Subject: Hangout tomorrow (get your tickets now)
> >
> > We'll have the Drill hangout tomorrow, Jun 12th, 2018, at 10:00 PDT.
> >
> > If you have any topics to discuss, send a reply to this post or just join
> > the hangout.
> >
> > ( Drill hangout link
> > <https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc> )
> >
> > Thanks
> >
> > Parth
> >
>


Hangout tomorrow (get your tickets now)

2018-06-11 Thread Parth Chandra
We'll have the Drill hangout tomorrow, Jun 12th, 2018, at 10:00 PDT.

If you have any topics to discuss, send a reply to this post or just join
the hangout.

( Drill hangout link
<https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc> )

Thanks

Parth


Re: Failed to fetch parquet metadata after 15000ms

2018-05-10 Thread Parth Chandra
That might be it. How big is the schema of your data? Do you have lots of
fields? If parquet-tools cannot read the metadata, there is little chance
anybody else will be able to do so either.


On Thu, May 10, 2018 at 9:57 AM, Carlos Derich <carlosder...@gmail.com>
wrote:

> Hey Parth, thanks for the response !
>
> I tried fetching the metadata using parquet-tools Hadoop mode instead, and
> I get OOM errors: Heap and GC limit exceeded.
>
> It seems that my problem is actually resource related, still a bit weird
> how parquet metadata read is so hungry ?
>
> It seems that even after a restart (clean state/no queries running) only
> ~4GB mem is free from a 16GB machine.
>
> I am going to run the tests on a bigger machine, and will tweak the JVM
> options and will let you know.
>
> Regards.
> Carlos.
>
> On Wed, May 9, 2018 at 9:04 PM, Parth Chandra <par...@apache.org> wrote:
>
> > The most common reason I know of for this error is if you do not have
> > enough CPU. Both Drill and the distributed file system will be using cpu
> > and sometimes the file system, especially if it is distributed, will take
> > too long. With your configuration and data set size, reading the file
> > metadata should take no time at all (I'll assume the metadata in the
> files
> > is reasonable and not many MB itself).  Is your system by any chance
> > overloaded?
> >
> > Also, call me paranoid, but seeing /tmp in the path makes me suspicious.
> > Can we assume the files are written completely when the metadata read is
> > occurring? They probably are, since you can query the files individually,
> > but I'm just checking to make sure.
> >
> > Finally, there is a similar JIRA
> > https://issues.apache.org/jira/browse/DRILL-5908, that looks related.
> >
> >
> >
> >
> > On Wed, May 9, 2018 at 4:15 PM, Carlos Derich <carlosder...@gmail.com>
> > wrote:
> >
> > > Hello guys,
> > >
> > > Asking this question here because I think i've hit a wall with this
> > > problem, I am consistently getting the same error, when running a query
> > on
> > > a directory-based parquet file.
> > >
> > > The directory contains six 158MB parquet files.
> > >
> > > RESOURCE ERROR: Waited for 15000ms, but tasks for 'Fetch parquet
> > > metadata' are not complete. Total runnable size 6, parallelism 6.
> > >
> > >
> > > Both queries fail:
> > >
> > > *select count(*) from dfs.`/tmp/37454954-3c0a-47c5-
> 9793-1c333d87fbbb/`*
> > >
> > > *select * from dfs.`/tmp/37454954-3c0a-47c5-9793-1c333d87fbbb/`
> > > limit 1*
> > >
> > > BUT If I try running any other query in any of the 6 parquet files
> inside
> > > the directory it works fine:
> > > eg:
> > > *select * from
> > > dfs.`/tmp/37454954-3c0a-47c5-9793-1c333d87fbbb/185d3076-v_
> > docker_node0001-
> > > 140526122190592.parquet`*
> > >
> > > Running *`refresh table metadata`* gives me the exact same error.
> > >
> > > Also tried to set *planner.hashjoin* to false.
> > >
> > > Checking the drill source it seems that the wait metadata timeout is
> not
> > > configurable.
> > >
> > > Have any of you faced a similar situation ?
> > >
> > > Running this locally on my 16GB RAM machine, hdfs in a single node.
> > >
> > > I also found an open ticket with the same error message:
> > > https://issues.apache.org/jira/browse/DRILL-5903
> > >
> > > Thank you in advance.
> > >
> >
>


Re: no current connection error when accessing drill in sql line in distributed mode

2018-05-10 Thread Parth Chandra
Seems like the command line arguments are not getting passed in correctly.
Can you try putting the arguments to -u in quotes?
sqlline -u "jdbc:drill:zk=xx.xx.xx.x:5181,xx.xx.xx.x:5181,xx.xx.xx.x:5181"
Also make sure you're not picking up a different script called sqlline.

On Wed, May 9, 2018 at 8:52 PM, Divya Gehlot 
wrote:

> Hi,
> I am trying to access Drill through Sqlline ,I am getting below error :
>
> [mapr@usazprdmapr-dn1 bin]$ sqlline –u
> > jdbc:drill:zk=xx.xx.xx.x:5181,xx.xx.xx.x:5181,xx.xx.xx.x:5181
> > OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512M;
> > support was removed in 8.0
> > –u (No such file or directory)
> > jdbc:drill:zk=xx.xx.xx.x:5181,xx.xx.xx.x:5181,xx.xx.xx.x:5181 (No such
> > file or directory)
> > apache drill 1.10.0
> > "a drill is a terrible thing to waste"
> > sqlline> show datatables;
> > No current connection
> > sqlline> select * from sys.version ;
> > No current connection
>
>
> Appreciate the help !
> Thanks,
> Divya
>


Re: Failed to fetch parquet metadata after 15000ms

2018-05-09 Thread Parth Chandra
The most common reason I know of for this error is if you do not have
enough CPU. Both Drill and the distributed file system will be using cpu
and sometimes the file system, especially if it is distributed, will take
too long. With your configuration and data set size, reading the file
metadata should take no time at all (I'll assume the metadata in the files
is reasonable and not many MB itself).  Is your system by any chance
overloaded?

Also, call me paranoid, but seeing /tmp in the path makes me suspicious.
Can we assume the files are written completely when the metadata read is
occurring? They probably are, since you can query the files individually,
but I'm just checking to make sure.

Finally, there is a similar JIRA
https://issues.apache.org/jira/browse/DRILL-5908, that looks related.




On Wed, May 9, 2018 at 4:15 PM, Carlos Derich 
wrote:

> Hello guys,
>
> Asking this question here because I think i've hit a wall with this
> problem, I am consistently getting the same error, when running a query on
> a directory-based parquet file.
>
> The directory contains six 158MB parquet files.
>
> RESOURCE ERROR: Waited for 15000ms, but tasks for 'Fetch parquet
> metadata' are not complete. Total runnable size 6, parallelism 6.
>
>
> Both queries fail:
>
> *select count(*) from dfs.`/tmp/37454954-3c0a-47c5-9793-1c333d87fbbb/`*
>
> *select * from dfs.`/tmp/37454954-3c0a-47c5-9793-1c333d87fbbb/`
> limit 1*
>
> BUT If I try running any other query in any of the 6 parquet files inside
> the directory it works fine:
> eg:
> *select * from
> dfs.`/tmp/37454954-3c0a-47c5-9793-1c333d87fbbb/185d3076-v_docker_node0001-
> 140526122190592.parquet`*
>
> Running *`refresh table metadata`* gives me the exact same error.
>
> Also tried to set *planner.hashjoin* to false.
>
> Checking the drill source it seems that the wait metadata timeout is not
> configurable.
>
> Have any of you faced a similar situation ?
>
> Running this locally on my 16GB RAM machine, hdfs in a single node.
>
> I also found an open ticket with the same error message:
> https://issues.apache.org/jira/browse/DRILL-5903
>
> Thank you in advance.
>


Re: Handshake Error

2018-05-09 Thread Parth Chandra
Try to set the logging level of the drillbit to trace. In your logback.xml
-
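
A minimal sketch of the kind of logger entry meant here (the logger name and
appender ref are assumptions -- adjust them to whatever your logback.xml
actually uses for the drillbit):

  <logger name="org.apache.drill.exec.rpc" additivity="false">
    <!-- trace level so the handshake messages below show up in the log -->
    <level value="trace" />
    <appender-ref ref="FILE" />
  </logger>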






There should be a message starting with "Handling handshake ... "





On Wed, May 9, 2018 at 11:12 AM, Peter Edike <
peter.ed...@interswitchgroup.com> wrote:

> I have increased the timeout and I am quite sure there are no
> authentication modules enabled. Is there any way I can check if the request
> is getting to the server in the first place?
>
> On May 9, 2018 7:09 PM, Parth Chandra <par...@apache.org> wrote:
> If you haven't tried it already, try increasing the handshake timeout. Do
> you have any security/authentication settings turned on? One possibility is
> that an authentication module is being accessed by the server during the
> handshake and the server is taking too long to reply to the handshake
> causing the timeout.
>
>
>
>
>
> On Wed, May 9, 2018 at 3:05 AM, Peter Edike <
> peter.ed...@interswitchgroup.com> wrote:
>
> >
> > Hello everyone
> >
> > I am trying to setup a datasource to connect to a drillbit running on a
> > host
> >  I am using the Direct To DrillBit Option And have specified the
> > ip-address of the server on which the drill bit service is running as
> well
> > as the appropriate ports
> > I can telnet from my windows pc into this port via the telnet command but
> > all attempts to initiate a connection from DSN Setup Dialog Box Fails
> with
> > Handshake Error.
> >
> > FAILED!
> >
> > [MapR][Drill] (1010) Error occurred while trying to connect:
> [MapR][Drill]
> > (40)  Handshake timed out (HandshakeTimeout=5) while trying to connect to
> > local=172.x.x.x:31010. Check whether Drillbit is running in a healthy
> state
> > or increase the timeout.
> >
> > Warm Regards
> > Peter E
> > 
> >
> > This message has been marked as CONFIDENTIAL on Wednesday, May 9, 2018 @
> > 11:05:23 AM
> >
> >
>


Re: Handshake Error

2018-05-09 Thread Parth Chandra
If you haven't tried it already, try increasing the handshake timeout. Do
you have any security/authentication settings turned on? One possibility is
that an authentication module is being accessed by the server during the
handshake and the server is taking too long to reply to the handshake
causing the timeout.





On Wed, May 9, 2018 at 3:05 AM, Peter Edike <
peter.ed...@interswitchgroup.com> wrote:

>
> Hello everyone
>
> I am trying to setup a datasource to connect to a drillbit running on a
> host
>  I am using the Direct To DrillBit Option And have specified the
> ip-address of the server on which the drill bit service is running as well
> as the appropriate ports
> I can telnet from my windows pc into this port via the telnet command but
> all attempts to initiate a connection from DSN Setup Dialog Box Fails with
> Handshake Error.
>
> FAILED!
>
> [MapR][Drill] (1010) Error occurred while trying to connect: [MapR][Drill]
> (40)  Handshake timed out (HandshakeTimeout=5) while trying to connect to
> local=172.x.x.x:31010. Check whether Drillbit is running in a healthy state
> or increase the timeout.
>
> Warm Regards
> Peter E
> 
>
> This message has been marked as CONFIDENTIAL on Wednesday, May 9, 2018 @
> 11:05:23 AM
>
>


Re: Not Able to Query Part files Using Drill

2018-05-07 Thread Parth Chandra
What part files are these? Can you share the workspace settings? Also, what
is the detailed error message you're getting?

On Mon, May 7, 2018 at 3:37 AM, Surneni Tilak 
wrote:

> Hi Team,
>
> I am trying to run a Drill query on part files present in the local file
> system, but Drill is throwing a "Table not found" error. I think it is not
> able to identify the file format and is giving that error. I have tried to
> include the default input format option mentioned on the Drill website in my
> workspace, but the same error keeps repeating. Please help me with this.
>
> Best regards,
> _
> Tilak
>
>
>


Re: Apache Drill + Google Storage via GCS Connector and Dataproc

2018-05-07 Thread Parth Chandra
Drill uses HDFS to connect to cloud storage, so if you can connect to the
storage via HDFS, you should be able to connect using Drill.
When you configure the FileSystem storage plugin, you might need to specify
the URL as "gs://" instead of "hdfs://" [1].
If you have already done so, it is possible that the cloud storage
connector is not in the Drill class path.
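
A rough sketch of such a plugin configuration (the bucket name and workspace
are placeholders; "type": "file" is the ordinary file system plugin type):

  {
    "type": "file",
    "connection": "gs://your-bucket-name",
    "workspaces": {
      "root": {
        "location": "/",
        "writable": false,
        "defaultInputFormat": null
      }
    },
    "formats": {
      "parquet": { "type": "parquet" },
      "json": { "type": "json", "extensions": ["json"] }
    }
  }

This assumes the GCS connector jar (and any core-site.xml settings it needs)
is visible on the Drill class path of every drillbit.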

HTH.

[1] https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage



On Mon, May 7, 2018 at 8:51 AM, Joe Auty  wrote:

> Hello,
>
> We are trying to get this combo working via the instructions found here:
>
> https://stackoverflow.com/questions/32883965/apache-drill-
> using-google-cloud-storage
>
> We are attempting this with a manually compiled version of 1.14, and can
> query GS via gsutil/hadoop commands which suggests the GCS connector is
> installed correctly and working as expected.
>
>
> I don't know if there will be any takers here to help us get into the
> weeds with this particular problem (we are seeing
> "org.apache.drill.common.exceptions.UserRemoteException: VALIDATION
> ERROR: Failed to create DrillFileSystem for proxy user: No FileSystem for
> scheme: gs SQL Query null"), and of course I'm happy to provide more
> details, but if not it would also be helpful to address some more high
> level questions about this:
>
>
> 1) is there a way to get this to work currently?
> 2) if so, is there a reason why this isn't documented (e.g. is it not
> properly tested/recommended?)
> 3) will GS be an officially supported plugin at some point in the future?
>
>
> Thanks in advance!
>


Re: [Drill 1.9.0] : [CONNECTION ERROR] :- (user client) closed unexpectedly. Drillbit down?

2018-03-19 Thread Parth Chandra
Hi Anup,

 I don't have full context for the proposed hack, and it might have worked,
but looks like Vlad has addressed the issue in the right place. Perhaps you
can try out 1.13.0 and let us all know.

Thanks

Parth

On Sat, Mar 17, 2018 at 11:43 AM, Anup Tiwari <anup.tiw...@games24x7.com>
wrote:

> Thanks Parth for the info. I am really looking forward to it.
> But can you tell me if the second part (about the hack) was right or not?
> Because I really want to test it, as we have hit this issue several times
> in the last 2-3 days post upgrading to 1.12.0.
> Also, I have seen that sometimes after a lost connection, drillbit gets
> killed on few/all nodes and I am not getting any logs in
> drillbit.out/drillbit.log.
>
>
>
>
> On Fri, Mar 16, 2018 11:07 PM, Parth Chandra par...@apache.org  wrote:
> On Fri, Mar 16, 2018 at 8:10 PM, Anup Tiwari <anup.tiw...@games24x7.com>
> wrote:
>
> > Hi All,
> > I was just going through this post and found very good suggestions.
> > But this issue is still there in Drill 1.12.0 and i can see
> > https://issues.apache.org/jira/browse/DRILL-4708 is now marked as
> > resolved in "1.13.0" so i am hoping that this will be fixed in drill 1.13.0.
> > Few things i want to ask :-
> > - Any Planned date for Drill 1.13.0 release?
>
> Real Soon Now. :)
> The release will be out in a couple of days. Watch this list for an
> announcement.
>
> Regards,
> Anup Tiwari


[ANNOUNCE] Apache Drill release 1.13.0

2018-03-18 Thread Parth Chandra
On behalf of the Apache Drill community, I am happy to announce the release of
Apache Drill 1.13.0.

For information about Apache Drill, and to get involved, visit the project
website [1].

This release of Drill provides the following new features and improvements:

- YARN support for Drill [DRILL-1170]

- Support HTTP Kerberos auth using SPNEGO [DRILL-5425]

- Support SQL syntax highlighting of queries [DRILL-5868]

- Drill should support user/distribution specific configuration checks
during startup [DRILL-6068]

- Upgrade DRILL to Calcite 1.15.0 [DRILL-5966]

- Batch Sizing improvements to reduce memory footprint of operators

  - [DRILL-6071] - Limit batch size for flatten operator

  - [DRILL-6126] - Allocate memory for value vectors upfront in flatten
operator

  - [DRILL-6123] - Limit batch size for Merge Join based on memory.

  - [DRILL-6177] - Merge Join - Allocate memory for outgoing value vectors
based on sizes of incoming batches.


For the full list please see release notes [2].

The binary and source artifacts are available here [3].

Thanks to everyone in the community who contributed to this release!

1. https://drill.apache.org/
2. https://drill.apache.org/docs/apache-drill-1-13-0-release-notes/
3. https://drill.apache.org/download/


Re: [Drill 1.9.0] : [CONNECTION ERROR] :- (user client) closed unexpectedly. Drillbit down?

2018-03-16 Thread Parth Chandra
On Fri, Mar 16, 2018 at 8:10 PM, Anup Tiwari 
wrote:

> Hi All,
> I was just going through this post and found very good suggestions.
> But this issue is still there in Drill 1.12.0 and i can see
> https://issues.apache.org/jira/browse/DRILL-4708 is now marked as
> resolved in
> "1.13.0" so i am hoping that this will be fixed in drill 1.13.0.
> Few things i want to ask :-
> - Any Planned date for Drill 1.13.0 release?
>


Real Soon Now.  :)
The release will be out in a couple of days. Watch this list for an
announcement.


Minutes - Drill Hangout Nov 28 2017

2017-11-28 Thread Parth Chandra
Attendees:  Rob, Arina, Paul, Vitali, Volodymyr, Prasad, Kunal, Padma,
Parth.


The only topic of discussion was the 1.12 release. Arina and Paul discussed
the last remaining issue and we decided that the graceful shutdown feature
will be off by default.
With that there are no outstanding issues for the release and Arina will be
rolling out the release candidate tomorrow morning.

Get ready to vote !

Parth


On Mon, Nov 27, 2017 at 7:44 PM, Parth Chandra <par...@apache.org> wrote:

> We'll have the Drill Hangout  tomorrow Nov 28th, at 10 AM PST.
> As usual, please send email if you have any topics to discuss or bring
> them up on the call.
>
> We'll start with the release plan :)
>
> Hangout link:
> https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc
>
> Thanks
>
> Parth
>


Drill Hangout Nov 28 2017

2017-11-27 Thread Parth Chandra
We'll have the Drill Hangout  tomorrow Nov 28th, at 10 AM PST.
As usual, please send email if you have any topics to discuss or bring them
up on the call.

We'll start with the release plan :)

Hangout link:
https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc

Thanks

Parth


Re: [HANGOUT] Topics for 10/31/2017

2017-10-31 Thread Parth Chandra
Looks like the call is full. A bunch of us are not able to join.
Can some other folks get together and join as a group?



On Tue, Oct 31, 2017 at 10:04 AM, Parth Chandra <par...@apache.org> wrote:

> Starting the hangout in a bit ...
>
> On Mon, Oct 30, 2017 at 1:12 PM, Timothy Farkas <tfar...@mapr.com> wrote:
>
>> I'll speak about unit testing:
>>
>>
>>  - Common mistakes that were made in unit tests
>>
>>  - The soon to be merged temp directory test watcher classes
>>
>>  - Making our Travis build run smoke tests with code coverage
>>
>>  - Other misc improvements to be made
>>
>> Thanks,
>> Tim
>>
>> 
>> From: Gautam Parai <gpa...@mapr.com>
>> Sent: Monday, October 30, 2017 9:26:34 AM
>> To: d...@drill.apache.org; user@drill.apache.org
>> Subject: [HANGOUT] Topics for 10/31/2017
>>
>> Hi,
>>
>> We will have a Drill hangout tomorrow (Tuesday Oct 31) at 10 AM Pacific
>> Time. Please suggest topics by replying to this thread or bring them up
>> during the hangout.
>>
>> Hangout link: https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc
>>
>> Thanks,
>> Gautam
>>
>
>


Re: Exception while reading parquet data

2017-10-16 Thread Parth Chandra
Hi Projjwal,

  Unfortunately, I did not get a crash when I tried with your sample file.
Also, if turning off the buffered reader did not help, did you get a
different stack trace?

  Any more information you can provide will be useful. Is this part of a
larger query with more parquet files being read? Are you reading all the
columns? Is there some specific column that appears to trigger the issue?

  You can mail this info directly to me if you are not comfortable sharing
your data on the public list.

Thanks

Parth


On Mon, Oct 16, 2017 at 8:19 AM, PROJJWAL SAHA <proj.s...@gmail.com> wrote:

> here is the link for the parquet data.
> https://drive.google.com/file/d/0BzZhvMHOeao1S2Rud2xDS1NyS00/
> view?usp=sharing
>
> Setting store.parquet.reader.pagereader.bufferedread=false did not solve
> the issue.
>
> I am using Drill 1.11. The parquet data is fetched from Oracle Storage
> Cloud Service using swift driver.
>
> Here is the error on the drill command prompt -
> Error: DATA_READ ERROR: Exception occurred while reading from disk.
>
> File:
> /data1GBparquet/storereturns/part-0-7ce26fde-f342-4aae-
> a727-71b8b7a60e63.parquet
> Column:  sr_return_time_sk
> Row Group Start:  417866
> File:
> /data1GBparquet/storereturns/part-0-7ce26fde-f342-4aae-
> a727-71b8b7a60e63.parquet
> Column:  sr_return_time_sk
> Row Group Start:  417866
> Fragment 0:0
>
> On Sun, Oct 15, 2017 at 8:59 PM, Kunal Khatua <kkha...@mapr.com> wrote:
>
> > You could try uploading to Google Drive (since you have a Gmail account)
> > and share the link .
> >
> > Did Parth's suggestion of
> > store.parquet.reader.pagereader.bufferedread=false
> > resolve the issue?
> >
> > Also share the details of the hardware setup... #nodes, Hadoop version,
> > etc.
> >
> >
> > -Original Message-
> > From: PROJJWAL SAHA [mailto:proj.s...@gmail.com]
> > Sent: Sunday, October 15, 2017 8:07 AM
> > To: user@drill.apache.org
> > Subject: Re: Exception while reading parquet data
> >
> > Is there any place where I can upload the 12MB parquet data. I am not
> able
> > to send the file through mail to the user group.
> >
> > On Thu, Oct 12, 2017 at 10:58 PM, Parth Chandra <par...@apache.org>
> wrote:
> >
> > > Seems like a bug in BufferedDirectBufInputStream.  Is it possible to
> > > share a minimal data file that triggers this?
> > >
> > > You can also try turning off the buffering reader.
> > >store.parquet.reader.pagereader.bufferedread=false
> > >
> > > With async reader on and buffering off, you might not see any
> > > degradation in performance in most cases.
> > >
> > >
> > >
> > > On Thu, Oct 12, 2017 at 2:08 AM, PROJJWAL SAHA <proj.s...@gmail.com>
> > > wrote:
> > >
> > > > hi,
> > > >
> > > > disabling sync parquet reader doesnt solve the problem. I am getting
> > > > similar exception I dont see any issue with the parquet file since
> > > > the same file works on loading the same on alluxio.
> > > >
> > > > 2017-10-12 04:19:50,502
> > > > [2620da63-4efb-47e2-5e2c-29b48c0194c0:frag:0:0] ERROR
> > > > o.a.d.e.u.f.BufferedDirectBufInputStream - Error reading from stream
> > > > part-0-7ce26fde-f342-4aae-a727-71b8b7a60e63.parquet. Error was :
> > > > null
> > > > 2017-10-12 04:19:50,506
> > > > [2620da63-4efb-47e2-5e2c-29b48c0194c0:frag:0:0] ERROR
> > > > o.a.d.exec.physical.impl.ScanBatch - SYSTEM ERROR:
> > > > IndexOutOfBoundsException
> > > >
> > > >
> > > > [Error Id: 3b7c4587-c1b8-4e79-bdaa-b2aa1516275b ]
> > > > org.apache.drill.common.exceptions.UserException: SYSTEM ERROR:
> > > > IndexOutOfBoundsException
> > > >
> > > >
> > > > [Error Id: 3b7c4587-c1b8-4e79-bdaa-b2aa1516275b ]
> > > > at org.apache.drill.common.exceptions.UserException$
> > > > Builder.build(UserException.java:550)
> > > > ~[drill-common-1.11.0.jar:1.11.0]
> > > > at org.apache.drill.exec.physical.impl.ScanBatch.next(
> > > > ScanBatch.java:249)
> > > > [drill-java-exec-1.11.0.jar:1.11.0]
> > > > at org.apache.drill.exec.record.AbstractRecordBatch.next(
> > > > AbstractRecordBatch.java:119)
> > > > [drill-java-exec-1.11.0.jar:1.11.0]
> > > > at org.apache.drill.exec.record.AbstractRecordBatch.next(
> > > > AbstractRecordBatch.java:109)
> > > > [drill-jav

Re: Exception while reading parquet data

2017-10-12 Thread Parth Chandra
Seems like a bug in BufferedDirectBufInputStream.  Is it possible to share
a minimal data file that triggers this?

You can also try turning off the buffering reader.
   store.parquet.reader.pagereader.bufferedread=false

With async reader on and buffering off, you might not see any degradation
in performance in most cases.
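
For example, this can be switched off for the current session from sqlline
(session scope shown; ALTER SYSTEM would make it global):

  ALTER SESSION SET `store.parquet.reader.pagereader.bufferedread` = false;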



On Thu, Oct 12, 2017 at 2:08 AM, PROJJWAL SAHA  wrote:

> hi,
>
> disabling the sync parquet reader doesn't solve the problem. I am getting a
> similar exception.
> I don't see any issue with the parquet file, since the same file works when
> loaded on Alluxio.
>
> 2017-10-12 04:19:50,502
> [2620da63-4efb-47e2-5e2c-29b48c0194c0:frag:0:0] ERROR
> o.a.d.e.u.f.BufferedDirectBufInputStream - Error reading from stream
> part-0-7ce26fde-f342-4aae-a727-71b8b7a60e63.parquet. Error was :
> null
> 2017-10-12 04:19:50,506
> [2620da63-4efb-47e2-5e2c-29b48c0194c0:frag:0:0] ERROR
> o.a.d.exec.physical.impl.ScanBatch - SYSTEM ERROR:
> IndexOutOfBoundsException
>
>
> [Error Id: 3b7c4587-c1b8-4e79-bdaa-b2aa1516275b ]
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR:
> IndexOutOfBoundsException
>
>
> [Error Id: 3b7c4587-c1b8-4e79-bdaa-b2aa1516275b ]
> at org.apache.drill.common.exceptions.UserException$
> Builder.build(UserException.java:550)
> ~[drill-common-1.11.0.jar:1.11.0]
> at org.apache.drill.exec.physical.impl.ScanBatch.next(
> ScanBatch.java:249)
> [drill-java-exec-1.11.0.jar:1.11.0]
> at org.apache.drill.exec.record.AbstractRecordBatch.next(
> AbstractRecordBatch.java:119)
> [drill-java-exec-1.11.0.jar:1.11.0]
> at org.apache.drill.exec.record.AbstractRecordBatch.next(
> AbstractRecordBatch.java:109)
> [drill-java-exec-1.11.0.jar:1.11.0]
> at org.apache.drill.exec.record.AbstractSingleRecordBatch.
> innerNext(AbstractSingleRecordBatch.java:51)
> [drill-java-exec-1.11.0.jar:1.11.0]
> at org.apache.drill.exec.record.AbstractRecordBatch.next(
> AbstractRecordBatch.java:162)
> [drill-java-exec-1.11.0.jar:1.11.0]
> at org.apache.drill.exec.record.AbstractRecordBatch.next(
> AbstractRecordBatch.java:119)
> [drill-java-exec-1.11.0.jar:1.11.0]
> at org.apache.drill.exec.record.AbstractRecordBatch.next(
> AbstractRecordBatch.java:109)
> [drill-java-exec-1.11.0.jar:1.11.0]
> at org.apache.drill.exec.record.AbstractSingleRecordBatch.
> innerNext(AbstractSingleRecordBatch.java:51)
> [drill-java-exec-1.11.0.jar:1.11.0]
> at org.apache.drill.exec.physical.impl.svremover.
> RemovingRecordBatch.innerNext(RemovingRecordBatch.java:93)
> [drill-java-exec-1.11.0.jar:1.11.0]
> at org.apache.drill.exec.record.AbstractRecordBatch.next(
> AbstractRecordBatch.java:162)
> [drill-java-exec-1.11.0.jar:1.11.0]
> at org.apache.drill.exec.record.AbstractRecordBatch.next(
> AbstractRecordBatch.java:119)
> [drill-java-exec-1.11.0.jar:1.11.0]
> at org.apache.drill.exec.record.AbstractRecordBatch.next(
> AbstractRecordBatch.java:109)
> [drill-java-exec-1.11.0.jar:1.11.0]
> at org.apache.drill.exec.record.AbstractSingleRecordBatch.
> innerNext(AbstractSingleRecordBatch.java:51)
> [drill-java-exec-1.11.0.jar:1.11.0]
> at org.apache.drill.exec.physical.impl.project.
> ProjectRecordBatch.innerNext(ProjectRecordBatch.java:133)
> [drill-java-exec-1.11.0.jar:1.11.0]
> at org.apache.drill.exec.record.AbstractRecordBatch.next(
> AbstractRecordBatch.java:162)
> [drill-java-exec-1.11.0.jar:1.11.0]
> at org.apache.drill.exec.record.AbstractRecordBatch.next(
> AbstractRecordBatch.java:119)
> [drill-java-exec-1.11.0.jar:1.11.0]
> at org.apache.drill.exec.record.AbstractRecordBatch.next(
> AbstractRecordBatch.java:109)
> [drill-java-exec-1.11.0.jar:1.11.0]
> at org.apache.drill.exec.physical.impl.aggregate.
> HashAggBatch.buildSchema(HashAggBatch.java:111)
> [drill-java-exec-1.11.0.jar:1.11.0]
> at org.apache.drill.exec.record.AbstractRecordBatch.next(
> AbstractRecordBatch.java:142)
> [drill-java-exec-1.11.0.jar:1.11.0]
> at org.apache.drill.exec.record.AbstractRecordBatch.next(
> AbstractRecordBatch.java:119)
> [drill-java-exec-1.11.0.jar:1.11.0]
> at org.apache.drill.exec.record.AbstractRecordBatch.next(
> AbstractRecordBatch.java:109)
> [drill-java-exec-1.11.0.jar:1.11.0]
> at org.apache.drill.exec.physical.impl.xsort.
> ExternalSortBatch.buildSchema(ExternalSortBatch.java:264)
> [drill-java-exec-1.11.0.jar:1.11.0]
> at org.apache.drill.exec.record.AbstractRecordBatch.next(
> AbstractRecordBatch.java:142)
> [drill-java-exec-1.11.0.jar:1.11.0]
> at org.apache.drill.exec.record.AbstractRecordBatch.next(
> AbstractRecordBatch.java:119)
> [drill-java-exec-1.11.0.jar:1.11.0]
> at org.apache.drill.exec.record.AbstractRecordBatch.next(
> AbstractRecordBatch.java:109)
> [drill-java-exec-1.11.0.jar:1.11.0]
> at org.apache.drill.exec.record.AbstractSingleRecordBatch.
> 

Re: 1.11.0 RC question

2017-07-26 Thread Parth Chandra
You might have to add the pcap format to the dfs storage plugin config [1].

Something like this :

"formats": {
"csv": {
  "type": "text",
  "extensions": [
"csv"
  ],
  "delimiter": ","
},
"parquet": {
  "type": "parquet"
},
"json": {
  "type": "json",
  "extensions": [
"json"
  ]
},
"abc": {
  "type": "json",
  "extensions": [
"abc"
  ]
},
"pcap": {
  "type": "pcap",
  "extensions": [
"pcap"
  ]
}
  }

 Then you can specify "cap" as the the default format for any workspace.
Configuring that is also described in [1]
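
For instance, a workspace along these lines (the location is just a
placeholder) will make Drill treat files without a registered extension in
that directory as pcap:

  "workspaces": {
    "captures": {
      "location": "/data/pcaps",
      "writable": false,
      "defaultInputFormat": "pcap"
    }
  }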

Parth

[1] https://drill.apache.org/docs/plugin-configuration-basics/

On Wed, Jul 26, 2017 at 1:36 PM, Jinfeng Ni  wrote:

> Hi Bob,
>
> Is DRILL-5432 the one you are talking about? I saw it's merged and should
> have been put in the release candidate.
>
> What type of error did you see when you tried to query a PCAP? Also, it may
> help to provide the commit id of your build, by run the following query:
>
> SELECT * from sys.version;
>
>
> https://issues.apache.org/jira/browse/DRILL-5432
>
>
> On Wed, Jul 26, 2017 at 1:03 PM, Bob Rudis  wrote:
>
> > I wasn't sure if this belonged on the dev list or not but I was
> > peeking around the JIRA for 1.11.0 RC and noticed that it _looked_
> > like PCAP support is/was going to be in 1.11.0 but when I did a quick
> > d/l and test of the RC (early yesterday) and tried to query a PCAP it
> > did not work.
> >
> > I'm wondering if I just grabbed a too-early RC and shld try again or
> > if PCAP support will miss the 1.11.0 release. (I might have misread a
> > tweet from Ted that seemed to suggest it might not make it for
> > 1.11.0).
> >
> > If it's the latter, will that mean the mapr github pcap drill example
> > shld work as an interim substitute until 1.12.0 (NOTE: I haven't tried
> > that yet)?
> >
> > If PCAP support had not previously, actually made the cut for 1.11.0
> > RC can I make a last-minute req to have it be included? :-)
> >
> > thx for the hard work by the dev team. I ended up scanning through all
> > the JIRAs and that's _alot_ of work. it's definitely appreciated.
> >
> > thx,
> >
> > -Bob
> >
>


Re: [HANGOUT] Minutes for 7/11/17

2017-07-11 Thread Parth Chandra
Apache Drill Hangout - 2017-07-11

Attendees - Aman, Boaz, Kunal, Pritesh, Paul, Parth, Arina, Jinfeng, Nate,
Peachy, Roman, Vitali, Vova, Rob

1) 1.11.0 Release

 a) Arina to do a prebuild release by Thursday. Cutoff is Friday, provided
the following show stoppers are addressed:
   DRILL-5660 - Vitali's metadata fix needs a metadata version change.
   DRILL-5468 - Performance regression in tpch-18 since a Calcite master
merge. Jinfeng is investigating.
   DRILL-5659 - Rob pointed out that the C++ client APIs are unstable:
getColumns returns "unable to decode". This occurs since SASL encryption was
added. Parth to investigate.
   No Jira yet. New issue with an NPE during performance testing. Paul
investigating.

  b) Other Items
 DRILL-5616, DRILL-5665 (from Boaz)  - Paul to review.
 DRILL-5634 - (from Charles) Parth to confirm if we can include AES.
Depends on apache-commons.

  c) The following are deferred to 1.12.0
DRILL-5663
DRILL-5540
DRILL-5442
DRILL-4253


2) John Omernik's REST API authentication question -
   We need a design for the REST APIs. We're looking at implementing SPNEGO,
and can maybe take a look at a shorter-term solution as part of that.

3) Apache Drill twitter - need the dev team to engage on twitter. Pritesh
will try to work on getting folks more engaged.

4) Pritesh has a new dashboard in Apache Jira that gives a nice overview of
issues in progress. Pritesh will post on the dev list.


Thanks everyone for attending.



On Tue, Jul 11, 2017 at 10:00 AM, Parth Chandra <par...@apache.org> wrote:

> Starting now ..
>
> On Mon, Jul 10, 2017 at 11:33 AM, Parth Chandra <par...@apache.org> wrote:
>
>> We'll have the hangout tomorrow at the usual time. Any topics people want
>> discussed?
>>
>>
>>
>


Re: [HANGOUT] Topics for 7/11/17

2017-07-11 Thread Parth Chandra
Starting now ..

On Mon, Jul 10, 2017 at 11:33 AM, Parth Chandra <par...@apache.org> wrote:

> We'll have the hangout tomorrow at the usual time. Any topics people want
> discussed?
>
>
>


[HANGOUT] Topics for 7/11/17

2017-07-10 Thread Parth Chandra
We'll have the hangout tomorrow at the usual time. Any topics people want
discussed?


Re: Drill Summit/Conference Proposal

2017-06-19 Thread Parth Chandra
I'm a little late coming into this conversation, but wanted to chip in and
add my +1 to the idea of working on getting Drill more exposure. Thanks
Charles for taking this up!



On Sun, Jun 18, 2017 at 7:58 AM, Charles Givre  wrote:

> All,
> I’d like to thank everyone for their responses and ideas for a
> DrillCon/Meetup/GetTogether/Whatever….  (Please don’t stop)
> At this point, I’m going to contact the organizers of ApacheCon and OsCon
> and see if either (or both) would be willing to start a track for Drill.
> I’ll keep the board posted as to my progress.
> Please don’t stop the conversation, I’d really like to everyone’s input
> and ideas.  Hopefully this goes somewhere!
> — C
>
>
>
> > On Jun 18, 2017, at 10:55, Ted Dunning  wrote:
> >
> > On Sat, Jun 17, 2017 at 11:03 PM, Charles Givre 
> wrote:
> >
> >> I've never been but what about OsCon?
> >>
> >
> > Great option. It is bigger and better attended than ApacheCon (lately).
> And
> > they allow specialized tracks.
>
>


Re: R interface to Drill heading to CRAN (last call for issues/features)

2017-06-19 Thread Parth Chandra
Hi Bob,

  This is cool stuff. Glad you posted a link to it.
  If you have any thoughts on improvements to Drill's APIs that would help
your effort, please post on the dev list.

Parth

On Sat, Jun 17, 2017 at 6:11 PM, Bob Rudis  wrote:

>
>
> Most recently, Drill + sergeant & R were used to analyze the results
> of 30 TCP port scans of over 160 million internet hosts in one of our
> annual cybersecurity research efforts at Rapid7 (ref:
> https://www.rapid7.com/data/national-exposure/2017.html).
>
> Many thanks, also, to the Drill dev team. It's an awesome tool & ecosystem.
>
> -Bob
>


Re: ideal drill node size

2017-02-07 Thread Parth Chandra
I would second John's suggestion that you should try a single large
machine, taking care to get your memory settings right. In general, Drill
will use both CPU and memory, and in your setup, you will probably get
better (and more predictable) performance with a single node setup.
As John mentioned, please share your results.
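
To make the kind of tuning John mentions below concrete (all values here are
purely illustrative, not recommendations): the drillbit heap and direct
memory are set in conf/drill-env.sh, e.g.

  export DRILL_HEAP="16G"
  export DRILL_MAX_DIRECT_MEMORY="400G"

and the per-query planning limit can be raised at runtime (value in bytes):

  ALTER SYSTEM SET `planner.memory.max_query_memory_per_node` = 107374182400;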


On Mon, Feb 6, 2017 at 10:22 AM, John Omernik  wrote:

> I think you would be wasting quite a bit of your server if you split it up
> into multiple vms. Instead, I am thinking a larger drill bit size wise
> (ensure you are upping your ram as high as you can) would be best.  Note I
> am not an expert on this stuff, I would like an experts take as well. Here
> is a link on configuring Drill memory:
> https://drill.apache.org/docs/configuring-drill-memory/
>
> Another thing with such a heavy weight server is you will likely need to
> adjust defaults in memory to take advantage of more of the memory. (Drill
> folks correct me if I am wrong). Settings like
> planner.memory.max_query_memory_per_node will
> need to be set up to take advantage of more of your memory.  It will be
> very interesting to see where the bottleneck in a setup like yours is...
> please share results!
>
>
>
> On Sat, Feb 4, 2017 at 11:37 AM, Christian Hotz-Behofsits <
> chris.hotz.behofs...@gmail.com> wrote:
>
> > Hi,
> >
> > I have a big server (512gb RAM, 32 cores, storage is connected by FC) and
> > want to use it for drill. Should I split it into several VM and build a
> > cluster or should I use it as a single node? I know that splitting would
> > introduce a overhead (guest-OS). But a cluster might provide better
> > parellization of the tasks.
> >
> >
> >- Are there any best practices in terms of node size (memory, CPU)?
> >- Does drill favor a lot of small nodes or few (one) big node?
> >
> >
> > cheers
> >
>


Re: Additional information on JSON_SUB_SCAN operator and access to query profiles not from the Web UI

2017-01-19 Thread Parth Chandra

The time in the query profile is the exact time. The console (sqlline) will
include the time it took to get the data back from the server (and possibly
the time to format/display the data).

Query planning and optimization time is not explicitly reported but you can get 
an idea of it by looking at the first start time for all the fragments. The 
earliest start time will be the time the server took to plan and optimize.




From: Nikos R. Katsipoulakis <nick.kat...@gmail.com>
Sent: Thursday, January 19, 2017 4:10:59 PM
To: user@drill.apache.org
Subject: Re: Additional information on JSON_SUB_SCAN operator and access to 
query profiles not from the Web UI

Hello again,

Thank you Parth for your suggestions! I will try to follow your
instructions and do something like you suggested on the server.

In addition, I noticed something odd: When I execute a query on Drill's
console (terminal) I get an execution time (let's say) X. When I get the
execution profile from the profiler on the Web Console, I see that an
execution time Y is reported, which is always less than X. From what I
understand, the profiler does not include in its timer some additional
operations, which are included on the time reported on the Drill Console.
Why does the previous happen? Is there any chance that in the execution
times reported in Drill's console are included additional startup costs for
a query (like query parsing, evaluation, optimization etc.)? If yes, can I
get an exact breakdown of the time spent for a query?

Thank you,
Nikos

On Thu, Jan 19, 2017 at 5:48 PM, Parth Chandra <pchan...@mapr.com> wrote:

> JSON_SUB_SCAN is the JSON reader. It uses Jackson to do the actual parsing,
> and converts the data into Drill's internal value vector format.
> TEXT_SUB_SCAN is the corresponding operator for CSV.
>
> If the Drill system has access to the /log/profile directory then you can,
> in fact, use Drill to query the json in the query profile. You might want
> to setup an nfs location for the query profiles,so that the directory is
> visible to all drillbits.  The simply create a new workspace pointing to
> the directory. You will be able to read the profiles like any other Json
> file.
>
> 
> From: Nikos R. Katsipoulakis <nick.kat...@gmail.com>
> Sent: Wednesday, January 11, 2017 7:37:30 AM
> To: user@drill.apache.org
> Subject: Additional information on JSON_SUB_SCAN operator and access to
> query profiles not from the Web UI
>
> Hello all,
>
> I am a new user of Apache Drill and I am in the process of better
> understanding its internals. To that end, I have two questions, for which I
> was unable to find more information online.
>
> First, when I execute an EXPLAIN command for a query that gets its data
> from JSON files, I see a physical operator named JSON_SUB_SCAN. What does
> that operator exactly do? Is it only used for parsing (extracting) fields
> from JSON data? Or does it perform additional processing? As far as I know,
> Drill uses Jackson Streaming API for extracting JSON data. Is that still
> true? Finally, what is the equivalent operator for CSV files?
>
> Second, I need to access query profiles from a server that is behind a
> firewall. Therefore, accessing the URL of that machine on port 8047 is a
> headache (since I have to submit a ticket to IT Support). My question is
> whether I can access the Query Profiles in any other way? Like from the
> sqlline or through log/profile files created while executing queries.
>
> Thank you and Kind Regards,
>
> --
> Nikos R. Katsipoulakis,
> Department of Computer Science
> University of Pittsburgh
>



--
Nikos R. Katsipoulakis,
Department of Computer Science
University of Pittsburgh


Re: Additional information on JSON_SUB_SCAN operator and access to query profiles not from the Web UI

2017-01-19 Thread Parth Chandra
JSON_SUB_SCAN is the JSON reader. It uses Jackson to do the actual parsing, and
converts the data into Drill's internal value vector format. TEXT_SUB_SCAN is
the corresponding operator for CSV.

If the Drill system has access to the /log/profile directory then you can, in
fact, use Drill to query the JSON in the query profile. You might want to set
up an NFS location for the query profiles, so that the directory is visible to
all drillbits. Then simply create a new workspace pointing to the directory.
You will be able to read the profiles like any other JSON file.
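
A sketch of what that workspace might look like in the dfs plugin (the
location is a placeholder for wherever your drillbits write profiles):

  "profiles": {
    "location": "/var/log/drill/profiles",
    "writable": false,
    "defaultInputFormat": "json"
  }

With that in place, the profile files (which are JSON underneath) can be
queried directly, e.g. select * from dfs.profiles.`some-query-id.sys.drill`;
where "some-query-id" stands in for an actual profile file name.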


From: Nikos R. Katsipoulakis 
Sent: Wednesday, January 11, 2017 7:37:30 AM
To: user@drill.apache.org
Subject: Additional information on JSON_SUB_SCAN operator and access to query 
profiles not from the Web UI

Hello all,

I am a new user of Apache Drill and I am in the process of better
understanding its internals. To that end, I have two questions, for which I
was unable to find more information online.

First, when I execute an EXPLAIN command for a query that gets its data
from JSON files, I see a physical operator named JSON_SUB_SCAN. What does
that operator exactly do? Is it only used for parsing (extracting) fields
from JSON data? Or does it perform additional processing? As far as I know,
Drill uses Jackson Streaming API for extracting JSON data. Is that still
true? Finally, what is the equivalent operator for CSV files?

Second, I need to access query profiles from a server that is behind a
firewall. Therefore, accessing the URL of that machine on port 8047 is a
headache (since I have to submit a ticket to IT Support). My question is
whether I can access the Query Profiles in any other way? Like from the
sqlline or through log/profile files created while executing queries.

Thank you and Kind Regards,

--
Nikos R. Katsipoulakis,
Department of Computer Science
University of Pittsburgh


Apache Drill Hangout minutes - 2016-12-13

2016-12-13 Thread Parth Chandra
Attendees: Arina, Boaz, Chunhui, Gautam, Karthikeyan, Khurram, Padma,
Parth, Roman, Paul, Serhiy, Sonny, Vitalii.

Serhiy - JIRA status workflow suggestion: admins can create workflows. It
needs an Apache infrastructure person to change this, and we don't know how
easy it will be.

Karthik - Netty version should be upgraded. We cannot do this because there
is a problem with increased memory usage that is fixed only in the version
Drill uses. Subsequent releases of Netty undid the change. We need to try
out the new versions and, if the problem has been reintroduced, work with
the Netty team to get it fixed.

Khurram - Question about Calcite rebase; many recent issues logged in Drill
have been fixed in Calcite. Roman testing dynamic UDFs, then will work on
Calcite.

Vitalii - Hive UDFs. Built-in functions, especially date functions, can be
used by different storage plugins, but cannot be used from the test
methods. They can be used only in the Hive module, which is expected.

Sonny - Student data from 12-15K universities and schools. Building a data
lake and running analytics. Looking at using filter pushdown capabilities.

Arina - design doc repository.
Temp tables - design doc is now on gist. Can we create a doc hub for Drill?
Or put into a gist doc and contributors can submit pull request.  We should
put the docs where they are archived. Also where they can be reviewed
easily.
Current best method - google docs, then after review in the contributors
github gist. Suggestion to put these docs in the Apache github.

Arina - temp tables
Is it worth creating a user specific temporary workspace? Can create any
tables, etc., but temp tables always only go here. If the user workspace
does not exist, temp tables to to temp workspace. Concerns about creating
tables in user workspace that can cause re-computation of statistics. Paul,
Arina will discuss offline.
Two temp tables with the same name as a persistent table (that might
already exist). Currently allowing it, and if both exist, then give the
temp table precedence. Concern that this is not right as there is no way to
really disambiguate.


hangout starting in a bit

2016-12-13 Thread Parth Chandra



Re: hangout starting in a bit

2016-12-13 Thread Parth Chandra
hangout is on:
https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc


On Tue, Dec 13, 2016 at 10:02 AM, Khurram Faraaz <kfar...@maprtech.com>
wrote:

> Can we please have the link to the hangout ?
>
> On Tue, Dec 13, 2016 at 11:32 PM, Parth Chandra <par...@apache.org> wrote:
>
> >
> >
>


Re: Drill read EBCDIC format plugin

2016-10-28 Thread Parth Chandra
It isn't possible out of the box, but Drill is extensible and you can write
a format plugin to process EBCDIC data. There is not much documentation on
how to write one,
but org.apache.drill.exec.store.easy.text.TextFormatPlugin is an example of
how a format plugin is written.



On Fri, Oct 28, 2016 at 4:30 PM, Anton Kravchenko <
kravchenko.anto...@gmail.com> wrote:

> Hi there,
>
> Just curious, is it feasible at all for Drill to support reading from
> EBCDIC files? If so, it would have a big value.
>
>
> Thank you,
> Anton
>


Re: Redefining existing data into something like a View

2016-10-24 Thread Parth Chandra
Yes, you can create views [1].
However, Drill does not support CRUD, being primarily for read-only
analytical queries. (The only write operation is 'create table as' [2].)





[1] https://drill.apache.org/docs/create-view/
[2] http://drill.apache.org/docs/create-table-as-ctas/
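
For example (the schema, paths, and names below are placeholders):

  CREATE VIEW dfs.tmp.`customer_orders_v` AS
  SELECT c.name, o.total
  FROM dfs.`/data/customers` c
  JOIN dfs.`/data/orders` o ON c.id = o.customer_id;

  CREATE TABLE dfs.tmp.`customer_orders` AS
  SELECT * FROM dfs.tmp.`customer_orders_v`;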

On Sun, Oct 23, 2016 at 11:47 PM, steffen schuler  wrote:

> Hi Drill-Users,
>
> I have a fundamental question about refining existing data structures with
> drill.
>
> Is it possible with drill to (re-)define something like a „View“
> (combination of existing
> data tables) and have all of the CRUD functionality available against this
> new „View“?
>
> Any hint is welcome : )
>
> Kind Regards,
>
> Steffen Schuler


Re: [HANGOUT] Topics for 10/04/16

2016-10-05 Thread Parth Chandra
Yup. I agree that we need to make sure that both clients are in sync. I
believe DRILL-4280's PR refers to making changes in both APIs as well.

Do you have a sense of how these changes give us a performance boost? As
far as I can see, the APIs result in nearly the same code path being
executed, with the difference being that the limit 0 query is now submitted
by the server instead of the client.

I don't know much about the tweaking of performance for various BI tools;
is there something that Tableau et al do different? I don't see how, since
the the ODBC/JDBC interface remains the same. Just trying to understand
this.

Anyway, any performance gain is wonderful. Do you have any numbers to share?


On Tue, Oct 4, 2016 at 10:29 AM, Jacques Nadeau <jacq...@dremio.com> wrote:

> Both the C++ and the JDBC changes are updates that leverage a number of
> pre-existing APIs already on the server. Our initial evaluations, we have
> already seen substantially improved BI tool performance with the proposed
> changes (with no additional server side changes). Are you seeing something
> different? If you haven't yet looked at the changes in that light, I
> suggest you do.
>
> If anything, I'm more concerned about client feature proposals that don't
> cover both the C++ and Java client. For example, I think we should be
> cautious about merging something like DRILL-4280. We should be cautious
> about introducing new server APIs unless there is a concrete plan around
> support in all clients.
>
> So I agree with the spirit of your ask: change proposals should be
> "complete". However, I don't think it reasonably applies to the changes
> proposed by Laurent. His changes "complete" the already introduced metadata
> and prepare apis the server exposes. It provides an improved BI user
> experience. It also introduces unit tests in the C++ client, something that
> was previously sorely missing.
>
>
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Tue, Oct 4, 2016 at 9:47 AM, Parth Chandra <pchan...@maprtech.com>
> wrote:
>
> > Hi guys,
> >
> >   I won't be able to join the hangout but it would be good to discuss the
> > plan for the related backend changes.
> >
> >   As I mentioned before I would like to see a concrete proposal for the
> > backend that will accompany these changes. Without that, I feel there is
> no
> > point to adding so much new code.
> >
> > Thanks
> >
> > Parth
> >
> >
> > On Mon, Oct 3, 2016 at 7:52 PM, Laurent Goujon <laur...@dremio.com>
> wrote:
> >
> > > Hi,
> > >
> > > I'm currently working on improving metadata support for both the JDBC
> > > driver and the C++ connector, more specifically the following JIRAs:
> > >
> > > DRILL-4853: Update C++ protobuf source files
> > > DRILL-4420: Server-side metadata and prepared-statement support for C++
> > > connector
> > > DRILL-4880: Support JDBC driver registration using ServiceLoader
> > > DRILL-4925: Add tableType filter to GetTables metadata query
> > > DRILL-4730: Update JDBC DatabaseMetaData implementation to use new
> > Metadata
> > > APIs
> > >
> > > I  already opened multiple pull requests for those (the list is
> available
> > > at https://github.com/apache/drill/pulls/laurentgo)
> > >
> > > I'm planning to join tomorrow hangout in case people have questions
> about
> > > those.
> > >
> > > Cheers,
> > >
> > > Laurent
> > >
> > > On Mon, Oct 3, 2016 at 10:28 AM, Subbu Srinivasan <
> > ssriniva...@zscaler.com
> > > >
> > > wrote:
> > >
> > > > Can we close on https://github.com/apache/drill/pull/518 ?
> > > >
> > > > On Mon, Oct 3, 2016 at 10:27 AM, Sudheesh Katkam <
> sudhe...@apache.org>
> > > > wrote:
> > > >
> > > > > Hi drillers,
> > > > >
> > > > > Our bi-weekly hangout is tomorrow (10/04/16, 10 AM PT). If you have
> > any
> > > > > suggestions for hangout topics, you can add them to this thread. We
> > > will
> > > > > also ask around at the beginning of the hangout for topics.
> > > > >
> > > > > Thank you,
> > > > > Sudheesh
> > > > >
> > > >
> > >
> >
>


Re: [HANGOUT] Topics for 10/04/16

2016-10-04 Thread Parth Chandra
Hi guys,

  I won't be able to join the hangout but it would be good to discuss the
plan for the related backend changes.

  As I mentioned before I would like to see a concrete proposal for the
backend that will accompany these changes. Without that, I feel there is no
point to adding so much new code.

Thanks

Parth


On Mon, Oct 3, 2016 at 7:52 PM, Laurent Goujon  wrote:

> Hi,
>
> I'm currently working on improving metadata support for both the JDBC
> driver and the C++ connector, more specifically the following JIRAs:
>
> DRILL-4853: Update C++ protobuf source files
> DRILL-4420: Server-side metadata and prepared-statement support for C++
> connector
> DRILL-4880: Support JDBC driver registration using ServiceLoader
> DRILL-4925: Add tableType filter to GetTables metadata query
> DRILL-4730: Update JDBC DatabaseMetaData implementation to use new Metadata
> APIs
>
> I  already opened multiple pull requests for those (the list is available
> at https://github.com/apache/drill/pulls/laurentgo)
>
> I'm planning to join tomorrow hangout in case people have questions about
> those.
>
> Cheers,
>
> Laurent
>
> On Mon, Oct 3, 2016 at 10:28 AM, Subbu Srinivasan  >
> wrote:
>
> > Can we close on https://github.com/apache/drill/pull/518 ?
> >
> > On Mon, Oct 3, 2016 at 10:27 AM, Sudheesh Katkam 
> > wrote:
> >
> > > Hi drillers,
> > >
> > > Our bi-weekly hangout is tomorrow (10/04/16, 10 AM PT). If you have any
> > > suggestions for hangout topics, you can add them to this thread. We
> will
> > > also ask around at the beginning of the hangout for topics.
> > >
> > > Thank you,
> > > Sudheesh
> > >
> >
>


Hangout starting now

2016-09-06 Thread Parth Chandra
Please join us ...

Hangout link -
https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc


Topics for next hangout (2016-09-06) ...

2016-09-02 Thread Parth Chandra
Hi everyone,

  The next hangout is on Sept 6 (after the long weekend here). If there are
any topics folks wish to discuss, please let me know so that others who
might be interested can also join. You can always join and bring up new
topics at the last minute.

  I have a couple of topics to discuss.
  1) Release cadence - review the release cadence based on the experience
of recent releases
  2) Design documents - some larger features being worked on have used a
common template that makes it easier to review the proposal as well as
document it once the feature is done. I'd like to solicit some feedback on
that.

Thanks

Parth


Re: Drill UDF Listing

2016-08-25 Thread Parth Chandra
Hi Charles, thanks for starting this.

Here are some that Ted Dunning wrote and posted a while ago -
https://github.com/mapr-demos/simple-drill-functions



On Thu, Aug 25, 2016 at 9:48 AM, Charles Givre  wrote:

> Hello everyone,
> I’ve decided to create a list of Drill UDFs that are out there.  I’m not
> hosting anything, just posting links to UDFs that I’ve verified actually
> work.  Here is the link: http://thedataist.com/drill-udfs/  and if anyone
> has others, please send me the info and I’ll put it up.
> Thanks,
> — Charles
>
>
>


Re: ODBC Connection, MySQL, and "T LIMIT 0"

2016-08-25 Thread Parth Chandra
That's a bug all right. Could you log a JIRA for this along with all the
information you can add (including this log output)?


On Wed, Aug 24, 2016 at 2:26 PM, Christopher Altman <calt...@emcien.com>
wrote:

> Here is the log file:
>
> 2016-08-24 21:22:46,910 [2841efd8-fe6f-38dc-b4cc-a01204b141e0:foreman]
> INFO  o.a.drill.exec.work.foreman.Foreman - Query text for query id
> 2841efd8-fe6f-38dc-b4cc-a01204b141e0: SELECT * FROM (SELECT * FROM
> mysql.scan_test_data.acme_sales) T LIMIT 0
> 2016-08-24 21:22:47,396 [2841efd8-fe6f-38dc-b4cc-a01204b141e0:foreman]
> ERROR o.a.drill.exec.work.foreman.Foreman - SYSTEM ERROR:
> NullPointerException
>
>
> [Error Id: f9661104-b930-4a78-b290-9c2aeda49807 on 10.0.1.164:31010]
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR:
> NullPointerException
>
>
> [Error Id: f9661104-b930-4a78-b290-9c2aeda49807 on 10.0.1.164:31010]
> at org.apache.drill.common.exceptions.UserException$
> Builder.build(UserException.java:543) ~[drill-common-1.7.0.jar:1.7.0]
> at 
> org.apache.drill.exec.work.foreman.Foreman$ForemanResult.close(Foreman.java:791)
> [drill-java-exec-1.7.0.jar:1.7.0]
> at 
> org.apache.drill.exec.work.foreman.Foreman.moveToState(Foreman.java:901)
> [drill-java-exec-1.7.0.jar:1.7.0]
> at org.apache.drill.exec.work.foreman.Foreman.run(Foreman.java:271)
> [drill-java-exec-1.7.0.jar:1.7.0]
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> [na:1.7.0_111]
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> [na:1.7.0_111]
> at java.lang.Thread.run(Thread.java:745) [na:1.7.0_111]
> Caused by: org.apache.drill.exec.work.foreman.ForemanException:
> Unexpected exception during fragment initialization: null
> ... 4 common frames omitted
> Caused by: java.lang.NullPointerException: null
> at org.apache.drill.exec.planner.sql.handlers.FindLimit0Visitor$
> FindHardDistributionScans.visit(FindLimit0Visitor.java:262)
> ~[drill-java-exec-1.7.0.jar:1.7.0]
> at org.apache.calcite.rel.core.TableScan.accept(TableScan.java:166)
> ~[calcite-core-1.4.0-drill-r11.jar:1.4.0-drill-r11]
> at 
> org.apache.calcite.rel.RelShuttleImpl.visitChild(RelShuttleImpl.java:53)
> ~[calcite-core-1.4.0-drill-r11.jar:1.4.0-drill-r11]
> at 
> org.apache.calcite.rel.RelShuttleImpl.visitChildren(RelShuttleImpl.java:68)
> ~[calcite-core-1.4.0-drill-r11.jar:1.4.0-drill-r11]
> at 
> org.apache.calcite.rel.RelShuttleImpl.visit(RelShuttleImpl.java:126)
> ~[calcite-core-1.4.0-drill-r11.jar:1.4.0-drill-r11]
> at 
> org.apache.calcite.rel.AbstractRelNode.accept(AbstractRelNode.java:256)
> ~[calcite-core-1.4.0-drill-r11.jar:1.4.0-drill-r11]
> at 
> org.apache.calcite.rel.RelShuttleImpl.visitChild(RelShuttleImpl.java:53)
> ~[calcite-core-1.4.0-drill-r11.jar:1.4.0-drill-r11]
> at 
> org.apache.calcite.rel.RelShuttleImpl.visitChildren(RelShuttleImpl.java:68)
> ~[calcite-core-1.4.0-drill-r11.jar:1.4.0-drill-r11]
> at 
> org.apache.calcite.rel.RelShuttleImpl.visit(RelShuttleImpl.java:126)
> ~[calcite-core-1.4.0-drill-r11.jar:1.4.0-drill-r11]
> at 
> org.apache.calcite.rel.AbstractRelNode.accept(AbstractRelNode.java:256)
> ~[calcite-core-1.4.0-drill-r11.jar:1.4.0-drill-r11]
> at org.apache.drill.exec.planner.sql.handlers.FindLimit0Visitor.
> containsLimit0(FindLimit0Visitor.java:129) ~[drill-java-exec-1.7.0.jar:1.
> 7.0]
> at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.
> convertToDrel(DefaultSqlHandler.java:259) ~[drill-java-exec-1.7.0.jar:1.
> 7.0]
> at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.
> convertToDrel(DefaultSqlHandler.java:286) ~[drill-java-exec-1.7.0.jar:1.
> 7.0]
> at org.apache.drill.exec.planner.sql.handlers.
> DefaultSqlHandler.getPlan(DefaultSqlHandler.java:168)
> ~[drill-java-exec-1.7.0.jar:1.7.0]
> at 
> org.apache.drill.exec.planner.sql.DrillSqlWorker.getPlan(DrillSqlWorker.java:94)
> ~[drill-java-exec-1.7.0.jar:1.7.0]
> at org.apache.drill.exec.work.foreman.Foreman.runSQL(Foreman.java:978)
> [drill-java-exec-1.7.0.jar:1.7.0]
> at org.apache.drill.exec.work.foreman.Foreman.run(Foreman.java:257)
> [drill-java-exec-1.7.0.jar:1.7.0]
> ... 3 common frames omitted
>
> Please let me know if you meant a different log.
>
>
>
> Thank you,
> Chris Altman
>
>
>
> > On Aug 24, 2016, at 5:11 PM, Parth Chandra <pchan...@maprtech.com>
> wrote:
> >
> > I think the ODBC driver encapsulates the original query in a LIMIT 0
> query
> > to determine the types of the

Re: How to find a jar pentaho:mondrian-data-foodmart-json:0.3.2

2016-08-24 Thread Parth Chandra
Hi Satish,
  You appear to be on 1.0.0-m2-incubating-SNAPSHOT which seems like a
really old snapshot. Would you like to use the latest? You can find it at
[1]

Parth

[1] https://drill.apache.org/download/



On Mon, Aug 22, 2016 at 11:04 PM, Satish Londhe <
lsat...@findabilitysciences.com> wrote:

> Hello sir ,
>
> it gives a error of
>
> [ERROR] Failed to execute goal on project drill-java-exec: Could not
> resolve dependencies for project
> org.apache.drill.exec:drill-java-exec:jar:1.0.0-m2-incubating-SNAPSHOT:
> Could not find artifact pentaho:mondrian-data-foodmart-json:jar:0.3.2 in
> conjars (http://conjars.org/repo) -> [Help 1]
> [ERROR]
>
>
>
> how to tackle this problem  ?
>
> please do the needful.
>


Re: about the way to load customized data

2016-08-22 Thread Parth Chandra
DRILL-3149 (and the related DRILL-4746) fix issues with reading multi-byte
delimiters.
You can either build from master or wait for the 1.8 release (due out soon,
as we are in the release process).

HTH.



On Mon, Aug 22, 2016 at 12:32 AM,  wrote:

> Hi,
>
> I'd like to try drill in my project. But the source data has quite strange
> format which delimited by "|^|" and has line end "|!|". the example likes
> below.
>
> Aaa|^|bbb|^|ccc|!|
>
> So what's the best way to accomplish the data loading? Shall I extend the
> storage plugin.
>
> Thanks & Kind regards
> Shawn
>


Re: Suggestions for hangout topics for 08/09

2016-08-09 Thread Parth Chandra
Copy of Mehant's doc here:

https://docs.google.com/document/d/1IsfU3JwW8VF8Zyra7FIRoGrvyDe87iFJHTXGLbpD36U/edit?usp=sharing



On Tue, Aug 9, 2016 at 11:15 AM, Gautam Parai  wrote:

> Minutes from the hangout
>
> Attendees: Alok, Aman, Arina, Dave, Jason, Paul, Subbu, Vitalii, Zelaine,
> Padma, Jinfeng, Parth, Gautam
>
> 1. 1.8 RELEASE
> 4836 - Regression from 1.4 (February) Pull req open. Trying to fix in 1.8.
> Sudheesh will review it.
> 4766 - PR from Hakim. Is it a regression? If no, then not blocking the
> release.
>
> 2. DRILL-4653
> Set options not transparent - ses/sys by admin but not by user. JSON
> exactly consistent data - while aggregating
> How many records were dropped would not be known.
> Jason -
> JSON parsing error - skips the rest of the file. how to handle it?
> Aggregated warning - good but not must.
> TODO: Verify if doc changes needed NOT behavior changes. Will look at the
> PR to discuss it.
> Paul-
> Skipping records a problem - can't be enabled by default.
> Aman  -
> How many different JSON files was it tested with? Subbu - Few testcases by
> injecting failures at different locations.
> Subbu -
> TODO: Need to add more unit tests.
>
> 3. DRILL-4704
> Dave -
> Seen similar issues
> Decimal varying width format -> Parquet issues 4184 - Pull request open
> Select * from Parquet where emp_id=100 - Issues with Cast int to decimal
> without precision - Pull request 517
> DRILL-4834 - decimal implementation is vulnerable to overflow errors - To
> address issues and simplify implementation
> Aman -
> Good idea to look at previous design
> TODO: Parth/Aman Attach - Mehant's doc to JIRA
>
> Please correct me if I missed anything.
>
> Gautam
>
> On Tue, Aug 9, 2016 at 9:59 AM, Gautam Parai  wrote:
>
> > The hangout will start shortly. Here is the link:
> > https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc
> >
> > On Mon, Aug 8, 2016 at 12:47 PM, Jason Altekruse 
> wrote:
> >
> >> Yeah, I can join the hangout tomorrow to talk about the PR, thanks for
> the
> >> heads up.
> >>
> >> Jason Altekruse
> >> Software Engineer at Dremio
> >> Apache Drill Committer
> >>
> >> On Mon, Aug 8, 2016 at 12:09 PM, Zelaine Fong 
> wrote:
> >>
> >> > Jason -- will you be able to join tomorrow's hangout, since you had
> >> raised
> >> > questions about Subbu's pull request?
> >> >
> >> > -- Zelaine
> >> >
> >> >
> >> > On Mon, Aug 8, 2016 at 11:33 AM, Gautam Parai 
> >> wrote:
> >> >
> >> >> Tomorrow's hangout is scheduled for 10AM - 11AM PST
> >> >>
> >> >> On Mon, Aug 8, 2016 at 11:30 AM, Subbu Srinivasan <
> >> >> ssriniva...@zscaler.com>
> >> >> wrote:
> >> >>
> >> >> > What time is tomorrow's mtg scheduled for?
> >> >> >
> >> >> >
> >> >> > On Mon, Aug 8, 2016 at 10:48 AM, Gautam Parai  >
> >> >> wrote:
> >> >> >
> >> >> > > If you have any suggestions for Drill hangout topics for
> tomorrow,
> >> >> you
> >> >> > can
> >> >> > > add it to this thread.  We will also ask around at the beginning
> of
> >> >> the
> >> >> > > hangout for any topics.  We will try to cover whatever possible
> >> during
> >> >> > the
> >> >> > > 1 hr.
> >> >> > >
> >> >> > > Topics:
> >> >> > >   1.  DRILL-4653:  Malformed JSON should not stop the entire
> query
> >> >> from
> >> >> > > progressing.
> >> >> > >Discussion about the PR.
> >> >> > >
> >> >> >
> >> >>
> >> >
> >> >
> >>
> >
> >
>


Re: Initial Feed Back on 1.7.0 Release

2016-07-05 Thread Parth Chandra
You might have run the two queries while the cache was still being built.
There is no concurrency control for the metadata cache at the moment (one
of the many improvements we need to make).
For metadata caching, the best practice with the current implementation is
to run a manual REFRESH TABLE METADATA command at the top-level directory after
adding any data.
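
For example, assuming a table directory at /data/mytable (the path here is
only an illustration), the manual refresh would be:

    REFRESH TABLE METADATA dfs.`/data/mytable`;

Queries against that directory tree can then use the rebuilt cache until the
underlying data changes again.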



On Tue, Jul 5, 2016 at 10:20 AM, Abdel Hakim Deneche 
wrote:

> Actually, I slightly misunderstood your 2nd question: so you made some
> changes to a subfolder, then run query A that caused the cache to refresh,
> then you run another query B that also caused the cache to refresh, the
> finally query C actually seemed to use the cache as it is.
>
> Is my understanding now correct ? are queries A and B exactly the same or
> different ?
>
> On Tue, Jul 5, 2016 at 10:13 AM, rahul challapalli <
> challapallira...@gmail.com> wrote:
>
> > John,
> >
> > Once you add/update data in one of your sub-folders, the immediate next
> > query should update the metadata cache automatically and all subsequent
> > queries should fetch metadata from the cache. If this is not the case,
> its
> > a bug. Can you confirm your findings?
> >
> > - Rahul
> >
> > On Tue, Jul 5, 2016 at 9:53 AM, John Omernik  wrote:
> >
> > > Hey Abdel, thanks for the response..  on questions 1 and 2, from what I
> > > understood, nothing was changed, but then I had to make the third query
> > for
> > > it to take.  I'll keep observing to determine what that may be.
> > >
> > > On 3, a logical place to implement, or start implementing incremental
> may
> > > be allowing a directories refresh automatically update the parents data
> > > without causing a cascading (update everything) refresh.  So if if I
> > have a
> > > structure like this:
> > >
> > > mytable
> > > ...dir0=2016-06-06
> > > ...dir1=23
> > >
> > > (basically table, days, hours)
> > >
> > > that if I update data in hour 23, it would update 2016-06-06 with the
> new
> > > timestamps and update mytable with the new timestamps.  The only issue
> > > would be figuring out a way to take a lock. (Say you had multiple loads
> > > happening, you want to ensure that one days updates don't clobber
> another
> > > days)
> > >
> > > Just a thought on that.
> > >
> > > Yep, the incremental issue would come into play here.  Are there any
> > design
> > > docs or JIRAs on the incremental updates to metadata?
> > >
> > > Thanks for your reply.  I am looking forward other dev's thoughts on
> your
> > > answer to 3 as well.
> > >
> > > Thanks!
> > >
> > > John
> > >
> > >
> > > On Tue, Jul 5, 2016 at 11:05 AM, Abdel Hakim Deneche <
> > > adene...@maprtech.com>
> > > wrote:
> > >
> > > > answers inline.
> > > >
> > > > On Tue, Jul 5, 2016 at 8:39 AM, John Omernik 
> wrote:
> > > >
> > > > > Working with the 1.7.0, the feature that I was very interested in
> was
> > > the
> > > > > fixing of the Metadata Caching while using user impersonation.
> > > > >
> > > > > I have a large table, with a day directory that can contain up to
> > 1000
> > > > > parquet files each.
> > > > >
> > > > >
> > > > > Planning was getting terrible on this table as I added new data,
> and
> > > the
> > > > > metadata cache wasn't an option for me because of impersonation.
> > > > >
> > > > > Well now will 1.7.0 that's working, and it makes a HUGE
> difference. A
> > > > query
> > > > > that would take 120 seconds now takes 20 seconds.   Etc.
> > > > >
> > > > > Overall, this is a great feature and folks should look into it for
> > > > > performance of large Parquet tables.
> > > > >
> > > > > Some observations that I would love some help with.
> > > > >
> > > > > 1. Drill "Seems" to know when a new subdirectory was added and it
> > > > generates
> > > > > the metadata for that directory with the missing data. This is
> > without
> > > > > another REFRESH TABLE METADATA command.  That works great for new
> > > > > directories, however, what happens if you just copy new files into
> an
> > > > > existing directory? Will it use the metadata cache that only lists
> > the
> > > > old
> > > > > files. or will things get updated? I guess, how does it know things
> > are
> > > > in
> > > > > sync?
> > > > >
> > > >
> > > > When you query folder A that contains metadata cache, Drill will
> check
> > > all
> > > > it's sub-directories' last modification time to figure out if
> anything
> > > > changed since last time the metadata cache was refreshed. If data was
> > > > added/removed, Drill will refresh the metadata cache for folder A.
> > > >
> > > >
> > > > > 2.  Pertaining to point 1, when new data was added, the first query
> > > that
> > > > > used that directory partition, seemed to write the metadata file.
> > > > However,
> > > > > the second query ran ALSO rewrote the file (and it ran with the
> speed
> > > of
> > > > an
> > > > > uncached directory).  However, the third query was now running at
> > > cached
> > > > 

Re: missing data in json structure when using web / api

2016-07-05 Thread Parth Chandra
As John hinted, a session is not maintained by the UI/REST API unless
impersonation is enabled, so your ALTER SESSION commands will have no
effect on the query.
That does not explain why you are not getting full results, though. Is it
possible that the query is hitting an error because your session options
are not taking effect, and that error is not being reported correctly?
I'm speculating that partial results are returned, then the query hits a
schema change exception (perhaps because all_text_mode is not enabled), and
that causes early termination of the query.
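
If you want to test that theory, one option (untested here) is to set the
JSON reading mode at the system level, since system options apply even when
the REST API does not keep a session:

    ALTER SYSTEM SET `store.json.all_text_mode` = true;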




On Fri, Jul 1, 2016 at 4:23 PM, Scott Kinney  wrote:

> Not that I know of but I'm new to drill.
> I've done 'alter system' for json all_text_mode & read_numbers_as_double.
> Do you know of a setting that might cause something like this?
>
> 
> Scott Kinney | DevOps
> stem   |   m  510.282.1299
> 100 Rollins Road, Millbrae, California 94030
>
> This e-mail and/or any attachments contain Stem, Inc. confidential and
> proprietary information and material for the sole use of the intended
> recipient(s). Any review, use or distribution that has not been expressly
> authorized by Stem, Inc. is strictly prohibited. If you are not the
> intended recipient, please contact the sender and delete all copies. Thank
> you.
>
> 
> From: John Omernik 
> Sent: Friday, July 01, 2016 4:06 PM
> To: user@drill.apache.org
> Subject: Re: missing data in json structure when using web / api
>
> Are you using options that are maintained in the cli but not the rest API
> due to a lack of impersonation?
>
> On Friday, July 1, 2016, Scott Kinney  wrote:
>
> > When i query from sqlline i can see all the data, very complicated /
> > nested json structure but when i query with the api or the web ui a lot
> of
> > the data is missing.
> >
> > ?
> >
> >
> > 
> > Scott Kinney | DevOps
> > stem    |   m  510.282.1299
> > 100 Rollins Road, Millbrae, California 94030
> >
> > This e-mail and/or any attachments contain Stem, Inc. confidential and
> > proprietary information and material for the sole use of the intended
> > recipient(s). Any review, use or distribution that has not been expressly
> > authorized by Stem, Inc. is strictly prohibited. If you are not the
> > intended recipient, please contact the sender and delete all copies.
> Thank
> > you.
> >
>
>
> --
> Sent from my iThing
>


Re: Information about ENQUEUED state in Drill

2016-07-01 Thread Parth Chandra
The plan itself may have a hint as to why it took so long. One reason is if
there is a very large number of files and Drill is reading file metadata
for every file during the planning stage. This operation is not distributed
and can sometimes become a bottleneck.
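
One quick way to confirm this (the query and path below are only an
illustration) is to time the planning step by itself:

    EXPLAIN PLAN FOR SELECT count(*) FROM dfs.`/data/mytable`;

If the EXPLAIN alone takes most of the 30 seconds, the time is going into
planning rather than queueing.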


On Fri, Jul 1, 2016 at 10:44 AM, John Omernik <j...@omernik.com> wrote:

> Yes, the planning is taking a long time. That is the issue.
>
> So, when queuing is enabled. "Planning" happens when the status is
> ENQUEUED. IF Queuing is not enabled, "Planning" happens when the status is
> "STARTING".  (Based on my observations).
>
> Are there any good docs, or sources to look at why the planning phase may
> be taking so long?
>
> Thanks!
>
> John
>
>
> On Fri, Jul 1, 2016 at 12:13 PM, Abdel Hakim Deneche <
> adene...@maprtech.com>
> wrote:
>
> > Most likely planing is taking longer to finish. Once it's done, it should
> > move to either ENQUEUED if the queuing was enabled or RUNNING if it was
> > disabled.
> >
> > One easy way to confirm if planing is indeed taking too long is to just
> run
> > a "EXPLAIN PLAN FOR " and see how long it takes to finish.
> >
> > On Fri, Jul 1, 2016 at 6:49 AM, John Omernik <j...@omernik.com> wrote:
> >
> > > Interestingly enough, when I disable queuing, the query sits in the
> > > "STARTING" phase for the same amount of time it would sit in ENQUEUING
> if
> > > queuing was enabled.  Excessive planning?
> > >
> > > When looking at the UI, how can I validate this?
> > >
> > >
> > >
> > > On Fri, Jul 1, 2016 at 8:14 AM, John Omernik <j...@omernik.com> wrote:
> > >
> > > > I don't see that, but here's a question, when it's enqueued, it must
> > have
> > > > to do some level of planning before determining which queue it's
> going
> > to
> > > > fall into ... correct?  I wonder if that planning takes to long, if
> > > that's
> > > > what's causing the enqueued state?
> > > >
> > > >
> > > >
> > > > On Thu, Jun 30, 2016 at 1:09 PM, Parth Chandra <
> pchan...@maprtech.com>
> > > > wrote:
> > > >
> > > >> The queue that the queries are put in is determined by the cost
> > > calculated
> > > >> by the optimizer. So in Qiang's case, it might be that the cost
> > > >> calculation
> > > >> might be causing the query to be put in the large query queue.
> > > >>
> > > >> You can check the cost of the query in the query profile and compare
> > > with
> > > >> the value of the QUEUE_THRESHOLD_SIZE setting (exec.queue.threshold)
> > to
> > > >> see
> > > >> which queue the query is being put in.
> > > >>
> > > >> A single query staying enqueued for 30 seconds sounds really wrong.
> > > >> Putting
> > > >> a query in either queue requires getting a distributed semaphore
> (via
> > > >> zookeeper) and it is possible this is taking too long which is why
> the
> > > >> enqueuing may be taking really long.
> > > >>
> > > >> Do you see any messages in the logs about timeouts while enqueuing?
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> On Thu, Jun 30, 2016 at 6:46 AM, John Omernik <j...@omernik.com>
> > wrote:
> > > >>
> > > >> > Thanks Parth.
> > > >> >  As I stated in, there are no other jobs running in the cluster
> when
> > > >> this
> > > >> > happens.  I do have queueing enabled, however, with no other jobs
> > > >> running,
> > > >> > why would any single job sit in the ENQUEUED state for 30 seconds?
> > > This
> > > >> > seems to be an issue or am I missing something?
> > > >> >
> > > >> > I would really like to use queueing as this is a multi-tenant
> > cluster,
> > > >> so I
> > > >> > don't want to remove it all together.
> > > >> >
> > > >> > John
> > > >> >
> > > >> > On Wed, Jun 29, 2016 at 10:57 PM, qiang li <tiredqi...@gmail.com>
> > > >> wrote:
> > > >> >
> > > >> > > I have the same doult.
> > > >> > >

Re: Parquet Block Size Detection

2016-07-01 Thread Parth Chandra
For metadata, you can use 'parquet-tools dump' and pipe the output to
more/less. The dump command will print the block (aka row group) and
page-level metadata. It will then dump all the data, so be prepared to
cancel when that happens.

Setting dfs.blocksize == parquet.blocksize is a very good idea and is the
general recommendation.

Larger block (i.e., row group) sizes will increase memory use on write. It
may not have a noticeable impact on read memory use as the current Parquet
reader reads data per page.

There are other potential effects of varying parquet block/row group size.
With filter pushdown to the row group level, a smaller row group will have
better chances of being effectively filtered out. This is still being
worked on, but will become a factor at some time.

Note that a Parquet file can have many row groups and can span many nodes,
but as long as a row group is not split across nodes, reader performance
will not suffer.
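
On the Drill side, the row group size used when writing Parquet (for
example via CTAS) is controlled by the store.parquet.block_size option; a
sketch of aligning it with a 512 MB dfs.blocksize (the value is in bytes
and only illustrative) would be:

    ALTER SESSION SET `store.parquet.block_size` = 536870912;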








On Fri, Jul 1, 2016 at 1:09 PM, John Omernik <j...@omernik.com> wrote:

> I am looking forward to the MapR 1.7 dev preview because of the metadata
> user impersonation JIRA fix.   "Drill always writes one row group per
> file." So is this one parquet block?  "row group" is a new term to this
> email :)
>
> On Fri, Jul 1, 2016 at 2:09 PM, Abdel Hakim Deneche <adene...@maprtech.com
> >
> wrote:
>
> > Just make sure you enable parquet metadata caching, otherwise the more
> > files you have the more time Drill will spend reading the metadata from
> > every single file.
> >
> > On Fri, Jul 1, 2016 at 11:17 AM, John Omernik <j...@omernik.com> wrote:
> >
> > > In addition
> > > 7. Generally speaking, keeping number of files low, will help in
> multiple
> > > phases of planning/execution. True/False
> > >
> > >
> > >
> > > On Fri, Jul 1, 2016 at 12:56 PM, John Omernik <j...@omernik.com>
> wrote:
> > >
> > > > I looked at that, and both the meta and schema options didn't provide
> > me
> > > > block size.
> > > >
> > > > I may be looking at parquet block size wrong, so let me toss out some
> > > > observations, and inferences I am making, and then others who know
> the
> > > > spec/format can confirm or correct.
> > > >
> > > > 1. The block size in parquet is NOT file size. A Parquet file can
> have
> > > > multiple blocks in a single file? (Question: when this occurs, do the
> > > > blocks then line up with DFS block size/chunk size as recommended, or
> > do
> > > we
> > > > get weird issues?) In practice, do writes aim for 1 block per file?
> > > > 2. The block size, when writing is computed prior to compression.
> This
> > is
> > > > an inference based on the parquet-mr library.  A job that has a
> parquet
> > > > block size of 384mb seems to average files of around 256 mb in size.
> > > Thus,
> > > > my theory is that the amount of data in parquet block size is
> computed
> > > > prior to write, and then as the file is written compression is
> applied,
> > > > thus ensuring that the block size (and file size if 1 is not true, or
> > if
> > > > you are just writing a single file) will be under the dfs.block size
> if
> > > you
> > > > make both settings the same.
> > > > 3. Because of 2, setting dfs.blocksize = parquet blocksize is a good
> > > rule,
> > > > because the files will always be under the dfsblock size with
> > > compression,
> > > > ensuring you don't have cross block reads happening.  (You don't have
> > to,
> > > > for example, set the parquet block size to be less then dfs block
> size
> > to
> > > > ensure you don't have any weird issues)
> > > > 4.  Also because of 2, with compression enabled, you don't need any
> > slack
> > > > space for file headers or footers to ensure the files don't cross DFS
> > > > blocks.
> > > > 5. In general larger dfs/parquet block sizes will be good for reader
> > > > performance, however, as you start to get larger, write memory
> demands
> > > > increase.  True/False?  In general does a larger block size also put
> > > > pressures on Reader memory?
> > > > 6. Any other thoughts/challenges on block size?  When talking about
> > > > hundreds/thousands of GB of data, little changes in performance like
> > with
> > > > block size can make a difference.  I am really interested in
> > tips/stories
> > > > to he

Re: array in json with mixed values (int and float)

2016-07-01 Thread Parth Chandra
If you mean the REST API, then yes, there is no session maintained unless
impersonation is enabled. Without a session, the ALTER SESSION commands have
no effect.



On Fri, Jul 1, 2016 at 11:58 AM, Scott Kinney <scott.kin...@stem.com> wrote:

> it didn't work when i did an alter session via the api but worked then i
> did and alter system via the repl. I'm guessing each query via the api is a
> session to alter sessions via the api only last for that one call?
>
> Anywho, that did the trick Parth, thank you!
>
>
> 
> Scott Kinney | DevOps
> stem   |   m  510.282.1299
> 100 Rollins Road, Millbrae, California 94030
>
> This e-mail and/or any attachments contain Stem, Inc. confidential and
> proprietary information and material for the sole use of the intended
> recipient(s). Any review, use or distribution that has not been expressly
> authorized by Stem, Inc. is strictly prohibited. If you are not the
> intended recipient, please contact the sender and delete all copies. Thank
> you.
>
> 
> From: Scott Kinney <scott.kin...@stem.com>
> Sent: Friday, July 01, 2016 10:51 AM
> To: user@drill.apache.org
> Subject: Re: array in json with mixed values (int and float)
>
> That looks promising but didn't work.
>
>
> 
> Scott Kinney | DevOps
> stem   |   m  510.282.1299
> 100 Rollins Road, Millbrae, California 94030
>
> This e-mail and/or any attachments contain Stem, Inc. confidential and
> proprietary information and material for the sole use of the intended
> recipient(s). Any review, use or distribution that has not been expressly
> authorized by Stem, Inc. is strictly prohibited. If you are not the
> intended recipient, please contact the sender and delete all copies. Thank
> you.
>
> 
> From: Parth Chandra <pchan...@maprtech.com>
> Sent: Friday, July 01, 2016 10:43 AM
> To: user@drill.apache.org
> Subject: Re: array in json with mixed values (int and float)
>
> I haven't tried this myself, but setting store.json.read_numbers_as_double
> to true might help.
>
>
>
> On Fri, Jul 1, 2016 at 9:27 AM, Scott Kinney <scott.kin...@stem.com>
> wrote:
>
> > When running a query on a json file via the api returns an error that i
> > dont see when running the same query in the REPL.
> >
> > "errorMessage" : "UNSUPPORTED_OPERATION ERROR: In a list of type FLOAT8,
> > encountered a value of type BIGINT. Drill does not support lists of
> > different types.\n\nFile
> >
> /PowerBladeAvahi.1.telemetry/json/telemetry_flatstore_3_2_prod-telemetry-w2-9_1.log-143771.json.gz\nRecord
> > 1\nLine  1\nColumn  502\nField  soc\nFragment 0:0\n\n[Error Id:
> > be38e1c4-b1c0-4d55-9ab1-fe4ebdc44a9e on ops-apachedrill:31010]"
> >
> >
> > I pulled the line out of the file. There is a key 'foo': [ 99, 99.1, 99.8
> > ].
> > Is there a way get drill to handle this? Maybe treat all ints as floats?
> >
> >
> > 
> > Scott Kinney | DevOps
> > stem <http://www.stem.com/>   |   m  510.282.1299
> > 100 Rollins Road, Millbrae, California 94030
> >
> > This e-mail and/or any attachments contain Stem, Inc. confidential and
> > proprietary information and material for the sole use of the intended
> > recipient(s). Any review, use or distribution that has not been expressly
> > authorized by Stem, Inc. is strictly prohibited. If you are not the
> > intended recipient, please contact the sender and delete all copies.
> Thank
> > you.
> >
>


Re: array in json with mixed values (int and float)

2016-07-01 Thread Parth Chandra
I haven't tried this myself, but setting store.json.read_numbers_as_double
to true might help.
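
Something along these lines (again, untried by me):

    ALTER SESSION SET `store.json.read_numbers_as_double` = true;

With that option on, integer values in the JSON are read as doubles, so a
list such as [ 99, 99.1, 99.8 ] no longer mixes BIGINT and FLOAT8.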



On Fri, Jul 1, 2016 at 9:27 AM, Scott Kinney  wrote:

> When running a query on a json file via the api returns an error that i
> dont see when running the same query in the REPL.
>
> "errorMessage" : "UNSUPPORTED_OPERATION ERROR: In a list of type FLOAT8,
> encountered a value of type BIGINT. Drill does not support lists of
> different types.\n\nFile
> /PowerBladeAvahi.1.telemetry/json/telemetry_flatstore_3_2_prod-telemetry-w2-9_1.log-143771.json.gz\nRecord
> 1\nLine  1\nColumn  502\nField  soc\nFragment 0:0\n\n[Error Id:
> be38e1c4-b1c0-4d55-9ab1-fe4ebdc44a9e on ops-apachedrill:31010]"
>
>
> I pulled the line out of the file. There is a key 'foo': [ 99, 99.1, 99.8
> ].
> Is there a way get drill to handle this? Maybe treat all ints as floats?
>
>
> 
> Scott Kinney | DevOps
> stem    |   m  510.282.1299
> 100 Rollins Road, Millbrae, California 94030
>
> This e-mail and/or any attachments contain Stem, Inc. confidential and
> proprietary information and material for the sole use of the intended
> recipient(s). Any review, use or distribution that has not been expressly
> authorized by Stem, Inc. is strictly prohibited. If you are not the
> intended recipient, please contact the sender and delete all copies. Thank
> you.
>


Re: Performance querying a single column out of a parquet file

2016-07-01 Thread Parth Chandra
This has come up in the past in some other context. At the moment though,
there is no JIRA for this.

On Fri, Jul 1, 2016 at 6:10 AM, John Omernik  wrote:

> Hey all, some colleagues are looking at this on Impala (IMPALA-2017)and
> asked if Drill could do this. (Late/Lazy Materialization of columns).
>
> While the performance gain on tables with less columns may not be huge ,
> when you are looking at really wide tables, with disparate date types, this
> can be huge.   For example, on one of my tables, if I do  "select id from
> table where id = 12 and location between 10 and 200" Drill will return in
> 30 seconds. When I run select * from from table where id = 12 and location
> between 10 and 200" and this query is well into 14 minutes of run time.
> That's a huge difference.
>
> Now, the initial answer may be "train user only to select the columns they
> need"  and yes, we will be working on that... HOWEVER as anyone who works
> in infosec knows, user training can be the best there is, and you will get
> people who don't follow the instructions. And, since this is such a intense
> query, those hit or miss queries with select * can then cause a large
> impact on the performance of a cluster.
>
> Do we have a JIRA open on late/lazy materialization of fields in Parquet?
>
> John
>
> On Thu, Apr 14, 2016 at 9:57 AM, Ted Dunning 
> wrote:
>
> > Not quite.
> >
> > With a fix for DRILL_1950, no rows would necessarily be materialized at
> all
> > for the filter columns. Rows would only be materialized for the
> projection
> > columns when the filter matches.
> >
> > In some cases, the pushdown might be implemented by fully materializing
> the
> > values referenced by the filter, but hopefully not.
> >
> >
> > On Thu, Apr 14, 2016 at 1:42 PM, Johannes Zillmann <
> > jzillm...@googlemail.com
> > > wrote:
> >
> > > Ok, thanks for the information!
> > >
> > > Am i right that in case DRILL-1950 would be fixed, Drill would
> > > automatically only materialize only those rows/columns which match the
> > > filter ?
> > >
> > > If not so, would the late materialization you described for the filter
> > > case be possible to implement with the current Hooks/API ?
> > >
> > > Johannes
> > >
> > > > On 11 Apr 2016, at 19:36, Aman Sinha  wrote:
> > > >
> > > > There is a JIRA related to one aspect of this: DRILL-1950 (filter
> > > pushdown
> > > > into parquet scan).  This is still work in progress I believe.  Once
> > that
> > > > is implemented, the scan will produce the filtered rows only.
> > > >
> > > > Regarding column projections, currently in Drill, the columns
> > referenced
> > > > anywhere in the query (including SELECT list) need to be produced by
> > the
> > > > table scan, so the scan will read all those columns, not just the
> ones
> > in
> > > > the filter condition.   You can see what columns are being produced
> by
> > > the
> > > > Scan node from the EXPLAIN plan.
> > > >
> > > > What would help for the SELECT * case is* late materialization of
> > > columns*.
> > > > i.e even if the filter does not get pushed down into scan,  we could
> > read
> > > > only the 'id' column from the table first, do the filtering that
> > > supposedly
> > > > selects 1 row, then do a late materialization of all other columns
> just
> > > for
> > > > that 1 row by using a row-id based lookup (if the underlying storage
> > > format
> > > > supports rowid based lookup).   This would be a feature request..I am
> > not
> > > > sure if a JIRA already exists for it or not.
> > > >
> > > > -Aman
> > > >
> > > > On Mon, Apr 11, 2016 at 9:24 AM, Ted Dunning 
> > > wrote:
> > > >
> > > >> I just replicated these results. Full table scans with aggregation
> > take
> > > >> pretty much exactly the same amount of time with or without
> filtering.
> > > >>
> > > >>
> > > >>
> > > >> On Mon, Apr 11, 2016 at 8:09 AM, Johannes Zillmann <
> > > >> jzillm...@googlemail.com
> > > >>> wrote:
> > > >>
> > > >>> Hey Ted,
> > > >>>
> > > >>> Sorry i mixed up row and column!
> > > >>>
> > > >>> Queries are like that:
> > > >>>(1) "SELECT * FROM dfs.`myParquetFile` WHERE `id` = 23"
> > > >>>(2) "SELECT id FROM dfs.`myParquetFile` WHERE `id` = 23"
> > > >>>
> > > >>> (1) is 14 sec and (2) is 1.5 sec.
> > > >>> Using drill-1.6.
> > > >>> So it looks like Drill is extracting the columns before filtering
> > > which i
> > > >>> didn’t expect…
> > > >>> Is there anyway to change that behaviour ?
> > > >>>
> > > >>> Johannes
> > > >>>
> > > >>>
> > > >>>
> > >  On 11 Apr 2016, at 16:42, Ted Dunning 
> > wrote:
> > > 
> > >  Did you mean that you are doing a select to find a single column?
> > What
> > > >>> you
> > >  typed was row, but that seems out of line with the rest of what
> you
> > > >>> wrote.
> > > 
> > >  If you are truly asking about filtering down to a single row,
> > whether
> > > >> it
> > > 

Re: Parquet Block Size Detection

2016-07-01 Thread Parth Chandra
parquet-tools perhaps?

https://github.com/Parquet/parquet-mr/tree/master/parquet-tools



On Fri, Jul 1, 2016 at 5:39 AM, John Omernik  wrote:

> Is there any way, with Drill or with other tools, given a Parquet file, to
> detect the block size it was written with?  I am copying data from one
> cluster to another, and trying to determine the block size.
>
> While I was able to get the size by asking the devs, I was wondering, is
> there any way to reliably detect it?
>
> John
>


Re: Tricks for Copying Data where Drill is actively querying

2016-06-30 Thread Parth Chandra
Hi John,

I've tried something like the following successfully -

select foo from tablename/*/p_hour=1, and that will read the 'p_hour=1'
subdirectory under every 'p_day=nnn' directory.
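
Spelled out with a workspace and backticks (the workspace and path are
illustrative), the query would look like:

    SELECT foo FROM dfs.`/data/tablename/*/p_hour=1`;

Here the hour filtering is done by the path itself rather than with
dir0/dir1 columns.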


Parth


On Thu, Jun 30, 2016 at 6:43 AM, John Omernik  wrote:

> Vince - That is what I am doing now, using MapR Volumes, I am creating a
> .stage_%epoch% for each file copy. Once the data is fully copied (and no
> longer has the _COPYING_) I do a NFS filesystem mv to the directory it
> actually belongs in.
>
> Now, this is message, and forced me to add more to my ETL.  A couple of
> ideal things
>
> 1. In my view. I could use the select with options feature to add a
> filemask.  I.e. if I am looking at a directory (any directory, not just
> parquet) let me specify a filesystem glob (or fancier, a regex) that would
> allow me to tell Drill, only use these files.  It has to be a select with
> options type thing, because a setting like this should be a on a per table
> basis, not a system or session level options.
>
> 2.  Make Drill smart enough to handle wildcards in directories (in the
> "FROM" definitions)
>
> 3. Allow a global "ignore these files for everything" user configurable
> settings. Drill already does this for hidden files (proceeded with a .) But
> given everyones unique snowflake systems, an admin may have other "always
> ignore this in queries. (hadoop fs client users may specific *._COPYING_ as
> an always exclude. But there may be others)
>
>
>
> On Thu, Jun 30, 2016 at 7:08 AM, Vince Gonzalez 
> wrote:
>
> > I know it doesn't go right to the question of how to make drill ignore
> > things, but could you copy the data into some parallel tree, then rename
> it
> > into the appropriate directory once the copy is done?
> >
> > Or could that still cause a running query to fail?
> >
> > On Thursday, June 30, 2016, John Omernik  wrote:
> >
> > > I am doing query of source data that is two levels deep.
> > >
> > > tablename/p_day=2016-05-01/p_hour=1/file1.parquet
> > >
> > > I wasn't able to get wildcards at that level to work with dir0 etc.
> > >
> > >
> > >
> > >
> > > On Thu, Jun 30, 2016 at 12:39 AM, Ted Dunning  > > > wrote:
> > >
> > > > Does it work to provide a wild card in your source spec?
> > > >
> > > > a la dfs.tdunning.`/user/tdunning/foo/data/*.parquet`
> > > >
> > > > ?
> > > >
> > > >
> > > >
> > > > On Wed, Jun 29, 2016 at 1:06 PM, John Omernik  > > > wrote:
> > > >
> > > > > When the Hadoop FS client copies files (say parquet files) It adds
> a
> > > > > ._COPYING_ at the end of the file until it's complete.  If that's
> > there
> > > > > Drill fails (partial files etc).
> > > > >
> > > > > I know I can ignore files that start with . (or directories) but is
> > > > there a
> > > > > good way to tell Drill to ignore files that are not *.parquet, or
> > that
> > > > have
> > > > > ._COPYING_ at the end of them?
> > > > >
> > > > > Thanks!
> > > > >
> > > > > John
> > > > >
> > > >
> > >
> >
> >
> > --
> >  
> >  Vince Gonzalez
> >  Systems Engineer
> >  212.694.3879
> >
> >  mapr.com
> >
>


Re: Information about ENQUEUED state in Drill

2016-06-29 Thread Parth Chandra
I would guess you have queueing enabled. With queueing enabled, only a
maximum number of queries will actually be running and the rest will wait
in an ENQUEUED state.

There are two queues: one for large queries and one for small queries. You
can change their size with the following parameters -
exec.queue.large
exec.queue.small
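
For example (the numbers are only illustrative), queueing can be enabled
and sized with:

    ALTER SYSTEM SET `exec.queue.enable` = true;
    ALTER SYSTEM SET `exec.queue.small` = 100;
    ALTER SYSTEM SET `exec.queue.large` = 10;
    ALTER SYSTEM SET `exec.queue.threshold` = 30000000;

exec.queue.threshold is the estimated query cost used to decide whether a
query goes to the small or the large queue.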






On Wed, Jun 29, 2016 at 1:51 PM, John Omernik  wrote:

> I have some jobs that will stay in an ENQUEUED state for what I think to be
> an excessive amount of time.  (No other jobs running on the cluster, the
> ENQUEUED state lasted for 30 seconds) . What causes this? Is it planning
> when it's in this state? Any information about this would be helpful.
>
> John
>


Re: Critical Bug with Column Name Clash

2016-06-29 Thread Parth Chandra
This is probably a bug in the jdbc storage plugin. Can you log a JIRA with
this info?

On Wed, Jun 29, 2016 at 8:46 AM, Till Haug  wrote:

>
> Hi guys
>
>
> My company encountered a critical bug in Apache Drill 1.7.0 (and earlier
> versions) and we’re not sure if this is an already known problem.
> If there are two columns with the same name in two different tables, there
> seems to be a conflict.
>
>
> Example 1:
> select t.emp_no as col_one, d.emp_no as col_two
> from mysqlaws.employees.titles as t, mysqlaws.employees.dept_manager as d
> where t.emp_no = d.emp_no
>
>
> Result 1:
> emp_no emp_no0
> 110022 null
> 110022 null
> 110039 null
> …
>
>
> Expected Result 1:
> emp_no emp_no0
> 110022 110022
> 110022 110022
> 110039 110039
> …
>
>
> Example 2:
> select t.from_date as col_one, d.from_date as col_two
> from mysqlaws.employees.titles as t, mysqlaws.employees.dept_manager as d
> where t.emp_no = d.emp_no
>
>
>
>
> Result 2:
> col_one col_two
> 1985-01-01 null
> 1985-01-01 null
> 1991-10-01 null
> …
>
>
> Expected Result 2:
> col_one col_two
> 1985-01-01 1985-01-01
> 1991-10-01 1985-01-01
> 1991-10-01 1991-10-01
> …
>
>
>
>
> In Example 1 there is no rename happening and the col_two is all nulls.
> In Example 2 the rename is happening, but the col_two is still all nulls.
>
>
> When we run these queries directly against the databases (both mssql and
> mysql) they work as expected.
>
>
> If you’d like to directly reproduce it, feel free to use our server we set
> up with the following storage plugin
>
>
> {
>  "type": "jdbc",
>  "driver": "com.mysql.jdbc.Driver",
>  "url": "jdbc:mysql://
> vz-test.cbnbj0e1vrwg.eu-central-1.rds.amazonaws.com:8008",
>  "username": "vz_master",
>  "password": "vzpassword",
>  "enabled": false
> }
>
>
> Thank you and All the Best
> Till
>
>
> Ps: I apologise for submitting this issue first wrongly on the dev mailing
> list.
>
>
>


Re: Drill with mapreduce

2016-06-29 Thread Parth Chandra
Are you using the drill-jdbc-all jar or the drill-jdbc jar? If you're using
the drill-jdbc jar then you will have to include the dependencies in the
classpath and might get other conflicts.

On Tue, Jun 28, 2016 at 11:38 PM, rahul challapalli <
challapallira...@gmail.com> wrote:

> This looks like a bug in the JDBC driver packaging. Can you raise a JIRA
> for the same?
>
> On Tue, Jun 28, 2016 at 9:10 PM, GameboyNO1 <7304...@qq.com> wrote:
>
> > Hi,
> > I'm trying to use drill with mapreduce.
> > Details are:
> > I put a list of drill queries in a file as mapper's input, some to query
> > hbase, some to query qarquet files. Every query is executed in mapper,
> and
> > the query result is sorted in reducer.
> > In mapper, I connect to drill with JDBC, and have problem of hitting Java
> > exception in mapper: NoClassDefFoundError on
> oadd/org/apache/log4j/Logger.
> > Anyone can give some help about how to fix it?
> > And also welcome comments on my solution.
> > Thanks!
> >
> >
> > Alfie
>


Re: [ANNOUNCE] Apache Drill 1.7.0 released

2016-06-29 Thread Parth Chandra
Nice work team!
And thanks, Aman, for managing the release so smoothly.



On Tue, Jun 28, 2016 at 9:14 PM, Aman Sinha  wrote:

> On behalf of the Apache Drill community, I am happy to announce the
> release of Apache Drill 1.7.0.
>
> The source and binary artifacts are available at [1]
> Review a complete list of fixes and enhancements at [2]
>
> This release of Drill fixes many issues and introduces a number of
> enhancements, including JMX enablement for monitoring, support for Hive
> CHAR type and HBase 1.x support.
>
> Thanks to everyone in the community who contributed to this release.
>
> [1] https://drill.apache.org/download/
> [2] https://drill.apache.org/docs/apache-drill-1-7-0-release-notes/
>
>
> -Aman
>


Re: gzipped json files not named .json.gz

2016-06-28 Thread Parth Chandra
Yes, I believe that would work if the file is not compressed.

On Tue, Jun 28, 2016 at 12:01 PM, Scott Kinney <scott.kin...@stem.com>
wrote:

> Well that's a bummer but I believe it setting "defaultInputFormat": "json"
> doesn't seem to have any effect.
>
>
> 
> Scott Kinney | DevOps
> stem   |   m  510.282.1299
> 100 Rollins Road, Millbrae, California 94030
>
> This e-mail and/or any attachments contain Stem, Inc. confidential and
> proprietary information and material for the sole use of the intended
> recipient(s). Any review, use or distribution that has not been expressly
> authorized by Stem, Inc. is strictly prohibited. If you are not the
> intended recipient, please contact the sender and delete all copies. Thank
> you.
>
> 
> From: Parth Chandra <pchan...@maprtech.com>
> Sent: Tuesday, June 28, 2016 11:36 AM
> To: user@drill.apache.org
> Subject: Re: gzipped json files not named .json.gz
>
> Hi Scott,
>
>   Unlikely that this will work without the extension. Drill uses Hadoop's
> CompressionCodecFactory class [1] that infers the compression type from the
> extension.
>
> Parth
>
> [1]
>
> https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/io/compress/CompressionCodecFactory.html#getCodec(org.apache.hadoop.fs.Path)
>
> On Tue, Jun 28, 2016 at 8:47 AM, Scott Kinney <scott.kin...@stem.com>
> wrote:
>
> > Can I have drill open gzipped json files who's names do not end in
> > .json.gz?
> >
> > We have a spark job generating these files and it just dosn't want to
> > change the name or append the .json.gz.
> >
> > ?
> >
> >
> > 
> > Scott Kinney | DevOps
> > stem <http://www.stem.com/>   |   m  510.282.1299
> > 100 Rollins Road, Millbrae, California 94030
> >
> > This e-mail and/or any attachments contain Stem, Inc. confidential and
> > proprietary information and material for the sole use of the intended
> > recipient(s). Any review, use or distribution that has not been expressly
> > authorized by Stem, Inc. is strictly prohibited. If you are not the
> > intended recipient, please contact the sender and delete all copies.
> Thank
> > you.
> >
>


Re: gzipped json files not named .json.gz

2016-06-28 Thread Parth Chandra
Hi Scott,

  Unlikely that this will work without the extension. Drill uses Hadoop's
CompressionCodecFactory class [1] that infers the compression type from the
extension.

Parth

[1]
https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/io/compress/CompressionCodecFactory.html#getCodec(org.apache.hadoop.fs.Path)

On Tue, Jun 28, 2016 at 8:47 AM, Scott Kinney  wrote:

> Can I have drill open gzipped json files who's names do not end in
> .json.gz?
>
> We have a spark job generating these files and it just dosn't want to
> change the name or append the .json.gz.
>
> ?
>
>
> 
> Scott Kinney | DevOps
> stem    |   m  510.282.1299
> 100 Rollins Road, Millbrae, California 94030
>
> This e-mail and/or any attachments contain Stem, Inc. confidential and
> proprietary information and material for the sole use of the intended
> recipient(s). Any review, use or distribution that has not been expressly
> authorized by Stem, Inc. is strictly prohibited. If you are not the
> intended recipient, please contact the sender and delete all copies. Thank
> you.
>


Re: Problem with HDFS Encryption.

2016-06-28 Thread Parth Chandra
Glad you figured it out, and thanks for posting the solution!



On Mon, Jun 27, 2016 at 7:31 PM, Kidong Lee <mykid...@gmail.com> wrote:

> Hi Parth,
>
> thanks for your advice.
>
> I solved the problem.
>
> The problem was that the drill user which run drill did not know
> HADOOP_HOME env, thus drill could not read my hadoop site.xml confs.
> I have exported HADOOP_HOME explicitly with drill user and run drill, it
> works fine!!!
>
> - Kidong.
>
>
>
>
> 2016-06-28 9:58 GMT+09:00 Parth Chandra <pchan...@maprtech.com>:
>
> > You might need to make sure these files are in the Drill classpath. You
> > could create a link to these files (or a copy) in your DRILL_HOME/conf
> > directory and try.
> >
> > On Mon, Jun 27, 2016 at 4:45 PM, Kidong Lee <mykid...@gmail.com> wrote:
> >
> > > Yes, I did. I have also tested my parquet file which reside in the
> > > encryption zone and can be read with hive and parquet tool.
> > >
> > > - kidong
> > >
> > > On Tuesday, June 28, 2016, Parth Chandra <pchan...@maprtech.com> wrote:
> > >
> > > > Hi Kidong,
> > > >
> > > >   I haven't tried this myself, but my guess is that the KMS settings
> > need
> > > > to be provided at the HDFS layer not in the drill storage plugin.
> > > >
> > > >   Specify hadoop.security.key.provider.path in core-site
> > > >
> > > >   Specify dfs.encryption.key.provider.uri  in hdfs-site
> > > >
> > > >   Or did you already do that?
> > > >
> > > > Parth
> > > >
> > > >
> > > > On Mon, Jun 27, 2016 at 1:11 AM, Kidong Lee <mykid...@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I got some problem using drill with HDFS Encryption.
> > > > >
> > > > > With Hive, DFS Storage, I got the errors like this:
> > > > > Error: SYSTEM ERROR: IOException: No KeyProvider is configured,
> > cannot
> > > > > access an encrypted file
> > > > >
> > > > > Even if I have added some confs below to drill storage plugin, the
> > > result
> > > > > is the same:
> > > > >
> > > > > in dfs storage:
> > > > > "config": {
> > > > > "hadoop.security.key.provider.path": "kms://h...@.com
> > > > <javascript:;>;
> > > > > .com:16000/kms",
> > > > > "dfs.encryption.key.provider.uri": "kms://h...@.com
> > > > <javascript:;>;
> > > > > .com:16000/kms"
> > > > >   }
> > > > >
> > > > > in hive storage:
> > > > > "configProps": {
> > > > >   ...
> > > > > "hadoop.security.key.provider.path": "kms://h...@.com
> > > > <javascript:;>;
> > > > > .com:16000/kms",
> > > > > "dfs.encryption.key.provider.uri": "kms://h...@.com
> > > > <javascript:;>;
> > > > > .com:16000/kms"
> > > > >   }
> > > > >
> > > > > I have tested with hive for the tables of the encrypted files on
> > hdfs,
> > > it
> > > > > works fine.
> > > > >
> > > > > Any idea.
> > > > >
> > > > > - Kidong Lee.
> > > > >
> > > >
> > >
> >
>


Re: Problem with HDFS Encryption.

2016-06-27 Thread Parth Chandra
You might need to make sure these files are in the Drill classpath. You
could create a link to these files (or a copy) in your DRILL_HOME/conf
directory and try.

On Mon, Jun 27, 2016 at 4:45 PM, Kidong Lee <mykid...@gmail.com> wrote:

> Yes, I did. I have also tested my parquet file which reside in the
> encryption zone and can be read with hive and parquet tool.
>
> - kidong
>
> On Tuesday, June 28, 2016, Parth Chandra <pchan...@maprtech.com> wrote:
>
> > Hi Kidong,
> >
> >   I haven't tried this myself, but my guess is that the KMS settings need
> > to be provided at the HDFS layer not in the drill storage plugin.
> >
> >   Specify hadoop.security.key.provider.path in core-site
> >
> >   Specify dfs.encryption.key.provider.uri  in hdfs-site
> >
> >   Or did you already do that?
> >
> > Parth
> >
> >
> > On Mon, Jun 27, 2016 at 1:11 AM, Kidong Lee <mykid...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I got some problem using drill with HDFS Encryption.
> > >
> > > With Hive, DFS Storage, I got the errors like this:
> > > Error: SYSTEM ERROR: IOException: No KeyProvider is configured, cannot
> > > access an encrypted file
> > >
> > > Even if I have added some confs below to drill storage plugin, the
> result
> > > is the same:
> > >
> > > in dfs storage:
> > > "config": {
> > > "hadoop.security.key.provider.path": "kms://h...@.com
> > <javascript:;>;
> > > .com:16000/kms",
> > > "dfs.encryption.key.provider.uri": "kms://h...@.com
> > <javascript:;>;
> > > .com:16000/kms"
> > >   }
> > >
> > > in hive storage:
> > > "configProps": {
> > >   ...
> > > "hadoop.security.key.provider.path": "kms://h...@.com
> > <javascript:;>;
> > > .com:16000/kms",
> > > "dfs.encryption.key.provider.uri": "kms://h...@.com
> > <javascript:;>;
> > > .com:16000/kms"
> > >   }
> > >
> > > I have tested with hive for the tables of the encrypted files on hdfs,
> it
> > > works fine.
> > >
> > > Any idea.
> > >
> > > - Kidong Lee.
> > >
> >
>


Re: Problem with HDFS Encryption.

2016-06-27 Thread Parth Chandra
Hi Kidong,

  I haven't tried this myself, but my guess is that the KMS settings need
to be provided at the HDFS layer, not in the Drill storage plugin.

  Specify hadoop.security.key.provider.path in core-site

  Specify dfs.encryption.key.provider.uri  in hdfs-site

  Or did you already do that?

Parth


On Mon, Jun 27, 2016 at 1:11 AM, Kidong Lee  wrote:

> Hi,
>
> I got some problem using drill with HDFS Encryption.
>
> With Hive, DFS Storage, I got the errors like this:
> Error: SYSTEM ERROR: IOException: No KeyProvider is configured, cannot
> access an encrypted file
>
> Even if I have added some confs below to drill storage plugin, the result
> is the same:
>
> in dfs storage:
> "config": {
> "hadoop.security.key.provider.path": "kms://h...@.com;
> .com:16000/kms",
> "dfs.encryption.key.provider.uri": "kms://h...@.com;
> .com:16000/kms"
>   }
>
> in hive storage:
> "configProps": {
>   ...
> "hadoop.security.key.provider.path": "kms://h...@.com;
> .com:16000/kms",
> "dfs.encryption.key.provider.uri": "kms://h...@.com;
> .com:16000/kms"
>   }
>
> I have tested with hive for the tables of the encrypted files on hdfs, it
> works fine.
>
> Any idea.
>
> - Kidong Lee.
>


Re: Enable or Disable multiple storage plugin at a time

2016-06-27 Thread Parth Chandra
You can enable/disable plugins from the GUI. See here for more:
https://drill.apache.org/docs/plugin-configuration-basics/



On Mon, Jun 27, 2016 at 4:03 AM, Sanjiv Kumar  wrote:

> Is it possible to enable or disable multiple storage plugins at a time
> through web console.
>OR Is there any other specific way to do so. ??
>
> --
>  ..
>   Thanks & Regards
>   *Sanjiv Kumar*
>


Re: Memory Settings for a Non-Sorted Failed Query

2016-06-14 Thread Parth Chandra
John, can you log a JIRA and attach all the logs you have?



On Tue, Jun 14, 2016 at 11:43 AM, Parth Chandra <pchan...@maprtech.com>
wrote:

> I can see how the GC errors will cause the world to stop spinning. The GC
> is itself not able to allocate memory which is not a great place to be in.
>
> Sudheesh saw something similar in his branch. @Sudheesh is this possible
> we have a mem-leak in master?
>
>
>
>
> On Tue, Jun 14, 2016 at 11:37 AM, John Omernik <j...@omernik.com> wrote:
>
>> This is what I have thus far... I can provide more complete logs on a one
>> on one basis.
>>
>> The cluster was completely mine, with fresh logs. I ran a CTAS query on a
>> large table that over 100 fields. This query works well in other cases,
>> however I was working with the Block size, both in MapR FS and Drill
>> Parquet. I had successfully tested 512m on each, this case was different.
>> Here are the facts in this setup:
>>
>> - No Compression in MapRFS - Using Standard Parquet Snappy Compression
>> - MapR Block Size 1024m
>> - Parquet Block size 1024m
>> - Query  ends up disappearing in the profiles
>>
>> - The UI page listing bits only show 4 bits however 5 are running (node 03
>> process is running, but no longer in the bit)
>>
>> - Error (copied below)  from rest API
>>
>> - No output in STD out or STD error on node3. Only two nodes actually had
>> "Parquet Writing" logs. The other three on Stdout, did not have any
>> issues/errors/
>>
>> - I have standard log files, gclogs, the profile.json (before it
>> disappeared), and the physical plan.  Only some components that looked
>> possibly at issue included here
>>
>> - The Node 3 GC log shows a bunch of "Full GC Allocation Failures"  that
>> take 4 seconds or more (When I've seen this in other cases, this time can
>> balloon to 8 secs or more)
>>
>> - The node 3 output log show some issues with really long RPC issues.
>> Perhaps the GCs prevent RPC communication and create a snowball loop
>> effect?
>>
>>
>> Other logs if people are interested can be provided upon request. I just
>> didn't want to flood the whole list with all the logs.
>>
>>
>> Thanks!
>>
>>
>> John
>>
>>
>>
>>
>>
>>
>> Rest Error:
>>
>> ./load_day.py 2016-05-09
>>
>> Drill Rest Endpoint: https://drillprod.marathonprod.zeta:2
>> <https://drillprod.marathonprod.zeta.ctu-bo.secureworks.net:2/>
>>
>> Day: 2016-05-09
>>
>> /usr/lib/python2.7/site-packages/urllib3/connectionpool.py:769:
>> InsecureRequestWarning: Unverified HTTPS request is being made. Adding
>> certificate verification is strongly advised. See:
>> https://urllib3.readthedocs.org/en/latest/security.html
>>
>>   InsecureRequestWarning)
>>
>> Authentication successful
>>
>> Error encountered: 500
>>
>> {
>>
>>   "errorMessage" : "SYSTEM ERROR: ForemanException: One more more nodes
>> lost connectivity during query.  Identified nodes were
>> [atl1ctuzeta03:20001].\n\n\n[Error Id:
>> d7dd0120-f7c0-44ef-ac54-29c746b26354
>> on atl1ctuzeta01 <http://atl1ctuzeta01.ctu-bo.secureworks.net:20001/
>> >:20001"
>>
>> }
>>
>>
>> Possible issue in Node3 Log:
>>
>>
>> 2016-06-14 17:25:27,860 [289fc208-7266-6a81-73a1-709efff6c412:frag:1:90]
>> INFO  o.a.d.e.w.f.FragmentStatusReporter -
>> 289fc208-7266-6a81-73a1-709efff6c412:1:90: State to report: RUNNING
>>
>> 2016-06-14 17:25:27,871 [289fc208-7266-6a81-73a1-709efff6c412:frag:1:70]
>> INFO  o.a.d.e.w.fragment.FragmentExecutor -
>> 289fc208-7266-6a81-73a1-709efff6c412:1:70: State change requested
>> AWAITING_ALLOCATION --> RUNNING
>>
>> 2016-06-14 17:25:27,871 [289fc208-7266-6a81-73a1-709efff6c412:frag:1:70]
>> INFO  o.a.d.e.w.f.FragmentStatusReporter -
>> 289fc208-7266-6a81-73a1-709efff6c412:1:70: State to report: RUNNING
>>
>> 2016-06-14 17:43:41,869 [BitServer-4] WARN
>> o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of rpc
>> type
>> 1 took longer than 500ms.  Actual duration was 4192ms.
>>
>> 2016-06-14 17:45:36,720 [CONTROL-rpc-event-queue] INFO
>> o.a.d.e.w.fragment.FragmentExecutor -
>> 289fc208-7266-6a81-73a1-709efff6c412:1:0: State change requested RUNNING
>> --> CANCELLATION_REQUESTED
>>
>> 2016-06-14 17:45:45,740 [CONTROL-rpc-event-queue] INFO
>> o.a.d.e.w.f.FragmentStatusReporter -
>> 289fc

Re: Memory Settings for a Non-Sorted Failed Query

2016-06-13 Thread Parth Chandra
Yes, we can discuss this on the hangout.
You're right, there are two issues -
  Limiting memory usage to a maximum limit should be the goal of every
component. We are not there yet with Drill though.
  Getting an Out of Memory Error and having the Drillbit become
unresponsive is something we should rarely see as either the Drill
allocator or the JVM successfully catch the condition. Can you grep your
logs so we can see if that indeed is what happened?
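A minimal sketch of the kind of grep that helps here (the log location is an example; point it at
your DRILL_LOG_DIR):

# look for direct-memory and heap exhaustion messages across all drillbit logs
grep -iE "outofmemory|out of memory" /var/log/drill/drillbit*.log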



On Mon, Jun 13, 2016 at 4:27 PM, John Omernik <j...@omernik.com> wrote:

> I'd like to talk about that on the hangout.  Drill should do better at
> failing with a clean OOM error rather than having a bit go unresponsive.
> Can just that bit be restarted to return to a copacetic state? As an admin,
> if this is the case, how do I find this bit?
>
> Other than adding RAM, are there any query tuning settings that could help
> prevent the unresponsive bit? ( I see this as two issues, the memory
> settings for the 1024m block size CTAS and the how can we prevent a bit
> from going unresponsive? )
> On Jun 13, 2016 6:19 PM, "Parth Chandra" <pchan...@maprtech.com> wrote:
>
> The only time I've seen a drillbit get unresponsive is when you run out of
> Direct memory. Did you see any 'Out of Memory Error' in your logs? If you
> see those then you need to increase the Direct memory setting for the JVM.
> ( DRILL_MAX_DIRECT_MEMORY in drill-env.sh)
>
>
>
>
> On Mon, Jun 13, 2016 at 4:10 PM, John Omernik <j...@omernik.com> wrote:
>
> > The 512m block size worked.  My issue with the 1024m block size was on
> the
> > writing using a CTAS... that's where my nodes got into a bad state... thus
> > I am wondering what setting on Drill would be the right setting to help
> > node memory pressures on a CTAS using 1024m block size
> > On Jun 13, 2016 6:06 PM, "Parth Chandra" <pchan...@maprtech.com> wrote:
> >
> > In general, you want to make the Parquet block size and the HDFS block
> size
> > the same. A Parquet block size that is larger than the HDFS block size
> can
> > split a Parquet block ( i.e. row_group ) across nodes and that will
> > severely affect performance as data reads will no longer be local. 512 MB
> > is a pretty good setting.
> >
> > Note that you need to ensure the Parquet block size in the source file
> > which (maybe) was produced outside of Drill. So you will need to make the
> > change in the application used to write the Parquet file.
> >
> > If you're using Drill to write the source file as well then, of course,
> the
> > block size setting will be used by the writer.
> >
> > If you're using the new reader, then there is really no knob you have to
> > tweak. Is parquet-tools able to read the file(s)?
> >
> >
> >
> > On Mon, Jun 13, 2016 at 1:59 PM, John Omernik <j...@omernik.com> wrote:
> >
> > > I am doing some performance testing, and per the Impala documentation,
> I
> > am
> > > trying to use a block size of 1024m in both Drill and MapR FS.  When I
> > set
> > > the MFS block size to 512 and the Drill (default) block size I saw some
> > > performance improvements, and wanted to try the 1024 to see how it
> > worked,
> > > however, my query hung and I got into that "bad state" where the nodes
> > are
> > > not responding right and I have to restart my whole cluster (This
> really
> > > bothers me that a query can make the cluster be unresponsive)
> > >
> > > That said, what memory settings can I tweak to help the query work.
> This
> > is
> > > quite a bit of data, a CTAS from Parquet to Parquet, 100-130G of data
> per
> > > day (I am doing a day at a time), 103 columns.   I have to use the
> > > "use_new_reader" option due to my other issues, but other than that I
> am
> > > just setting the block size on MFS and then updating the block size in
> > > Drill, and it's dying. Since this is a simple CTAS (no sort) which
> > settings
> > > can be beneficial for what is happening here?
> > >
> > > Thanks
> > >
> > > John
> > >
> >
>


Re: Memory Settings for a Non-Sorted Failed Query

2016-06-13 Thread Parth Chandra
The only time I've seen a drillbit get unresponsive is when you run out of
Direct memory. Did you see any 'Out of Memory Error' in your logs? If you
see those then you need to increase the Direct memory setting for the JVM.
( DRILL_MAX_DIRECT_MEMORY in drill-env.sh)
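A minimal sketch of the relevant drill-env.sh lines (the sizes are example values only; set them
based on the RAM actually available on each node):

# drill-env.sh
export DRILL_MAX_DIRECT_MEMORY="16G"   # direct memory used by Drill's allocator
export DRILL_HEAP="8G"                 # JVM heap for the drillbit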




On Mon, Jun 13, 2016 at 4:10 PM, John Omernik <j...@omernik.com> wrote:

> The 512m block size worked.  My issue with the 1024m block size was on the
> writing using a CTAS... that's where my nodes got into a bad state... thus
> I am wondering what setting on Drill would be the right setting to help
> node memory pressures on a CTAS using 1024m block size
> On Jun 13, 2016 6:06 PM, "Parth Chandra" <pchan...@maprtech.com> wrote:
>
> In general, you want to make the Parquet block size and the HDFS block size
> the same. A Parquet block size that is larger than the HDFS block size can
> split a Parquet block ( i.e. row_group ) across nodes and that will
> severely affect performance as data reads will no longer be local. 512 MB
> is a pretty good setting.
>
> Note that you need to ensure the Parquet block size in the source file
> which (maybe) was produced outside of Drill. So you will need to make the
> change in the application used to write the Parquet file.
>
> If you're using Drill to write the source file as well then, of course, the
> block size setting will be used by the writer.
>
> If you're using the new reader, then there is really no knob you have to
> tweak. Is parquet-tools able to read the file(s)?
>
>
>
> On Mon, Jun 13, 2016 at 1:59 PM, John Omernik <j...@omernik.com> wrote:
>
> > I am doing some performance testing, and per the Impala documentation, I
> am
> > trying to use a block size of 1024m in both Drill and MapR FS.  When I
> set
> > the MFS block size to 512 and the Drill (default) block size I saw some
> > performance improvements, and wanted to try the 1024 to see how it
> worked,
> > however, my query hung and I got into that "bad state" where the nodes
> are
> > not responding right and I have to restart my whole cluster (This really
> > bothers me that a query can make the cluster be unresponsive)
> >
> > That said, what memory settings can I tweak to help the query work. This
> is
> > quite a bit of data, a CTAS from Parquet to Parquet, 100-130G of data per
> > day (I am doing a day at a time), 103 columns.   I have to use the
> > "use_new_reader" option due to my other issues, but other than that I am
> > just setting the block size on MFS and then updating the block size in
> > Drill, and it's dying. Since this is a simple CTAS (no sort) which
> settings
> > can be beneficial for what is happening here?
> >
> > Thanks
> >
> > John
> >
>


Re: Memory Settings for a Non-Sorted Failed Query

2016-06-13 Thread Parth Chandra
In general, you want to make the Parquet block size and the HDFS block size
the same. A Parquet block size that is larger than the HDFS block size can
split a Parquet block ( i.e. row_group ) across nodes and that will
severely affect performance as data reads will no longer be local. 512 MB
is a pretty good setting.

Note that you need to ensure the Parquet block size in the source file
which (maybe) was produced outside of Drill. So you will need to make the
change in the application used to write the Parquet file.

If you're using Drill to write the source file as well then, of course, the
block size setting will be used by the writer.

If you're using the new reader, then there is really no knob you have to
tweak. Is parquet-tools able to read the file(s)?
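For reference, a minimal sketch of setting the Parquet block size on the Drill side when Drill is
the writer (512 MB shown; the option value is in bytes):

ALTER SESSION SET `store.parquet.block-size` = 536870912;  -- 512 MB, to match a 512 MB HDFS/MFS block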



On Mon, Jun 13, 2016 at 1:59 PM, John Omernik  wrote:

> I am doing some performance testing, and per the Impala documentation, I am
> trying to use a block size of 1024m in both Drill and MapR FS.  When I set
> the MFS block size to 512 and the Drill (default) block size I saw some
> performance improvements, and wanted to try the 1024 to see how it worked,
> however, my query hung and I got into that "bad state" where the nodes are
> not responding right and I have to restart my whole cluster (This really
> bothers me that a query can make the cluster be unresponsive)
>
> That said, what memory settings can I tweak to help the query work. This is
> quite a bit of data, a CTAS from Parquet to Parquet, 100-130G of data per
> day (I am doing a day at a time), 103 columns.   I have to use the
> "use_new_reader" option due to my other issues, but other than that I am
> just setting the block size on MFS and then updating the block size in
> Drill, and it's dying. Since this is a simple CTAS (no sort) which settings
> can be beneficial for what is happening here?
>
> Thanks
>
> John
>


Re: drillbit colocation questions

2016-06-08 Thread Parth Chandra
On Mon, Jun 6, 2016 at 12:02 PM, Wesley Chow  wrote:

> I have some general questions that I've been unable to google. I'm
> particularly interested in co-locating drillbits with nodes in a custom
> store of ours, so I've been poking around in source and searching about for
> examples of this.
>
> 1. My understanding is that Drill understands HDFS and if you co-locate a
> drillbit with a data node, then Drill will automatically distribute queries
> to the drillbits on the nodes that contain the relevant files.
>
> 1a. Where does drill run a join then? On the node that initiated the query,
> or on one of the nodes that contain the data?
>

The join is also distributed. At some point one side of the join may be
broadcast to all the nodes or an exchange operation will distribute data
appropriately across the nodes to get maximum utilization of the cluster.

>
> 1b. Does Drill automatically look up which nodes hold the data in question,
> or is this specified in the query somehow?
>
> Drill will look it up as part of the query planning (in the case of
Parquet file data, this information could be cached in a metadata cache
file)


> 2. Does drill also understand data distribution in HBase? Do queries get
> sent to nodes that contain the HBase rows in question?
>


> 3. We have a custom data store that we'd like to be Drill aware, but want a
> drillbit on the machine itself. Are there any examples of co-locating
> drillbits with non-HDFS data sources?
>
>
Don't have examples to share with you but there is no reason why you cannot
colocate drill with your custom store.



> 4. If we place files on a bunch of different servers and install drillbits
> on each one, and we determine which servers contain which files
> out-of-band, is there a way to submit a query to drill that tells it which
> nodes contain local files to read?
>
>
You won't be able to specify data locality information as part of the
query; this is typically discovered by Drill by calling the storage plugin.
You might need a custom storage plugin for your store.


> Btw, I would be really interested in chatting /drinking with someone who
> nows the Drill code well and is based in NYC.
>
> Thanks,
> Wes
>


Re: Issue with connecting to Drill via ODBC

2016-06-08 Thread Parth Chandra
Can you check your locale setting? Can you set your code page to UTF-8 and
try?
BTW, we haven't really tried the ODBC driver on Debian, and I remember we
had trouble with the Drill client libraries being built correctly on Ubuntu
so, even if you get past this, it might not be possible to run ODBC on
Debian. But please let us know if you are able to get this to work
successfully.
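A minimal sketch of forcing a UTF-8 locale in the shell that loads the ODBC driver (en_US.UTF-8 is
just an example; use any UTF-8 locale installed on your system):

export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8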


On Wed, Jun 8, 2016 at 1:02 PM, marcin kossakowski <
marcin.kossakow...@gmail.com> wrote:

> I'm trying to connect to Drill cluster via ODBC from debian linux.
> Documentation states a requirement is Red Hat, CentOS or SUSE. I tried
> connverting rpm package to deb and was able to connect but when
> submitting query I get parse errors showing query as jumbled
> characters.
>


Re: best approach for complex, several levels of json nesting

2016-06-08 Thread Parth Chandra
Well, your best bets are still JSON and Parquet. Parquet is more compact
and therefore likely to be faster. Internally, Drill will keep nested data
as a nested type and will not flatten it out unless you want it to. Even
with the nested structure, you can refer to individual fields without
having to flatten the data out.
If you want the nested structures to be flattened, then you will need to
use FLATTEN and KVGEN. With multiple levels you will end up with fairly
complex queries as you will need to unravel one level at a time in a
subquery. The usual way people achieve this is by creating views for each
subquery.
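A minimal sketch of that view-per-level pattern, assuming a hypothetical file nested.json whose
rows have an id field and a nested map field named data:

-- level 1: turn the map into key/value rows
CREATE VIEW dfs.tmp.level1 AS
SELECT t.id, FLATTEN(KVGEN(t.data)) AS kv
FROM dfs.`/data/nested.json` t;

-- level 2: unravel the next level of nesting inside each value
CREATE VIEW dfs.tmp.level2 AS
SELECT v.id, v.kv.`key` AS k1, FLATTEN(KVGEN(v.kv.`value`)) AS kv2
FROM dfs.tmp.level1 v;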

On Wed, Jun 8, 2016 at 10:36 AM, Scott Kinney  wrote:

> We have lots a different json structures gzipped in s3 that we want to
> query (currently looking at Drill and Druid). What is the best approach for
> getting this into a queryable format for drill? I tried
> FLATTEN(KVGEN(data)) but since out structures are often nested multiple
> levels this doesn't work. ?We have also converted to parquet but when i run
> drill on a parquet file the structure isn't getting flattened either.
>
> What is the best approach for this situation?
>
> Thanks all,
>
>
>
> 
> Scott Kinney | DevOps
> stem    |   m  510.282.1299
> 100 Rollins Road, Millbrae, California 94030
>
>


Re: [ANNOUNCE] New PMC Chair of Apache Drill

2016-05-25 Thread Parth Chandra
Thanks everyone, and in particular, thank you, Jacques, for making Drill
possible.

On Wed, May 25, 2016 at 3:31 PM, Chunhui Shi <c...@maprtech.com> wrote:

> Big congratulations to Parth!
> Thanks Jacques for founding Drill project and way to go drillers!
>
> Chunhui
>
> On Wed, May 25, 2016 at 11:45 AM, John Omernik <j...@omernik.com> wrote:
>
> > Congratz Parth, and thank you Jacques!
> >
> > On Wed, May 25, 2016 at 1:25 PM, Xiao Meng <xiaom...@gmail.com> wrote:
> >
> > > Big congratulations, Parth!
> > >
> > > And thank you, Jacques, for the leadership and the tremendous
> > contributions
> > > to the community.
> > >
> > > Best,
> > >
> > > Xiao
> > >
> > > On Wed, May 25, 2016 at 8:35 AM, Jacques Nadeau <jacq...@dremio.com>
> > > wrote:
> > >
> > > > I'm pleased to announce that the Drill PMC has voted to elect Parth
> > > Chandra
> > > > as the new PMC chair of Apache Drill. Please join me in
> congratulating
> > > > Parth!
> > > >
> > > > thanks,
> > > > Jacques
> > > >
> > > > --
> > > > Jacques Nadeau
> > > > CTO and Co-Founder, Dremio
> > > >
> > >
> >
>


Re: Hangout Frequency

2016-05-23 Thread Parth Chandra
The overwhelming response (?!) seems to have been to agree to have the
hangout every other week.
So the next hangout will be Tuesday 5/31.

See you all then.

On Fri, May 20, 2016 at 7:34 PM, Aman Sinha <amansi...@apache.org> wrote:

> Every other week sounds good to me.  It is a substantial commitment to do
> one every week.
> Many useful discussions already happen on the dev and user mailing lists.
>
> On Fri, May 20, 2016 at 12:44 PM, Parth Chandra <pchan...@maprtech.com>
> wrote:
>
> > Drill Users, Devs,
> >
> >   Attendance at the hangouts has been getting sparse and it seems like
> the
> > hangouts are too frequent. I'd like to propose that we move to having
> > hangouts every other week.
> >
> >   What do folks think?
> >
> > Parth
> >
>


Hangout Frequency

2016-05-20 Thread Parth Chandra
Drill Users, Devs,

  Attendance at the hangouts has been getting sparse and it seems like the
hangouts are too frequent. I'd like to propose that we move to having
hangouts every other week.

  What do folks think?

Parth


Re: Hangout?

2016-05-17 Thread Parth Chandra
Starting in a minute

https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc



On Tue, May 17, 2016 at 9:59 AM, John Omernik  wrote:

> Is there link today?
>


Hangout starting now

2016-05-10 Thread Parth Chandra
Please join us for the Drill hangout:

https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc


Re: Continued Avro Frustration

2016-04-01 Thread Parth Chandra
+1 on marking Avro experimental.

@Stefan, we have been trying to help you as much as our time permits. I
know that I held up the 1.6 release while Jason fixed the issues that you
brought up. As was said earlier, this is personal time we are spending to
help users in the community, so providing an immediate response to everyone
is difficult. Ultimately, it boils down to the relationships one builds
within the community. Folks with shared goals help each other and everyone
benefits.



On Fri, Apr 1, 2016 at 11:10 AM, Jacques Nadeau  wrote:

> Stefan,
>
> It makes sense to me to mark the Avro plugin experimental. Clearly, there
> are bugs. I also want to note your requirements and expectations haven't
> always been in alignment with what the Avro plugin developers
> built/envisioned (especially around schemas). As part of trying to address
> these gaps, I'd like to ask again for you to provide actual data and tests
> cases so we make sure that the Avro plugin includes those as future test
> cases. (This is absolutely the best way to ensure that the project
> continues to work for your use case.)
>
> The bigger issue I see here is that you expect the community to spend time
> doing what you want. You have already received a lot of that via free
> support and numerous bug fixes by myself, Jason and others. You need to
> remember: this community is run by a bunch of volunteers. Everybody here
> has a day job. A lot of time I spend in the community is at the cost of my
> personal life. For others, it is the same.
>
> This is a good place to ask for help but you should never demand it. If you
> want paid support, I know Ted offered this from MapR and I'm sure if you
> went that route, your issues would get addressed very quickly. If you don't
> want to go that route, then I suggest that you help by creating more
> example data and test cases and focusing on what are the most important
> issues that you need to solve. From there, you can continue to expect that
> people will help you--as they can. There are no guarantees in open source.
> Everything comes through the kindness and shared goals of those in the
> community.
>
> thanks,
> Jacques
>
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Fri, Apr 1, 2016 at 5:43 AM, Stefán Baxter 
> wrote:
>
> > Hi,
> >
> > Is it at all possible that we are the only company trying to use Avro
> with
> > Drill to some serious extent?
> >
> > We continue to come across all sorts of embarrassing shortcomings like
> the
> > one we are dealing with now where a schema change exception is thrown
> even
> > when working with a single Avro file (that has the same schema).
> >
> > Can a non project member call for a discussion on this topic and the
> level
> > of support that is offered for Avro in Drill?
> >
> > My discussion topics would be:
> >
> >- Strange schema validation that ... :
> >... currently fails on single file
> >... prevents dirX variables to work
> >... would require Drill to scan all Avro files to establish schema
> (even
> >when pruning would be used)
> >... would ALWAYS fail for old queries if an old Avro file,
> containing
> >the original fields, was removed and could not be scanned
> >... does not rhyme with the "eliminate ETL" and "Evolving Schema"
> goals
> >of Drill
> >
> >- Simple union types do not work to declare nullable fields
> >
> >- Drill can not read Parquet that is created by parquet-mr-avro
> >
> >- What is the intention for Avro in Drill
> >- Should we select to use some other format to buffer/badge data
> before
> >creating a Parquet file for it?
> >
> >- The culture here regarding talking about boring/hard topics like
> this
> >- Where serious complaints/issues are met with silence
> >- I know full well that my frustration shines through here and that it
> >not helping but this Drill+Avro mess is really getting too much for us
> > to
> >handle
> >
> > Looking forward to discussing this here or during the next hangout.
> >
> > Regards,
> >  -Stefán (or ... mr. old & frustrated)
> >
>


Re: Drill Hangout Starting

2016-03-29 Thread Parth Chandra
Sorry some of us are not able to attend today due to an internal meeting at
work.

On Tue, Mar 29, 2016 at 10:01 AM, Jacques Nadeau  wrote:

> https://plus.google.com/hangouts/_/dremio.com/drillhangout?authuser=0
>
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>


[ANNOUNCE] Apache Drill 1.6.0 released

2016-03-19 Thread Parth Chandra
On behalf of *Apache* *Drill* community, I am happy to *announce* the
release of *Apache* *Drill* 1.6.0.

The source and binary artifacts are available at [1]
Review a complete list of fixes and enhancements at [2]

This release of *Drill* fixes many issues and introduces a number of
enhancements, including inbound impersonation, support for JDK 1.8, and
additional custom window frames.

Thanks to everyone in the community who contributed in this release.

[1] http://drill.apache.org/download/
[2] http://drill.apache.org/docs/apache-drill-1-6-0-release-notes/


Re: The praises for Drill

2016-02-26 Thread Parth Chandra
Welcome back Edmon, and thanks for the praise :). Hope to see you on the
next hangout.

On Thu, Feb 25, 2016 at 7:27 PM, Edmon Begoli  wrote:

> Hello fellow Driilers,
>
> I have been inactive on the development side of the project, as we got busy
> being heavy/power users of the Drill in the last few months.
>
> I just want to share some great experiences with the latest versions of
> Drill.
>
> Just tonight, as we were scrambling to meet the deadline, we were able to
> query two years of flat psv files of claims/billing and clinical data in
> Drill in less than 60 seconds.
>
> No ETL, no warehousing - just plain SQL against tons of files. Run SQL, get
> results.
>
> Amazing!
>
> We have also done some much more important things too, and we had a paper
> accepted to Big Data Services about the experiences. The co-author of the
> paper is Drill's own Dr. Ted Dunning :-)
> I will share it once it is published.
>
> Anyway, cheers to all, and hope to re-join the dev activities soon.
>
> Best,
> Edmon
>


Re: Silly question about ODBC/JDBC Connections

2015-12-12 Thread Parth Chandra
Neither is encrypted. Support for SSL has been discussed, but not been
implemented. I believe it is not too hard to turn on SSL support for JDBC
but ODBC might be a trickier implementation.
There is one related JIRA for this -
https://issues.apache.org/jira/browse/DRILL-2496.

On Tue, Dec 8, 2015 at 11:58 AM, John Omernik  wrote:

> You would think this question would be easier to search for but I
> struggled. Perhaps my googlefu is off to day... I did hear a strange "Bing"
> behind me...
>
>
> Anywho: Are Drill ODBC and JDBC Connections encrypted? If so, just the
> User/Pass exchanges or the whole thing? (data too?)
>
> Thanks!
>
> John
>


Re: Drill Hangout Happening

2015-11-03 Thread Parth Chandra
Hangout at the following link :

https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc


Re: Drill Logging

2015-10-13 Thread Parth Chandra
Logback predefines a variable HOSTNAME if you want to use it anywhere in
your logback.xml

Here's what I have in my logback -



<appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
  <!-- one log file per host, rolled over into up to 10 archived copies -->
  <file>/var/log/drill/drillbit_${HOSTNAME}.log</file>
  <rollingPolicy class="ch.qos.logback.core.rolling.FixedWindowRollingPolicy">
    <fileNamePattern>/var/log/drill/logs/archive/drillbit_${HOSTNAME}.log.%i</fileNamePattern>
    <minIndex>1</minIndex>
    <maxIndex>10</maxIndex>
  </rollingPolicy>
</appender>
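If you set the two log locations through the environment instead, note that the DRILLBIT_LOG_PATH
and DRILLBIT_QUERY_LOG_PATH variables discussed in the quoted thread below are file paths, not
directories. A minimal sketch with example paths:

export DRILLBIT_LOG_PATH="/var/log/drill/drillbit_${HOSTNAME}.log"
export DRILLBIT_QUERY_LOG_PATH="/var/log/drill/drillbit_queries_${HOSTNAME}.json"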
On Mon, Oct 12, 2015 at 6:02 PM, Jacques Nadeau  wrote:

> I don't have a specific answer but the logging is defined in this file:
>
>
> https://github.com/apache/drill/blob/master/distribution/src/resources/logback.xml
>
> You can see that there are two loggers, one named QUERY and one named FILE.
> Log query path and log path come from the two environment variables you've
> identified.
>
> I'm guessing that the issue you're having has something to do with a
> failure to propagate these values.
>
> With regards to hostname in logfile, I believe logback exposes a parameter
> that can be included in the logback.xml, as we've used this before. Maybe
> Kunal or Steven (cc'd) remembers more on this.
>
>
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Mon, Oct 12, 2015 at 6:13 AM, John Omernik  wrote:
>
> > A follow-up, since I am using "runbit"  The line below exists. It sources
> > drill-config.sh which then sources drill-env.sh. In neither of these
> files
> > do I see the variables DRILLBIT_LOG_PATH and DRILLBIT_QUERY_LOG_PATH  so
> > that may be the "file not found" error I am getting? If I set those via
> > export in drill-env.sh, to be directories, then I get "File not found
> > exceptions" but this time it shows the path that I am setting and claims
> > the error is because the "Path" is a directory.   I guess, am I seeing
> some
> > sort of bug due to what is expected by runbit vs. drillbit.sh. Looking at
> > drillbit.sh, it looks like hit has more settings such as DRILL_OUTFILE,
> > DRILL_LOGFILE, and DRILL_QUERY_FILE none of which exist in runbit.
> Should
> > we be having drillbit.sh and runbit be more consistent?
> >
> >
> >
> > exec $JAVA -Dlog.path=$DRILLBIT_LOG_PATH -Dlog.query.path=
> > $DRILLBIT_QUERY_LOG_PATH $DRILL_ALL_JAVA_OPTS -cp $CP
> > org.apache.drill.exec.server.Drillbit
> >
> > On Thu, Oct 8, 2015 at 12:09 PM, John Omernik  wrote:
> >
> > > Hello all -
> > >
> > > I am playing with Drill (I have 1.2 running right now) and I am trying
> to
> > > figure out a way to do some logging in a logical way.  I know I am
> > running
> > > outside the norm, in that I am running my drill bits using marathon on
> > > mesos, but for a moment (unless it's THAT that is breaking my logging
> :)
> > > ignore that fact :)
> > >
> > > I am using MapR, which is nice because it lets me NFS mount my
> > distributed
> > > file system (MapRFS) Thus, when I run a drillbit, I package up the
> > > drill-1.2.0 directory (I am using the MapR packaged version) and then
> in
> > > Marathon I create a app profile like this
> > >
> > > {
> > > "cmd": "./drill-1.2.0/bin/runbit --config
> > > /mapr/zetapoc/mesos/prod/drill/conf.std",
> > > "cpus": 12.0,
> > > "mem": 22528,
> > > "id": "zetadrill",
> > > "instances": 5,
> > > "uris":["hdfs:///mesos/prod/drill/drill-1.2.0.tgz"],
> > > "constraints": [["hostname", "UNIQUE"]]
> > > }
> > >
> > > Pretty basic, for non Mesos folks, it takes the URI (the tgz) downloads
> > it
> > > to the Mesos Sandbox, untars it there, and runs 'runbit' using cgroups.
> > > Thus the memory and cpu specifications.  I am passing it a full path to
> > the
> > > config location which is just the /conf  copied out of the drill-1.2.0
> > > directory and modified. Thus, all bits are pulling from the same conf.
> (I
> > > could package the conf in the tgz, this makes its easy to change a
> config
> > > and not have to repackage the tgz)
> > >
> > > Also, as you can see I am using the constraint to only have one per
> node,
> > > this is handy from a port conflict point of view.
> > >
> > > Ok, that out of the way, when I run it, in my drill-env.sh, I am
> setting:
> > > export DRILL_LOG_DIR="./drill-1.2.0/log"
> > >
> > > What I "think" that means is it should log inside the sandbox. And sure
> > > enough, when I look in the sandbox in mesos, in the log directory,
> there
> > is
> > > a profiles folder and the sys.drill (query logs... I think?)
> > >
> > > Great.
> > >
> > > Now, in the stdout on each node that started drill, I do have the error
> > > below... and additionally, I don't see the drillbit.log and
> drillbit.out
> > > files.  (Is that because of the errors below?) Wouldn't these be in
> > > ./drill-1.2.0/log as well?
> > >
> > > Side note: If I am doing a CTAS to Parquet, one of my nodes starts
> > > spitting out INFO logs to the console like crazy during the write, not
> > sure
> > > what that is all about...
> > >
> > > Ok, back to logging. I am trying to organize this, ideally, I want to
> > > understand logging deeply to meet a number of requirements
> > >
> > > 1.  I'd like all profiles/sql audit logs logged to one place.  Ideally,
> > 

Drill Hangout minutes - 2015-10-06 Re: Drill Hangout starting now

2015-10-06 Thread Parth Chandra
Drill Hangout 2015-10-06

Attendees: Aman, Andries, Daniel, Kris, Charlie, Julien, Jacques, Jason,
Jinfeng, Matt, Parth, Sudheesh, Venki


   1. Matt hitting issues with Information Schema queries against Hive. Will
      connect with Venki on Slack to resolve.
   2. Julien reported that he's working on speeding up building and running
      tests, noting that build-time code generation runs twice and local
      Drillbits for testing take 3 seconds to shut down.
   3. Parth mentioned an off-by-one bug in Parquet reading and that he will
      add more Parquet reading tests as part of the fix.
   4. Aman reported a regression in performance while trying metadata caching
      with 400K files. This is being investigated.
   5. Daniel, Jacques, and Sudheesh discussed issues underlying DRILL-2288,
      such as the ScanBatch.next() return value (IterOutcome) contract, handling
      empty JSON files, handling zero-row sources that still have schemas, how to
      limit the DRILL-2288 fix to avoid needing to rework lots of downstream
      code, etc.
   6. Sudheesh had various updates on Limit 0 and Limit 1 queries. Jacques
      suggested handling Limit 0 queries on schema-aware systems in the planning
      phase. Perf tests on the RPC processing offloading seem to show higher
      memory consumption; this may simply be due to allowing more concurrent
      queries as a result of the patch. Perf tests reveal issues on the local
      data tunnel changes, but these may be existing problems that are now
      showing up as a result of faster local data processing. Question to be
      resolved: should we merge these anyway?
   7. Jason helping address some recent issues with flatten involving a large
      number of repeated values.
   8. We unanimously volunteered Sudheesh to work on the performance cluster.




On Tue, Oct 6, 2015 at 10:06 AM, Parth Chandra <par...@apache.org> wrote:

>
>
> Join us here:
>> https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc
>>
>
>


Hangout happening now!

2015-09-08 Thread Parth Chandra
Come join the Drill community as we discuss what has been happening lately
and what is in the pipeline. All are welcome, if you know about Drill, want
to know more or just want to listen in.

Link: https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc


Re: xml files with Drill

2015-09-04 Thread Parth Chandra
Agree with Jason that it would be useful to know what your use case is. I'm
sure a large subset of use cases can be handled (eventually). Many EDI
messages, for example, are defined easily in XML and could be handled by
Drill's model.
What is the workaround for your problem? It would be helpful to others if
you could share.

Parth

On Fri, Sep 4, 2015 at 9:57 AM, Jason Altekruse 
wrote:

> One of the times this came up I asked about what the requirements would be,
> because pure XML is actually not well suited for placing in a standard SQL
> table, and some of the constructs are even hard to map into the
> JSON/protobuf model we are currently using for complex data in Drill.
>
> I actually do not have experience with the XML support in SQL databases, so
> I might misunderstand what is desired in terms of supporting XML. If there
> is a subset of XML that does fit well into our current model it should not
> be hard to write a format plugin for it.
>
> The question I had was what if someone has us read a directory full of
> XHTML files, what is the output of that query supposed to look like?
>
> Here is the thread from one of the earlier discussions:
> http://mail-archives.apache.org/mod_mbox/drill-user/201503.mbox/browser
>
> If you can share the kind of XML you would like to read that would be
> helpful in defining this task more concretely. We should try to record some
> of this discussion in a JIRA so someone can come along to implement the
> feature, or it can be prioritized for inclusion in an upcoming release.
>
> On Fri, Sep 4, 2015 at 9:37 AM, Harold Richter <
> hrich...@lancetdatasciences.com> wrote:
>
> > Appreciate the quick response.For the use case in front of me today,
> I
> > can use other Drill features to read the data.   Just trying to pick an
> > optimal strategy.
> > Thank you - HR
> >
> >
> > -Original Message-
> > From: Christopher Matta [mailto:cma...@mapr.com]
> > Sent: Friday, September 04, 2015 11:14 AM
> > To: user@drill.apache.org
> > Subject: Re: xml files with Drill
> >
> > People have been asking for this functionality since Drill existed, if it
> > were as straightforward as everyone says I'd think we'd have it working
> by
> > now.
> >
> > On Friday, September 4, 2015, Jim Scott  wrote:
> >
> > > Drill does not support XML.
> > >
> > > I have talked with folks who have tremendous experience with XML
> > > databases and parsing to see if we can wrangle up help getting that
> > > type of functionality in, but at this point it isn't happening yet.
> > >
> > >
> > > On Fri, Sep 4, 2015 at 9:50 AM, Harold Richter <
> > > hrich...@lancetdatasciences.com > wrote:
> > >
> > > > Looking for documentation, example or syntax to read xml files with
> > > Drill.
> > > >
> > > > Is there a way to invoke an xml parser in Drill?
> > > >
> > > > Thanks-
> > > > hrich...@lancetdatasciences.com 
> > > >
> > > >
> > >
> > >
> > > --
> > > *Jim Scott*
> > > Director, Enterprise Strategy & Architecture
> > > +1 (347) 746-9281
> > > @kingmesal 
> > >
> > > 
> > > [image: MapR Technologies] 
> > >
> > > Now Available - Free Hadoop On-Demand Training <
> > > http://www.mapr.com/training?utm_source=Email_medium=Signature
> > > _campaign=Free%20available
> > > >
> > >
> >
> >
> > --
> > Chris Matta
> > cma...@mapr.com
> > 215-701-3146
> >
>


Re: Using Drill JDBC Driver alongside a recent Calcite library

2015-08-19 Thread Parth Chandra
Hi Piotr,
  You might have to wait till 1.2 comes out. There's a patch outstanding
that needs to be updated and merged that should address this issue.

Parth

On Sun, Aug 16, 2015 at 8:06 AM, Piotr Sokólski p...@pyetras.com wrote:

 Hi.
 I’m using calcite-core 1.2 in my project. Importing the Drill driver into
 the classpath seems to cause all kinds of problems, possibly due to
 different versions of Calcite from my package dependencies and the one
 imported from the driver jar. Is it possible to have the two of them
 working alongside each other? I’m not very familiar with Java’s environment
 and I’m using sbt to manage the dependencies.

 --
 Piotr Sokólski




Re: Null values in lists

2015-08-05 Thread Parth Chandra
Hi Parkavi,

This might be a bug in the convert_from function (please log a bug). Try
writing the same json to a file and it should work with all_text_mode set
to true.
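A minimal sketch of that workaround, assuming the same JSON is saved to a hypothetical file
/tmp/tab.json:

alter session set `store.json.all_text_mode` = true;
select * from dfs.`/tmp/tab.json`;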

Parth





On Wed, Aug 5, 2015 at 12:35 AM, Parkavi Nandagopal parkavi...@hcl.com
wrote:

 Hi,

 How to use null values in list??

 Even I changed `store.json.all_text_mode` = true also it is shouting same
 error.

 Query:

 select convert_from('{abc:[1,2],bvc:[3,4,null,3]}','json') from tab;

 Error: UNSUPPORTED_OPERATION ERROR: Null values are not supported in lists
 by default. Please set `store.json.all_text_mode` to true to read lists
 containing nulls. Be advised that this will treat JSON null values as a
 string containing the word 'null'.

 Line  1
 Column  26
 Field  bvc
 Fragment 0:0

 alter system set `store.json.all_text_mode` = true;

 +-------+-------------------------------------+
 |  ok   |               summary               |
 +-------+-------------------------------------+
 | true  | store.json.all_text_mode updated.   |
 +-------+-------------------------------------+
 1 row selected (0.079 seconds)

 select convert_from('{abc:[1,2],bvc:[3,4,null,3]}','json') from tab;

 Error: UNSUPPORTED_OPERATION ERROR: Null values are not supported in lists
 by default. Please set `store.json.all_text_mode` to true to read lists
 containing nulls. Be advised that this will treat JSON null values as a
 string containing the word 'null'.

 Line  1
 Column  26
 Field  bvc
 Fragment 0:0

 [Error Id: 6c4477ea-a3ea-481a-ab4f-a764a1b57f2b on acesqcxen2:31010]
 (state=,code=0)





Hangout starting now

2015-08-04 Thread Parth Chandra
Hi everyone,

  Please join us for the weekly Drill hangout -

   https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc


Parth


Hangout minutes - 2015-07-28

2015-07-29 Thread Parth Chandra
Attendees:  Andries, Daniel, Hanifi, Jacques, Jason,
Jinfeng, Khurram,  Kristine, Mehant , Neeraja, Parth, Sudheesh (host)

Minutes based on notes from Sudheesh -

1) Jacques working on the following -
  a) RPC changes - Sudheesh/Parth reported a regression in perf numbers
which was unexpected. Tests are being rerun.
  b) Apache log - format plugin.
  c) Support for Double quote.
  d) Allow JSON literals.

1) Parquet filter pushdown - Patch from Adam Gilmore is awaiting review.
This patch will conflict with Steven's work on metadata caching. Metadata
caching needs to go in first.

2) JDBC storage plugin - Patch from Magnus. Parth to follow up to get
updated code.

3) Discussion on Embedded types -
   a) Two types of common problems are being hit -
1) Soft Schema change - Lots of initial nulls and then a type
appears or the type changes to a type that can be promoted to the initial
type. Drill assumes type to be nullable int if it cannot determine the
type. Discussion on using nullable Varchar/Varbinary instead of nullable
int. Suggestion was that we need to introduce some additional types -
i) Introduce a LATE  binding type ( type is not known).
ii) Introduce a NULL type - only null
   iii) Schema sampling to determine schema- use for fast schema.
2) Hard Schema Change - A schema change that is not transitionable.
   b) Open questions - How do we materialize this to the user? How do
clients expect to handle schema change events? What does a BI tool like
Tableau do if a new column is introduced? What is the expectation of a
JDBC/ODBC application (what do the standards specify, if anything)? Neeraja
to follow up and specify.
   c) Proposal to add support for embedded types where each value carries
type information (covered in DRILL-3228). This requires a detailed design
before we begin implementation.

4) Discussion on 'Insert into' (based on Mehant's post)
   a) In general, the feature is expected to behave like in any database.
Complications arise when the user chooses to insert a different schema or
partitions than the original table.
  b) Jacques's main concern regarding this: Do we want Drill to be flexible
and be able to add columns and be able to not specify columns while
inserting or do we want it to behave like a traditional Data Warehouse
where we do ordinal matching and are strict about the number of columns
being inserted into the target table.
   c) We should validate the schema where we can (eg parquet), however we
should start by validating metadata for queries and use that feature in
Insert as opposed to building that in Insert.
   d) If we allow insert into with a different schema and we cannot read
the file, then that would be embarrassing.
   e) If we are trying to solve a specific BI tool use case for inserts
then we should explore going down the route of solving this specific use
case, and treat the insert like CTAS today.


5) Discussion on 'Drop table'
  a) Strict identification of table - Don't drop tables that Drill can't
query.
  b) Fail if there is a file that does not match.
  c) If no impersonation is enabled then drop only drill owned tables.

   More detailed notes on #4 and #5 to be posted by Jacques.


Re: Querying partitioned Parquet files

2015-07-29 Thread Parth Chandra
Yes that would work too, though if there are inconsistencies in the copies
of files made, then the results would be unreliable.

Parth

On Wed, Jul 29, 2015 at 6:45 PM, Adam Gilmore dragoncu...@gmail.com wrote:

 Just to clarify this, Jason - you don't necessarily need HDFS or the like
 for this, if you had say a NFS volume (for example, Amazon Elastic File
 System), you can still accomplish it, right?  Or merely if you had all
 files duplicated on every node locally.

 On Thu, Jul 30, 2015 at 10:00 AM, Jason Altekruse 
 altekruseja...@gmail.com
 wrote:

  Put a little more simply, the node that we end up planning the query on
 is
  going to enumerate the files we will be reading in the query so that we
 can
  assign work to given nodes. This currently assumes we are going to know
 at
  planning time (on the single node) all of the files to be read. This
  happens to work in a single node setup, because all of the work will be
  done on the single machine against one filesystem (the local fs). In the
  distributed case we currently require that we have a connection from each
  node to a DFS.
 
  There is an outstanding feature request to support a use case like
 querying
  a series of server logs, each machine may have a different number of log
  files. We will need to modify the planning process to allow for the
  description of a scan that is more flexible and allows enumerating the
  files on each machine separately when we go to actually read them.
 
  This JIRA discusses the issue you are facing in more detail, I believe we
  should have one outstanding for the feature request as well. I will try
 to
  take a look around for it and open one if I can't find it soon.
 
  https://issues.apache.org/jira/browse/DRILL-3230
 
  On Wed, Jul 29, 2015 at 4:14 PM, Kristine Hahn kh...@maprtech.com
 wrote:
 
   Yes, you need a distributed file system to take advantage of Drill's
  query
   planning. If you use multiple Drillbits and do not use a distributed
 file
   system, the consistency of the fragment information cannot be
 maintained.
  
  
  
   Kristine Hahn
   Sr. Technical Writer
   415-497-8107 @krishahn skype:krishahn
  
  
   On Wed, Jul 29, 2015 at 4:37 AM, Geercken, Uwe 
  uwe.geerc...@swissport.com
   
   wrote:
  
Hello,
   
If I have a list of partitioned parquet files on the filesystem and
 two
drillbits with access to the filesystem and I query the data using
 the
column I partitioned on in the where clause of the query, will both
drillbits share the work?
   
Or do I need a distributed filesystem such as Hadoop underlying to
 make
the bits work in parallel (or work together)?
   
Tks.
   
Uwe
   
  
 



Re: Type confusion and number formatting exceptions

2015-07-28 Thread Parth Chandra
Hi Stefan
  This is the same old issue: Drill does an initial scan to determine the
type of a field. In cases where Drill encounters nulls in the data it
defaults to using a Nullable Int as the type (not a good choice perhaps).
  This leads to all sorts of issues (most of which you're hitting).
  There is an effort to improve this (DRILL-3228) but it will be a while
before this work is completed.

  In the meantime, I can only suggest a workaround: use a cast around
your columns -

 select p.type, coalesce( cast(p.dimensions.dim_type as varchar(20)),
cast(p.dimensions.type as varchar(20))) dimensions_type, count(*) from
`test.json` as p where occurred_at > '2015-07-26' and p.type in
('plan.item.added','plan.item.removed') group by p.type,
coalesce(cast(p.dimensions.dim_type as varchar(20)), cast(p.dimensions.type
as varchar(20)));





On Mon, Jul 27, 2015 at 4:59 AM, Stefán Baxter ste...@activitystream.com
wrote:

 Hi,

 It seems that null values can trigger a column to be treated as a numeric
 one, in expressions evaluation, regardless of content or other indicators
 and that fields in substructures can affect same-named-fields in parent
 structure.
 (1.2-SNAPSHOT, parquet files)

 I have JSON data that can be reduced to this:

 - {"occurred_at":"2015-07-26 08:45:41.234","type":"plan.item.added","dimensions":{"type":null,"dim_type":"Unspecified","category":"Unspecified","sub_category":null}}
 - {"occurred_at":"2015-07-26 08:45:43.598","type":"plan.item.removed","dimensions":{"type":"Unspecified","dim_type":null,"category":"Unspecified","sub_category":null}}
 - {"occurred_at":"2015-07-26 08:45:44.241","type":"plan.item.removed","dimensions":{"type":"To See","category":"Nature","sub_category":"Waterfalls"}}

 * notice the discrepancy in the dimensions structure that the type field is
 either called type or dim_type (slightly relevant for the rest of this
 case)


 *1. Query where dimensions are not involved*

 select p.type, count(*) from
 dfs.tmp.`/analytics/processed/some-tenant/events` as p where occurred_at
  > '2015-07-26' and p.type in ('plan.item.added','plan.item.removed') group
 by p.type;
 +--------------------+---------+
 |        type        | EXPR$1  |
 +--------------------+---------+
 | plan.item.removed  | 947     |
 | plan.item.added    | 40342   |
 +--------------------+---------+
 2 rows selected (0.508 seconds)


 *2. Same query but involves dimension.type as well*

 select p.type, coalesce(p.dimensions.dim_type, p.dimensions.type)
 dimensions_type, count(*) from
 dfs.tmp.`/analytics/processed/some-tenant/events` as p where occurred_at
  > '2015-07-26' and p.type in ('plan.item.added','plan.item.removed') group
 by p.type, coalesce(p.dimensions.dim_type, p.dimensions.type);

 Error: SYSTEM ERROR: NumberFormatException: To See
 Fragment 2:0
 [Error Id: 4756f549-cc47-43e5-899e-10a11efb60ea on localhost:31010]
 (state=,code=0)


 I can provide test data if this is not enough to reproduce this bug.

 Regards,
  -Stefán



Drill Hangout (2015-07-21) minutes

2015-07-23 Thread Parth Chandra
Drill Hangout 2015-07-21

Participants: Jacques, Parth (scribe), Sudheesh, Hakim, Khurram, Aman,
Jinfeng, Kristine, Sean

Feature list for Drill 1.2 was discussed. The following items were
considered (discussion/comments, if any, are summarized with each item):


   1. Memory allocator improvements - Spillover from 1.1
   2. Faster reading of Hive parquet tables - Spillover from 1.1.
   3. Enhance/cleanup test framework and publish to community

The dev team has a set of tests that are a requirement for committing.
These tests should be made available so that community contributors can run
them independently.

   4. Additional window functions - Dev - NTILE, LEAD, LAG, FIRST_VALUE,
      LAST_VALUE
   5. Faster metadata read for 1000s of Parquet files
   6. Rowkey filter pushdown improvements
   7. Support Insert into T Select From

 A longish discussion on this. (I might have missed some points). Concerns
about how to maintain the previous table metadata, especially partition
metadata. There was a discussion around allowing a table to have files
where some files have different or no partitions from the partitions
defined when the table was first created. Suggestion to incorporate the
metadata in a .drill file.

Some more clarity to be provided about the functionality. More discussion
to continue in the JIRA.

   8. Support Drop table

A small discussion on some restrictions that will need to be imposed. In
particular, not allow access to root (/) and also that we should probably
validate that the table being dropped is, in fact, a table.

   9. Security for the WEB UI

Some considerations are whether we need to enable
SSL and some form of authentication/authorization for the web UI. Also
whether we need to consider the same for the REST API. A second question is
whether we should include the need to limit access to workspaces defined in
the web UI. One suggestion was whether we need to create workspaces similar
to views (i.e. defined in a .drill file) and then use the same access
control mechanisms (i.e. the one provided by the file system).

   10. JDBC driver
   11. Super bugs - flatten

  There was a discussion on whether we need to consider flatten to be an
instance of a User Defined Table Function. The issues we see in flatten are
related less to the flatten logic and more to handling edge cases of batch
boundaries, and vectors. The idea behind supporting UDTFs would be to write
the framework that handles the complexity of handling input and producing
output and the UDTF itself would need to implement an input and an output
operator. Flatten can then be reimplemented as a UDTF.

   12. Super bugs - convert function
   13. Super bug - 2010: MergeJoin incorrect results. (Suggestion that the
       solution might be to go the UDTF way)
   14. Super bug - DRILL-3121 Support interpreter based execution for hive
       partition pruning.


Hangout happening now

2015-07-21 Thread Parth Chandra
Come join the Drill community as we discuss what has been happening lately
and what is in the pipeline. All are welcome, if you know about Drill, want
to know more or just want to listen in.

Link: https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc

Thanks


Re: Double quotes in alias name not working

2015-07-17 Thread Parth Chandra
+1.


On Fri, Jul 17, 2015 at 9:31 AM, Jacques Nadeau jacq...@apache.org wrote:

 Jinfeng, that is a great idea.  Do you want to open an enhancement request
 and we'll see how many votes that feature gets?

 On Fri, Jul 17, 2015 at 7:46 AM, Jinfeng Ni jinfengn...@gmail.com wrote:

  backtick (“`”) is used in MySQL as identifier quotes[1]. Drill uses the
  same way as MySQL.
 
  I think you made a good point. Probably Drill should add an ANSI_QUOTES
  option,
  just like what MySql does. That way, user does not have to change the
  quotes in their existing query.
 
  1. http://dev.mysql.com/doc/refman/5.7/en/identifiers.html
 
  On Fri, Jul 17, 2015 at 6:31 AM, Siva B sivd...@outlook.com wrote:
 
   I hope drill should supports ANSI SQL. Any particular reason for not
   supporting quotes.
   I am trying to connect drill with my legacy application. It has huge
 set
   of queries. I can't change every quotes with bachtick.
   Thanks.
  
Date: Fri, 17 Jul 2015 05:41:06 -0700
Subject: Re: Double quotes in alias name not working
From: adene...@maprtech.com
To: user@drill.apache.org
   
you only need to use backtick ` symbol when you want to use a
 reserved
word (`user` in this case). I don't think Drill supports double
  quotes.
   
Is there a specific reason you want to use double quotes instead of
backticks ?
   
On Fri, Jul 17, 2015 at 5:23 AM, Siva B sivd...@outlook.com wrote:
   
 Hi,
 Why drill not accepting Column alias name with quotes is not
 working.
   It
 works only with ` symbol.
 Working Query: SELECT id, name AS `user` from users;
 Actual Query: SELECT id, name AS user from users; [How to make it
   this
 works.]
 Please share workaround for this scenario.
 Thanks

   
   
   
   
--
   
Abdelhakim Deneche
   
Software Engineer
   
  http://www.mapr.com/
   
   
Now Available - Free Hadoop On-Demand Training

  
 
 http://www.mapr.com/training?utm_source=Emailutm_medium=Signatureutm_campaign=Free%20available
   
  
  
 



Re: Parsing exception when querying multiple files in a directory

2015-06-30 Thread Parth Chandra
I think there is no way at the moment to determine which file the error was
in except to grep. I've logged a JIRA for this (DRILL-3428).

On Mon, Jun 29, 2015 at 1:41 AM, Chi-Lang Ngo chil...@gmail.com wrote:

 I'm getting this exception while parsing exception when querying a
 directory with thousands of tsv files:

 ...TextParsingException: Error processing input: Cannot use newline
 character within quoted string, line=37, char=8855. Content parsed: [ ]

 Is there a way to find out which file caused the exception w/o grep-ing
 through all of them?



Hangout happening now

2015-06-23 Thread Parth Chandra
Come join the Drill community as we discuss what has been happening lately
and what is in the pipeline. All are welcome, if you know about Drill, want
to know more or just want to listen in.

Link: https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc

Thanks