Parquet Block Size Detection

2016-07-01 Thread John Omernik
Is there any way, with Drill or with other tools, given a Parquet file, to detect the block size it was written with? I am copying data from one cluster to another, and trying to determine the block size. While I was able to get the size by asking the devs, I was wondering, is there any way to

Re: Information about ENQUEUED state in Drill

2016-07-01 Thread John Omernik
Interestingly enough, when I disable queuing, the query sits in the "STARTING" phase for the same amount of time it would sit in ENQUEUING if queuing was enabled. Excessive planning? When looking at the UI, how can I validate this? On Fri, Jul 1, 2016 at 8:14 AM, John Omernik

Re: User Impersonation

2016-07-01 Thread scott
Am I able to open a Jira ticket on this? Or, is this something a developer has to do? Scott On Thu, Jun 30, 2016 at 5:17 PM, scott wrote: > Impersonation using the default dfs configuration is not supported? The > documentation for Impersonation Support says that File

Re: Performance querying a single column out of a parquet file

2016-07-01 Thread John Omernik
Hey all, some colleagues are looking at this on Impala (IMPALA-2017) and asked if Drill could do this. (Late/Lazy Materialization of columns). While the performance gain on tables with fewer columns may not be huge, when you are looking at really wide tables, with disparate data types, this can be

Re: Information about ENQUEUED state in Drill

2016-07-01 Thread John Omernik
I don't see that, but here's a question: when it's enqueued, it must have to do some level of planning before determining which queue it's going to fall into ... correct? I wonder if that planning takes too long, if that's what's causing the enqueued state? On Thu, Jun 30, 2016 at 1:09 PM,

Re: Question about querying array fields in parquet files

2016-07-01 Thread David Kincaid
Thanks, Kunal. Yes, my issue looks very similar to that one, only I'm reading from Parquet files. I am creating Avro records which I am writing into the Parquet files, but in the end Drill is just reading the Parquet files. Should I file a Jira issue that is more specific to the problem I'm

Re: User Impersonation

2016-07-01 Thread Ted Dunning
You can certainly open a ticket for this. The problem is resolving the ticket. Impersonation explicitly depends on some of the functionality available with certain distributed file systems like MapR FS and HDFS. This allows a drill bit to just do file system level impersonation based on

Re: User Impersonation

2016-07-01 Thread scott
Great explanation, Ted. I think I will still open the ticket, if nothing else to address the gaps in documentation. Thanks, Scott On Fri, Jul 1, 2016 at 2:30 PM, Ted Dunning wrote: > You can certainly open a ticket for this. > > The problem is resolving the ticket. > >

Re: gzipped json files not named .json.gz

2016-07-01 Thread Scott Kinney
Hi Jason, Thanks for getting back to me. We were able to get the spark job to append the .json.gz so we are ok for now. I tried working with local files of json. Drill will not query it if it's not named .json. I didn't try gzipped. But since we got them renamed in s3 I'm out of the woods.

array in json with mixed values (int and float)

2016-07-01 Thread Scott Kinney
Running a query on a JSON file via the API returns an error that I don't see when running the same query in the REPL. "errorMessage" : "UNSUPPORTED_OPERATION ERROR: In a list of type FLOAT8, encountered a value of type BIGINT. Drill does not support lists of different types.\n\nFile
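The error above is triggered by a JSON array that mixes integer and floating-point values. As a rough diagnostic sketch (not part of Drill; the function name and the sample document are made up for illustration), the following walks a parsed JSON document and reports the path of every list that contains both kinds of number:

```python
import json

def find_mixed_number_lists(node, path="$"):
    """Yield the paths of lists that contain both int and float values."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from find_mixed_number_lists(child, path + "." + key)
    elif isinstance(node, list):
        # type() is used instead of isinstance() so that booleans,
        # which subclass int in Python, are not counted as integers.
        kinds = {type(item) for item in node if type(item) in (int, float)}
        if kinds == {int, float}:
            yield path
        for index, child in enumerate(node):
            yield from find_mixed_number_lists(child, "{}[{}]".format(path, index))

# Hypothetical sample: "readings" mixes floats and an int, "ids" does not.
sample = json.loads('{"readings": [1.5, 2, 3.25], "ids": [1, 2, 3]}')
mixed = list(find_mixed_number_lists(sample))
```

Running this over a sample of the source files before querying shows which fields are likely to hit the FLOAT8/BIGINT error.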

Re: Information about ENQUEUED state in Drill

2016-07-01 Thread Abdel Hakim Deneche
Most likely planning is taking longer to finish. Once it's done, it should move to either ENQUEUED if queuing was enabled or RUNNING if it was disabled. One easy way to confirm if planning is indeed taking too long is to just run an "EXPLAIN PLAN FOR" on the query and see how long it takes to finish. On
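The EXPLAIN PLAN FOR check suggested above can be timed from a script. This is a minimal sketch, assuming a Drillbit's REST endpoint is reachable at the placeholder address below; the host, port, and function names are assumptions for illustration. It measures the whole HTTP round trip, so it is only an upper bound on planning time.

```python
import json
import time
import urllib.request

# Assumption: adjust host and port to point at one of your Drillbits.
DRILL_URL = "http://localhost:8047/query.json"

def build_explain_request(sql, url=DRILL_URL):
    """Wrap a query in EXPLAIN PLAN FOR and build the REST request for it."""
    body = json.dumps({"queryType": "SQL", "query": "EXPLAIN PLAN FOR " + sql})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def time_planning(sql, url=DRILL_URL):
    """Return the wall-clock seconds the EXPLAIN PLAN FOR round trip takes."""
    request = build_explain_request(sql, url)
    started = time.monotonic()
    with urllib.request.urlopen(request) as response:
        response.read()  # wait for the full plan to come back
    return time.monotonic() - started
```

Calling `time_planning("SELECT ...")` with the slow query (without a trailing semicolon) and comparing the result against the time spent in STARTING or ENQUEUED should show whether planning accounts for the delay.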

Re: Performance querying a single column out of a parquet file

2016-07-01 Thread Parth Chandra
This has come up in the past in some other context. At the moment though, there is no JIRA for this. On Fri, Jul 1, 2016 at 6:10 AM, John Omernik wrote: > Hey all, some colleagues are looking at this on Impala (IMPALA-2017) and > asked if Drill could do this. (Late/Lazy

Re: Parquet Block Size Detection

2016-07-01 Thread Parth Chandra
parquet-tools perhaps? https://github.com/Parquet/parquet-mr/tree/master/parquet-tools On Fri, Jul 1, 2016 at 5:39 AM, John Omernik wrote: > Is there any way, with Drill or with other tools, given a Parquet file, to > detect the block size it was written with? I am copying

Re: Parquet Block Size Detection

2016-07-01 Thread Abdel Hakim Deneche
some answers inline: On Fri, Jul 1, 2016 at 10:56 AM, John Omernik wrote: > I looked at that, and both the meta and schema options didn't provide me > block size. > > I may be looking at parquet block size wrong, so let me toss out some > observations, and inferences I am

Re: Parquet Block Size Detection

2016-07-01 Thread John Omernik
I looked at that, and both the meta and schema options didn't provide me block size. I may be looking at parquet block size wrong, so let me toss out some observations, and inferences I am making, and then others who know the spec/format can confirm or correct. 1. The block size in parquet is

Re: User Impersonation

2016-07-01 Thread Paul Rogers
In the DB world, this issue was resolved by running the DB server as a privileged user. The user then logs into the DB to do work. This means that the DB itself is trusted, but clients are not. Of course, DBs have their own user system so that users are defined in the DB; the DB user owns all

Re: array in json with mixed values (int and float)

2016-07-01 Thread Parth Chandra
I haven't tried this myself, but setting store.json.read_numbers_as_double to true might help. On Fri, Jul 1, 2016 at 9:27 AM, Scott Kinney wrote: > When running a query on a json file via the api returns an error that i > dont see when running the same query in the

Re: Performance querying a single column out of a parquet file

2016-07-01 Thread John Omernik
I created a JIRA for discussion. This could be a huge performance win if it were possible. https://issues.apache.org/jira/browse/DRILL-4758 On Fri, Jul 1, 2016 at 12:33 PM, Parth Chandra wrote: > This has come up in the past in some other context. At the moment though,

Re: array in json with mixed values (int and float)

2016-07-01 Thread Scott Kinney
It didn't work when I did an alter session via the API, but it worked when I did an alter system via the REPL. I'm guessing each query via the API is its own session, so an alter session via the API only lasts for that one call? Anywho, that did the trick, Parth, thank you!

Re: AWS EMR bootstrap script to install and configure Drill

2016-07-01 Thread David Kincaid
Thanks, Paul. This does look like a good place to start. Unfortunately, it fails right off the bat due to the emr/common library not being available. Not being a Ruby guy, I'm not sure where to go from here. Is there some package that I can easily install to get that library? - Dave Here's the

Re: array in json with mixed values (int and float)

2016-07-01 Thread Scott Kinney
That looks promising but didn't work.

Re: Parquet Block Size Detection

2016-07-01 Thread John Omernik
In addition 7. Generally speaking, keeping the number of files low will help in multiple phases of planning/execution. True/False On Fri, Jul 1, 2016 at 12:56 PM, John Omernik wrote: > I looked at that, and both the meta and schema options didn't provide me > block size. > > I

Re: Parquet Block Size Detection

2016-07-01 Thread Parth Chandra
For metadata, you can use 'parquet-tools dump' and pipe the output to more/less. Parquet dump will print the block (aka row group) and page level metadata. It will then dump all the data so be prepared to cancel when that happens. Setting dfs.blocksize == parquet.blocksize is a very good idea and

Re: Information about ENQUEUED state in Drill

2016-07-01 Thread John Omernik
Yes, the planning is taking a long time. That is the issue. So, when queuing is enabled, "planning" happens while the status is ENQUEUED. If queuing is not enabled, "planning" happens while the status is "STARTING". (Based on my observations). Are there any good docs, or sources to look at why

Re: Parquet Block Size Detection

2016-07-01 Thread Abdel Hakim Deneche
Just make sure you enable parquet metadata caching, otherwise the more files you have the more time Drill will spend reading the metadata from every single file. On Fri, Jul 1, 2016 at 11:17 AM, John Omernik wrote: > In addition > 7. Generally speaking, keeping number of files

Re: array in json with mixed values (int and float)

2016-07-01 Thread Parth Chandra
If you mean the REST api, then yes, there is no session maintained unless impersonation is enabled. In that case the alter session would not have any effect. On Fri, Jul 1, 2016 at 11:58 AM, Scott Kinney wrote: > it didn't work when i did an alter session via the api
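Since an ALTER SESSION sent over REST does not carry over to the next call, one workaround consistent with this thread is to set the option with ALTER SYSTEM instead. The sketch below only builds the JSON body for such a request; the helper name is hypothetical, and the option shown is the one mentioned earlier in the thread. Note that ALTER SYSTEM changes the setting for every user of the cluster.

```python
import json

def alter_system_payload(option, value):
    """Build the JSON body for an ALTER SYSTEM statement sent over REST."""
    if isinstance(value, bool):
        literal = "true" if value else "false"
    elif isinstance(value, (int, float)):
        literal = str(value)
    else:
        # Quote string values, doubling any embedded single quotes.
        literal = "'" + str(value).replace("'", "''") + "'"
    sql = "ALTER SYSTEM SET `{}` = {}".format(option, literal)
    return json.dumps({"queryType": "SQL", "query": sql})

payload = alter_system_payload("store.json.read_numbers_as_double", True)
```

The resulting string can be POSTed to the Drillbit's `/query.json` endpoint with a `Content-Type: application/json` header.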

Re: Querying Parquet: Filtering on a sorted column

2016-07-01 Thread rahul challapalli
This is something which is not currently supported. The "parquet filter pushdown" feature should be able to achieve this. It's still under development. - Rahul On Fri, Jul 1, 2016 at 12:10 PM, Dan Wild wrote: > Hi, > > I'm attempting to query a directory of parquet files

Re: Parquet Block Size Detection

2016-07-01 Thread John Omernik
I am looking forward to the MapR 1.7 dev preview because of the metadata user impersonation JIRA fix. "Drill always writes one row group per file." So is this one parquet block? "row group" is a new term to this email :) On Fri, Jul 1, 2016 at 2:09 PM, Abdel Hakim Deneche

Re: User Impersonation

2016-07-01 Thread Keys Botzum
The way I'd answer the question is that if you need authorization to be enforced by the underlying data store, then the data store must have the capability of inbound impersonation. Over time, many storage systems have added that function. There was a time in the not too distant past when many

Re: Information about ENQUEUED state in Drill

2016-07-01 Thread Parth Chandra
The plan itself may have a hint as to why it took so long. One reason is if there is a very large number of files and Drill is reading file metadata for every file during the planning stage. This operation is not distributed and can sometimes become a bottleneck. On Fri, Jul 1, 2016 at 10:44 AM,

Re: missing data in json structure when using web / api

2016-07-01 Thread John Omernik
Are you using options that are maintained in the cli but not the rest API due to a lack of impersonation? On Friday, July 1, 2016, Scott Kinney wrote: > When i query from sqlline i can see all the data, very complicated / > nested json structure but when i query with the

Re: missing data in json structure when using web / api

2016-07-01 Thread Scott Kinney
Not that I know of but I'm new to drill. I've done 'alter system' for json all_text_mode & read_numbers_as_double. Do you know of a setting that might cause something like this?

Re: User Impersonation

2016-07-01 Thread Ted Dunning
On Fri, Jul 1, 2016 at 11:50 AM, Paul Rogers wrote: > All of this is a long-winded way of asking this: What do other “big data” > tools do to solve this problem? If one is doing big data, should a > distributed file system be a requirement if one wants security? > Other

Querying Parquet: Filtering on a sorted column

2016-07-01 Thread Dan Wild
Hi, I'm attempting to query a directory of parquet files that are partitioned on column A (int) and sorted on column B (also int). When I run a query such as SELECT * FROM mydirectory WHERE A = 123 AND B = 456, I can see that the physical query plan is using the criteria on A to choose the

missing data in json structure when using web / api

2016-07-01 Thread Scott Kinney
When i query from sqlline i can see all the data, very complicated / nested json structure but when i query with the api or the web ui a lot of the data is missing.

Re: AWS EMR bootstrap script to install and configure Drill

2016-07-01 Thread Paul Mogren
Did you by chance attempt to run it outside of EMR? AWS is known to provide a local install of EMR integration libraries on the nodes, but not publish them to OSS repositories. This was derived from the old https://github.com/awslabs/emr-bootstrap-actions/tree/master/drill and worked for us