The flattening is surprising unless we're spending a long time in query setup. (You can check this by looking at the query start time for the 0-0 fragment in the query profile screen.) If you share the profile JSON files, we can also take a look and see what is up.

thanks,
Jacques

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Mon, Mar 7, 2016 at 3:02 PM, Eric Pederson <eric...@gmail.com> wrote:

> We are using MapR M3 and are querying multiple JSON files - around 250
> files at 1.5 GB per file. We have a small cluster of three machines
> running Drill 1.4. The JSON is nested three-four levels deep, in a format
> like:
> {
>   { "group1":
>     { "field1": 42,
>       "field2": [ "a", "b", "c" ],
>       ...
>     }
>   { "group2":
>     ....
>   }
>   ...
> }
>
> There are about 500 objects like this in each JSON file.
>
> I've been testing a set of queries that scan all of the data (we're
> investigating a partitioning strategy but haven't settled on one that will
> fit all of our queries which are fairly ad-hoc). These full-scan queries
> typically take 1 minute, 20 seconds using the default settings. If I limit
> the query to a single file the query takes a few seconds.
>
> I wanted to see how the number of Drillbits would impact the query time, to
> try to extrapolate to the number of servers needed to reach a performance
> number. Here are the numbers that we saw:
> - 1 Drillbit: 3:45
> - 2 Drillbits: 1:56
> - 3 Drillbits: 1:20
>
> The performance flattens out between two and three Drillbits. I was
> surprised to see that, given the single file query performance. I was
> hoping to throw hardware at the performance a bit more. Is that
> surprising to you?
>
> A somewhat related question. Does Drill take advantage of HDFS locality?
> That is, will it send certain fragments to certain boxes because it knows
> those boxes have the data replicated locally? Actually in our setup (3
> servers) that might be a moot point assuming every box has all blocks. I'm
> not sure if MapRFS changes that.
>
> I also tried converting the JSON files to Parquet using CTAS. The Parquet
> queries took much longer than the JSON queries. Is that expected as well?
>
> Thanks,
>
> --
> Sent from Gmail Mobile
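
The three timings quoted above can be sanity-checked with a little arithmetic. The sketch below is only an illustration, not anything Drill reports: it converts the times to seconds, computes speedup and per-Drillbit efficiency, and fits a simple assumed model T(n) = fixed + parallel / n (a fixed setup cost plus perfectly divisible work) to the 1- and 2-Drillbit runs. The model and all names in the code are assumptions made for this example, and three data points cannot confirm it.

```python
# Timings copied from the quoted message (minutes:seconds per Drillbit count).
timings = {1: "3:45", 2: "1:56", 3: "1:20"}


def to_seconds(mmss: str) -> int:
    """Convert an 'M:SS' string to whole seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


seconds = {n: to_seconds(t) for n, t in timings.items()}


def speedup(n: int) -> float:
    """Speedup of the n-Drillbit run relative to the 1-Drillbit run."""
    return seconds[1] / seconds[n]


def efficiency(n: int) -> float:
    """Speedup divided by Drillbit count (1.0 would be perfectly linear)."""
    return speedup(n) / n


# Assumed model (hypothetical, not Drill's cost model):
#   T(n) = fixed + parallel / n
# Solving with T(1) and T(2):
#   T(1) - T(2) = parallel / 2
parallel = 2 * (seconds[1] - seconds[2])
fixed = seconds[1] - parallel


def predicted(n: int) -> float:
    """Run time in seconds that the assumed model predicts for n Drillbits."""
    return fixed + parallel / n


for n in sorted(seconds):
    print(
        f"{n} Drillbit(s): measured {seconds[n]} s, "
        f"speedup {speedup(n):.2f}, efficiency {efficiency(n):.2f}, "
        f"model {predicted(n):.1f} s"
    )
print(f"fitted fixed cost: {fixed} s, divisible work: {parallel} s")
```

Under that assumed model the fit gives roughly 7 seconds of fixed cost and 218 seconds of divisible work, and it predicts about 80 seconds for three Drillbits, which matches the measured 1:20. Read that way, the numbers look close to linear rather than flat, though the query profile remains the place to see where time is actually spent.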