You may want to compare the query plans for the 3 scenarios to see which 
operators the time is spent on and how well they are parallelized. 

The expectation would be that Parquet will perform better than JSON.
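
As a rough sanity check on the scaling numbers quoted below, it can help to
turn the three timings into speedup and parallel efficiency (speedup divided
by the number of Drillbits). This is only a sketch: the helper names
(to_seconds, scaling) are made up for illustration, and the only inputs are
the three wall-clock times from the original message.

```python
def to_seconds(mmss: str) -> int:
    """Convert a 'minutes:seconds' string such as '3:45' to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# Wall-clock times reported in the quoted message, keyed by Drillbit count.
timings = {1: "3:45", 2: "1:56", 3: "1:20"}


def scaling(timings):
    """Return (drillbits, seconds, speedup, efficiency) relative to 1 Drillbit."""
    base = to_seconds(timings[1])
    rows = []
    for n in sorted(timings):
        elapsed = to_seconds(timings[n])
        speedup = base / elapsed   # how many times faster than 1 Drillbit
        efficiency = speedup / n   # 1.0 would be perfectly linear scaling
        rows.append((n, elapsed, speedup, efficiency))
    return rows


for n, elapsed, speedup, eff in scaling(timings):
    print(f"{n} Drillbit(s): {elapsed:>3d} s  "
          f"speedup {speedup:.2f}x  efficiency {eff:.0%}")
```

Comparing the efficiency column against 100% gives a quick read on how far
each step is from linear scaling, before digging into the per-operator times
in the query profile.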

--Andries


> On Mar 7, 2016, at 3:02 PM, Eric Pederson <eric...@gmail.com> wrote:
> 
> We are using MapR M3 and are querying multiple JSON files - around 250
> files at 1.5 GB per file.   We have a small cluster of three machines
> running Drill 1.4.  The JSON is nested three-four levels deep, in a format
> like:
> {
>   "group1":
>     { "field1": 42,
>       "field2": [ "a", "b", "c" ],
>       ...
>     },
>   "group2":
>     {
>       ....
>     },
>   ...
> }
> 
> There are about 500 objects like this in each JSON file.
> 
> I've been testing a set of queries that scan all of the data (we're
> investigating a partitioning strategy but haven't settled on one that will
> fit all of our queries which are fairly ad-hoc).   These full-scan queries
> typically take 1 minute, 20 seconds using the default settings.   If I limit
> the query to a single file the query takes a few seconds.
> 
> I wanted to see how the number of Drillbits would impact the query time, to
> try to extrapolate to the number of servers needed to reach a performance
> number.   Here are the numbers that we saw:
> - 1 Drillbit: 3:45
> - 2 Drillbits: 1:56
> - 3 Drillbits: 1:20
> 
> The performance flattens out between two and three Drillbits.   I was
> surprised to see that, given the single file query performance.  I was
> hoping to throw hardware at the performance a bit more.   Is that
> surprising to you?
> 
> A somewhat related question.  Does Drill take advantage of HDFS locality?
> That is, will it send certain fragments to certain boxes because it knows
> those boxes have the data replicated locally?  Actually in our setup (3
> servers) that might be a moot point assuming every box has all blocks.  I'm
> not sure if MapRFS changes that.
> 
> I also tried converting the JSON files to Parquet using CTAS.  The Parquet
> queries took much longer than the JSON queries.  Is that expected as well?
> 
> Thanks,
> 
> 
> 
> -- 
> Sent from Gmail Mobile
