Re: Parallelism / data locality in HDFS/MapRFS

Jacques Nadeau Mon, 07 Mar 2016 19:15:22 -0800

The flattening is surprising unless we're spending a long time in query
setup. (You can check this by looking at the query start time for the 0-0
fragment on the query profile screen.) If you share the profile JSON files,
we can also take a look and see what is going on.
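
One rough way to check the query-setup idea against the timings quoted
below is to compute the speedup per Drillbit and fit a simple model of the
form T(n) = fixed + parallel / n to the first two measurements. The sketch
below is only that: the model, the function names and the assumption that
all non-parallel time is a constant are illustrative and do not come from
Drill itself. Only the three timings are taken from the thread.

```python
# Wall-clock timings quoted in the original message (minutes:seconds),
# keyed by the number of Drillbits.
timings = {1: "3:45", 2: "1:56", 3: "1:20"}


def to_seconds(mmss):
    """Convert an 'M:SS' string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def scaling_table(timings):
    """Return (drillbits, seconds, speedup, efficiency) rows.

    Speedup is relative to the single-Drillbit run; efficiency is
    speedup divided by the number of Drillbits.
    """
    base = to_seconds(timings[1])
    rows = []
    for n in sorted(timings):
        t = to_seconds(timings[n])
        speedup = base / t
        rows.append((n, t, speedup, speedup / n))
    return rows


def fit_fixed_overhead(t1, t2):
    """Solve T(n) = fixed + parallel / n from the n=1 and n=2 timings.

    This is an assumed, simplified model: 'fixed' lumps together
    everything that does not shrink as Drillbits are added.
    """
    parallel = 2 * (t1 - t2)
    fixed = t1 - parallel
    return fixed, parallel


rows = scaling_table(timings)
for n, t, speedup, efficiency in rows:
    print(f"{n} Drillbit(s): {t:3d}s  speedup {speedup:.2f}x  "
          f"efficiency {efficiency:.0%}")

fixed, parallel = fit_fixed_overhead(to_seconds(timings[1]),
                                     to_seconds(timings[2]))
predicted_t3 = fixed + parallel / 3
print(f"fitted fixed part: {fixed}s, parallel part: {parallel}s")
print(f"predicted time on 3 Drillbits: {predicted_t3:.1f}s")
```

If the predicted three-Drillbit time lands close to the measured 1:20, the
fixed part of the fit is a rough estimate of how much time does not scale
with the cluster, which can then be compared with the 0-0 fragment start
time in the profile. Three data points are very few, so treat the result
as a hint rather than a measurement.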


thanks,
Jacques


--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Mon, Mar 7, 2016 at 3:02 PM, Eric Pederson <eric...@gmail.com> wrote:

> We are using MapR M3 and are querying multiple JSON files - around 250
> files at 1.5 GB per file.   We have a small cluster of three machines
> running Drill 1.4.  The JSON is nested three to four levels deep, in a
> format like:
> {
>   "group1": {
>     "field1": 42,
>     "field2": [ "a", "b", "c" ],
>     ...
>   },
>   "group2": {
>     ...
>   },
>   ...
> }
>
> There are about 500 objects like this in each JSON file.
>
> I've been testing a set of queries that scan all of the data (we're
> investigating a partitioning strategy but haven't settled on one that will
> fit all of our queries, which are fairly ad hoc).   These full-scan queries
> typically take 1 minute, 20 seconds using the default settings.  If I limit
> the query to a single file, the query takes a few seconds.
>
> I wanted to see how the number of Drillbits would impact the query time, to
> try to extrapolate to the number of servers needed to reach a performance
> target.   Here are the numbers that we saw:
> - 1 Drillbit: 3:45
> - 2 Drillbits: 1:56
> - 3 Drillbits: 1:20
>
> The performance flattens out between two and three Drillbits.   I was
> surprised to see that, given the single file query performance.  I was
> hoping to throw hardware at the performance a bit more.   Is that
> surprising to you?
>
> A somewhat related question.  Does Drill take advantage of HDFS locality?
> That is, will it send certain fragments to certain boxes because it knows
> those boxes have the data replicated locally?  Actually in our setup (3
> servers) that might be a moot point assuming every box has all blocks.  I'm
> not sure if MapRFS changes that.
>
> I also tried converting the JSON files to Parquet using CTAS.  The Parquet
> queries took much longer than the JSON queries.  Is that expected as well?
>
> Thanks,
>
>
>
> --
> Sent from Gmail Mobile
>

Re: Parallelism / data locality in HDFS/MapRFS
