Re: Parallelism / data locality in HDFS/MapRFS

Eric Pederson Tue, 08 Mar 2016 14:03:06 -0800

Here is the query plan JSON for the same query against the Parquet file
that I CTASed: https://gist.github.com/sourcedelica/b05eeaf5df9e63b29654.
It took 1794.618 seconds.



-- Eric

On Tue, Mar 8, 2016 at 1:44 PM, Eric Pederson <[email protected]> wrote:

> Hi everyone -
>
> Thanks for your feedback.   Answers to your questions below.
>
> The query plan JSON for the JSON query (where the performance flattened
> out) is at https://gist.github.com/sourcedelica/e826178a7de7e059fa9a.
> This was the plan with all three Drillbits running.
>
> Some more detail on the structure of the JSON.  There are 8 objects at the
> first level.  The biggest one has a little over 500 fields, the majority of
> them being arrays of numbers or arrays of strings.  The next biggest group
> contains around 300 flat objects.  The rest of the groups are fairly small,
> 20-40 fields.
>
> I didn't do any casting in the CTAS from JSON to Parquet due to the sheer
> number of fields. :)
>
> Thanks,
>
>
>
> -- Eric
>
> On Mon, Mar 7, 2016 at 6:02 PM, Eric Pederson <[email protected]> wrote:
>
>> We are using MapR M3 and are querying multiple JSON files - around 250
>> files at 1.5 GB per file.   We have a small cluster of three machines
>> running Drill 1.4.  The JSON is nested three to four levels deep, in a format
>> like:
>> {
>>   "group1":
>>      { "field1": 42,
>>        "field2": [ "a", "b", "c" ],
>>        ...
>>      },
>>   "group2":
>>      {
>>        ....
>>      },
>>   ...
>> }
>>
>> There are about 500 objects like this in each JSON file.
>>
>> I've been testing a set of queries that scan all of the data (we're
>> investigating a partitioning strategy but haven't settled on one that will
>> fit all of our queries which are fairly ad-hoc).   These full-scan queries
>> typically take 1 minute, 20 seconds using the default settings.  If I limit
>> the query to a single file, it takes a few seconds.
>>
>> I wanted to see how the number of Drillbits would impact the query time,
>> to try to extrapolate to the number of servers needed to reach a
>> performance number.   Here are the numbers that we saw:
>> - 1 Drillbit: 3:45
>> - 2 Drillbits: 1:56
>> - 3 Drillbits: 1:20
>>
>> The performance flattens out between two and three Drillbits.   I was
>> surprised to see that, given the single file query performance.  I was
>> hoping to throw hardware at the performance a bit more.   Is that
>> surprising to you?
>>
>> A somewhat related question.  Does Drill take advantage of HDFS
>> locality?  That is, will it send certain fragments to certain boxes because
>> it knows those boxes have the data replicated locally?  Actually in our
>> setup (3 servers) that might be a moot point assuming every box has all
>> blocks.  I'm not sure if MapRFS changes that.
>>
>> I also tried converting the JSON files to Parquet using CTAS.  The
>> Parquet queries took much longer than the JSON queries.  Is that expected
>> as well?
>>
>> Thanks,
>>
>>
>>
>> --
>> Sent from Gmail Mobile
>>
>
>
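
The Drillbit timings quoted above (3:45, 1:56 and 1:20 for one, two and three Drillbits) can be turned into speedup and parallel-efficiency figures, which makes it easier to judge how far the scaling departs from linear. Below is a minimal Python sketch of that arithmetic. The helper names `to_seconds` and `scaling_table` are hypothetical, and the last line assumes the 1794.618-second Parquet run should be compared against the three-Drillbit JSON run, which the thread suggests but does not state outright.

```python
def to_seconds(mmss: str) -> int:
    """Convert an 'M:SS' elapsed-time string to whole seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# Timings as reported in the thread: Drillbit count -> elapsed time.
timings = {1: "3:45", 2: "1:56", 3: "1:20"}


def scaling_table(timings):
    """Return (drillbits, seconds, speedup, efficiency) rows.

    Speedup is relative to the single-Drillbit run; efficiency is
    speedup divided by the number of Drillbits (1.0 = perfectly linear).
    """
    base = to_seconds(timings[1])
    rows = []
    for n in sorted(timings):
        elapsed = to_seconds(timings[n])
        speedup = base / elapsed
        rows.append((n, elapsed, speedup, speedup / n))
    return rows


for n, elapsed, speedup, efficiency in scaling_table(timings):
    print(f"{n} Drillbit(s): {elapsed:4d} s  "
          f"speedup {speedup:.2f}x  efficiency {efficiency:.0%}")

# Assumption: the Parquet figure is compared with the 3-Drillbit JSON run.
parquet_seconds = 1794.618
json_seconds = to_seconds(timings[3])
print(f"Parquet / JSON elapsed ratio: {parquet_seconds / json_seconds:.1f}x")
```

An efficiency close to 100% would mean each added Drillbit contributes nearly its full share, while a sharp drop would point at a shared bottleneck such as disk or network.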

Re: Parallelism / data locality in HDFS/MapRFS
