Re: Parallelism / data locality in HDFS/MapRFS

2016-03-08 Thread Eric Pederson
Hi everyone -

Thanks for your feedback.   Answers to your questions below.

The query plan JSON for the JSON query (where the performance flattened
out) is at https://gist.github.com/sourcedelica/e826178a7de7e059fa9a.
This was the plan with all three Drillbits running.

Some more detail on the structure of the JSON.  There are 8 objects at the
first level.  The biggest one has a little over 500 fields, the majority of
them being arrays of numbers or arrays of strings.  The next biggest group
contains around 300 flat objects.  The rest of the groups are fairly small,
20-40 fields.

I didn't do any casting in the CTAS from JSON to Parquet due to the sheer
number of fields. :)

Thanks,



-- Eric

On Mon, Mar 7, 2016 at 6:02 PM, Eric Pederson  wrote:

> We are using MapR M3 and are querying multiple JSON files - around 250
> files at 1.5 GB per file.   We have a small cluster of three machines
> running Drill 1.4.  The JSON is nested three to four levels deep, in a format
> like:
> {
>   "group1": {
>     "field1": 42,
>     "field2": [ "a", "b", "c" ],
>     ...
>   },
>   "group2": {
>     ...
>   },
>   ...
> }
>
> There are about 500 objects like this in each JSON file.
>
> I've been testing a set of queries that scan all of the data (we're
> investigating a partitioning strategy but haven't settled on one that will
> fit all of our queries which are fairly ad-hoc).   These full-scan queries
> typically take 1 minute, 20 seconds using the default settings.  If I limit
> the query to a single file the query takes a few seconds.
>
> I wanted to see how the number of Drillbits would impact the query time,
> to try to extrapolate to the number of servers needed to reach a
> performance number.   Here are the numbers that we saw:
> - 1 Drillbit: 3:45
> - 2 Drillbits: 1:56
> - 3 Drillbits: 1:20
>
> The performance flattens out between two and three Drillbits.   I was
> surprised to see that, given the single file query performance.  I was
> hoping to throw hardware at the performance a bit more.   Is that
> surprising to you?
>
> A somewhat related question.  Does Drill take advantage of HDFS locality?
> That is, will it send certain fragments to certain boxes because it knows
> those boxes have the data replicated locally?  Actually in our setup (3
> servers) that might be a moot point assuming every box has all blocks.  I'm
> not sure if MapRFS changes that.
>
> I also tried converting the JSON files to Parquet using CTAS.  The Parquet
> queries took much longer than the JSON queries.  Is that expected as well?
>
> Thanks,
>
>
>
> --
> Sent from Gmail Mobile
>
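
For reference, the Drillbit timings quoted above can be turned into speedup
and parallel-efficiency figures. This is only a sketch of the arithmetic; the
helper names are made up for illustration and are not part of Drill.

```python
# Timings reported in the thread, keyed by number of Drillbits ("m:ss").
timings = {1: "3:45", 2: "1:56", 3: "1:20"}


def to_seconds(mmss):
    """Convert an "m:ss" string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def scaling_table(timings):
    """Return (drillbits, seconds, speedup, efficiency) rows.

    Speedup is relative to the single-Drillbit run; efficiency is
    speedup divided by the number of Drillbits.
    """
    base = to_seconds(timings[1])
    rows = []
    for n in sorted(timings):
        elapsed = to_seconds(timings[n])
        speedup = base / elapsed
        rows.append((n, elapsed, round(speedup, 2), round(speedup / n, 2)))
    return rows


for n, elapsed, speedup, efficiency in scaling_table(timings):
    print(f"{n} Drillbit(s): {elapsed}s, speedup {speedup}x, efficiency {efficiency}")
```

On these figures the third Drillbit still runs at roughly 94% parallel
efficiency, so the absolute time saved per added node shrinks even though the
scaling itself stays close to linear.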


Re: Parallelism / data locality in HDFS/MapRFS

2016-03-08 Thread Jason Altekruse
Considering your description of the data, 1.5 GB per file with only 500
records in each gives you records of somewhere around 3 MB each. This in itself
doesn't necessarily cause an issue, but the structure of your example
record makes me think you may have many individual columns in the nested
structure, rather than the record size being dominated by long lists or
strings.

If this is the case, you might just be hitting the overhead of storing a very
complex structure in a columnar system. Parquet is columnar on disk, but
Drill also uses a columnar in-memory structure to store records during
processing. Even with a conservative estimate of 1 KB for each value stored
somewhere in your nested structure, that gives you around 3,000 columns. Due
to how Drill stores them, nested fields share most of the same status as
fields at the root of your schema. While this works, it might require some
tuning work to make it efficient.
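
The back-of-the-envelope estimate above can be reproduced directly from the
figures given in the thread. This is a rough sketch: decimal units are
assumed, and the 1 KB per value is the estimate used in this message, not a
measurement.

```python
# Inputs taken from the thread.
FILE_SIZE_BYTES = 1.5e9    # 1.5 GB per file (decimal units assumed)
RECORDS_PER_FILE = 500     # "about 500 objects" in each JSON file
BYTES_PER_VALUE = 1_000    # assumed 1 KB per stored value


def estimate(file_size, records, bytes_per_value):
    """Return (bytes per record, values per record) for the given inputs."""
    record_bytes = file_size / records
    values_per_record = record_bytes / bytes_per_value
    return record_bytes, values_per_record


record_bytes, values_per_record = estimate(
    FILE_SIZE_BYTES, RECORDS_PER_FILE, BYTES_PER_VALUE
)
print(f"~{record_bytes / 1e6:.0f} MB per record")
print(f"~{values_per_record:,.0f} values per record")
```

Each distinct value path can end up as its own column in a columnar layout,
which is why the values-per-record figure matters more here than the raw
record size.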

Can you give us an idea of how many groups and fields are in a typical
record? How many of the fields are lists? Are the lists that appear in your
data much larger than in your example here?
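
One way to gather these numbers is to walk a single record and count fields
per top-level group. The sketch below assumes one record has already been
extracted as a JSON object; the sample record is hypothetical and only mimics
the shape of the example in the original post.

```python
import json


def _walk(node, stats):
    """Accumulate counts for one subtree of a record."""
    if isinstance(node, dict):
        for child in node.values():
            _walk(child, stats)
    elif isinstance(node, list):
        # A list counts as one list-valued field; its elements are not
        # descended into, so lists of objects are not broken down further.
        stats["list_fields"] += 1
        stats["max_list_len"] = max(stats["max_list_len"], len(node))
    else:
        stats["leaf_fields"] += 1


def summarize(record):
    """Count scalar fields and list fields under each top-level group."""
    summary = {}
    for group, body in record.items():
        stats = {"leaf_fields": 0, "list_fields": 0, "max_list_len": 0}
        _walk(body, stats)
        summary[group] = stats
    return summary


# Hypothetical record, shaped like the example in the original post.
sample = json.loads(
    '{"group1": {"field1": 42, "field2": ["a", "b", "c"]},'
    ' "group2": {"field3": {"field4": 1.5}}}'
)
for group, stats in summarize(sample).items():
    print(group, stats)
```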


On Tue, Mar 8, 2016 at 5:51 AM, John Omernik  wrote:

> The slowness you saw with Parquet can be heavily dependent on how your
> CTAS was written.  Did you cast to types as needed? Drill could be making
> some fast and loose assumptions about your data, and thus typing
> incorrectly.  When I was in a similar scenario, I used some stronger typing
> and saw quite a bit of improvement with the Parquet files. This can be
> difficult if everything is nested, though; your mileage may vary.
>
> As to Jacques' comment, the profile.json is important: if these are large
> files and the planner is going through lots of them, planning may make up
> the bulk of your query time.   A partitioning strategy, should you be able
> to find one, can help here, but there are still some issues that can crop
> up.  I think the planning inefficiencies are being worked on.
>
> John
>
> On Mon, Mar 7, 2016 at 9:58 PM, Ted Dunning  wrote:
>
> > > On Mon, Mar 7, 2016 at 3:02 PM, Eric Pederson wrote:
> > > I also tried converting the JSON files to Parquet using CTAS.  The
> > > Parquet queries took much longer than the JSON queries.  Is that
> > > expected as well?
> >
> > No. That is not expected.
> >
>


Re: Parallelism / data locality in HDFS/MapRFS

2016-03-08 Thread John Omernik
The slowness you saw with Parquet can be heavily dependent on how your
CTAS was written.  Did you cast to types as needed? Drill could be making
some fast and loose assumptions about your data, and thus typing
incorrectly.  When I was in a similar scenario, I used some stronger typing
and saw quite a bit of improvement with the Parquet files. This can be
difficult if everything is nested, though; your mileage may vary.
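
With many fields, writing the casts by hand gets tedious, so one option is to
generate the CTAS text from a mapping of fields to types. The sketch below is
only an illustration: the workspace, table name, file path, field paths and
types are all hypothetical, and it covers scalar fields only (arrays would
need separate handling). Whether explicit casts help for this data set has not
been verified.

```python
def build_ctas(target, source, columns):
    """Build a CTAS statement that casts each listed column explicitly.

    `columns` maps an output column name to a (field_path, sql_type) pair,
    where field_path is a dotted path into the nested JSON record.
    """
    select_list = ",\n".join(
        f"  CAST(t.{path} AS {sql_type}) AS `{name}`"
        for name, (path, sql_type) in columns.items()
    )
    return f"CREATE TABLE {target} AS\nSELECT\n{select_list}\nFROM {source} t"


# Hypothetical field paths and types, for illustration only.
columns = {
    "field1": ("group1.field1", "INT"),
    "label": ("group2.label", "VARCHAR(64)"),
}

sql = build_ctas("dfs.tmp.`records_parquet`", "dfs.`/data/records.json`", columns)
print(sql)
```

Keeping the mapping as data means the same table can be regenerated with
different types while comparing query times.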

As to Jacques' comment, the profile.json is important: if these are large
files and the planner is going through lots of them, planning may make up
the bulk of your query time.   A partitioning strategy, should you be able
to find one, can help here, but there are still some issues that can crop
up.  I think the planning inefficiencies are being worked on.

John

On Mon, Mar 7, 2016 at 9:58 PM, Ted Dunning  wrote:

> > On Mon, Mar 7, 2016 at 3:02 PM, Eric Pederson  wrote:
> > I also tried converting the JSON files to Parquet using CTAS.  The
> > Parquet queries took much longer than the JSON queries.  Is that
> > expected as well?
>
> No. That is not expected.
>


Re: Parallelism / data locality in HDFS/MapRFS

2016-03-07 Thread Ted Dunning
> On Mon, Mar 7, 2016 at 3:02 PM, Eric Pederson  wrote:
> I also tried converting the JSON files to Parquet using CTAS.  The Parquet
> queries took much longer than the JSON queries.  Is that expected as well?

No. That is not expected.


Re: Parallelism / data locality in HDFS/MapRFS

2016-03-07 Thread Jacques Nadeau
The flattening is surprising unless we're spending a long time in query
setup. (This is shown by looking at the query start time for the 0-0
fragment in the query profile screen.) If you share the profile json files,
we can also take a look and see what is up.

thanks,
Jacques


--
Jacques Nadeau
CTO and Co-Founder, Dremio



Re: Parallelism / data locality in HDFS/MapRFS

2016-03-07 Thread Andries Engelbrecht
You may want to compare the query plans for the three scenarios to see which
operators the time is spent on and how well they are parallelized.

The expectation would be that Parquet will perform better than JSON.

--Andries

