Re: Parallelism / data locality in HDFS/MapRFS
Hi everyone,

Thanks for your feedback. Answers to your questions below.

The query plan JSON for the JSON query (where the performance flattened out) is at https://gist.github.com/sourcedelica/e826178a7de7e059fa9a. This was the plan with all three Drillbits running.

Some more detail on the structure of the JSON: there are 8 objects at the first level. The biggest one has a little over 500 fields, the majority of them being arrays of numbers or arrays of strings. The next biggest group contains around 300 flat objects. The rest of the groups are fairly small, 20-40 fields.

I didn't do any casting in the CTAS from JSON to Parquet due to the sheer number of fields. :)

Thanks,

-- Eric

On Mon, Mar 7, 2016 at 6:02 PM, Eric Pederson wrote:
> We are using MapR M3 and are querying multiple JSON files - around 250
> files at 1.5 GB per file. We have a small cluster of three machines
> running Drill 1.4. The JSON is nested three-four levels deep, in a format
> like:
>
> {
>   "group1": {
>     "field1": 42,
>     "field2": ["a", "b", "c"],
>     ...
>   },
>   "group2": {
>     ...
>   },
>   ...
> }
>
> There are about 500 objects like this in each JSON file.
>
> I've been testing a set of queries that scan all of the data (we're
> investigating a partitioning strategy but haven't settled on one that will
> fit all of our queries, which are fairly ad hoc). These full-scan queries
> typically take 1 minute, 20 seconds using the default settings. If I limit
> the query to a single file, the query takes a few seconds.
>
> I wanted to see how the number of Drillbits would impact the query time,
> to try to extrapolate to the number of servers needed to reach a
> performance number. Here are the numbers that we saw:
> - 1 Drillbit: 3:45
> - 2 Drillbits: 1:56
> - 3 Drillbits: 1:20
>
> The performance flattens out between two and three Drillbits. I was
> surprised to see that, given the single-file query performance. I was
> hoping to throw hardware at the performance a bit more. Is that
> surprising to you?
>
> A somewhat related question: does Drill take advantage of HDFS locality?
> That is, will it send certain fragments to certain boxes because it knows
> those boxes have the data replicated locally? Actually, in our setup (3
> servers) that might be a moot point, assuming every box has all blocks.
> I'm not sure if MapRFS changes that.
>
> I also tried converting the JSON files to Parquet using CTAS. The Parquet
> queries took much longer than the JSON queries. Is that expected as well?
>
> Thanks,
>
> --
> Sent from Gmail Mobile
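The timings quoted above can be reduced to speedup and parallel-efficiency numbers; a quick sketch with the values hard-coded from the thread:

```python
# Wall-clock times from the thread, in seconds, keyed by Drillbit count.
times = {1: 3 * 60 + 45, 2: 60 + 56, 3: 60 + 20}  # 3:45, 1:56, 1:20

# Speedup relative to a single Drillbit, and parallel efficiency (speedup / N).
speedup = {n: times[1] / t for n, t in times.items()}
efficiency = {n: s / n for n, s in speedup.items()}

for n in sorted(times):
    print(f"{n} Drillbit(s): {times[n]:>3}s  "
          f"speedup {speedup[n]:.2f}x  efficiency {efficiency[n]:.0%}")
```

Whether 2 to 3 Drillbits looks like flattening depends on the lens: the absolute saving shrinks (109 s for the first added node vs. 36 s for the second) even though relative efficiency stays above 90%.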
Re: Parallelism / data locality in HDFS/MapRFS
Considering your description of the data, 1.5 GB per file with only 500 records in each gives you records of roughly 3 MB apiece. This in itself doesn't necessarily cause an issue, but the structure of your example record makes me think you may have many individual columns in the nested structure, rather than the record size being dominated by long lists or strings. If this is the case, you might just be hitting the overhead of storing a very complex structure in a columnar system.

Parquet is columnar on disk, but Drill also uses a columnar in-memory structure to store records during processing. Even with a conservative estimate of 1 KB for each value stored somewhere in your nested structure, that gives you around 3,000 columns. Due to how Drill stores them, nested fields share most of the same status as fields at the root of your schema. While this works, it might require some tuning work to make it efficient.

Can you give us an idea of how many groups and fields are in a typical record? How many of the fields are lists? Are the lists that appear in your data much larger than in your example here?

On Tue, Mar 8, 2016 at 5:51 AM, John Omernik wrote:
> [quoted messages trimmed]
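The sizing argument above can be checked with quick arithmetic; the 1 KB-per-value figure is the same assumption made in the message:

```python
# Figures from the thread: 1.5 GB files, ~500 top-level records per file.
file_bytes = 1.5 * 1024**3          # treating GB as GiB; close enough here
records_per_file = 500
bytes_per_record = file_bytes / records_per_file   # ~3 MB per record

# Assumption carried over from the message: ~1 KB per stored value.
assumed_value_bytes = 1024
values_per_record = bytes_per_record / assumed_value_bytes  # ~3,000 values

print(f"~{bytes_per_record / 2**20:.1f} MB per record, "
      f"~{values_per_record:,.0f} values per record")
```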
Re: Parallelism / data locality in HDFS/MapRFS
The slowness you saw with Parquet can be heavily dependent on how your CTAS was written. Did you cast to types as needed? Drill could be making some fast and loose assumptions about your data, and thus typing incorrectly. When I was in a similar scenario, I used some stronger typing and saw quite a bit of improvement with the Parquet files. This can be difficult if everything is nested, though; your mileage may vary.

As to Jacques' comment, the profile.json is important. If they are large files and the planner is going through lots of files, that may make up the bulk of your query time. A partitioning strategy, should you be able to find one, can help here, but there are still some issues that can crop up. I think the planning inefficiencies are being worked on.

John

On Mon, Mar 7, 2016 at 9:58 PM, Ted Dunning wrote:
> [quoted message trimmed]
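Eric mentioned skipping casts because of the sheer number of fields. One way around that is to generate the cast list mechanically from a sample record. A rough sketch, with made-up table names, that handles only top-level scalar fields and leaves maps and lists to Drill's own type inference:

```python
import json

# Hypothetical helper: given one sample JSON record, emit a Drill-style CTAS
# statement that CASTs each top-level scalar field explicitly, as suggested
# above. Nested maps and arrays are passed through uncast.
def ctas_with_casts(sample: dict, src_table: str, dst_table: str) -> str:
    type_map = {int: "BIGINT", float: "DOUBLE", str: "VARCHAR", bool: "BOOLEAN"}
    cols = []
    for name, value in sample.items():
        sql_type = type_map.get(type(value))
        if sql_type:
            cols.append(f"CAST(`{name}` AS {sql_type}) AS `{name}`")
        else:
            cols.append(f"`{name}`")  # maps/lists: leave to Drill's inference
    return (f"CREATE TABLE {dst_table} AS SELECT\n  "
            + ",\n  ".join(cols)
            + f"\nFROM {src_table}")

sample = json.loads('{"field1": 42, "field2": ["a", "b", "c"], "name": "x"}')
print(ctas_with_casts(sample, "dfs.json.`data.json`", "dfs.pq.`data_parquet`"))
```

Extending this to the ~500 nested fields in the real data would mean walking the record recursively, but the shape of the idea is the same.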
Re: Parallelism / data locality in HDFS/MapRFS
On Mon, Mar 7, 2016 at 3:02 PM, Eric Pederson wrote:
> I also tried converting the JSON files to Parquet using CTAS. The Parquet
> queries took much longer than the JSON queries. Is that expected as well?

No. That is not expected.
Re: Parallelism / data locality in HDFS/MapRFS
The flattening is surprising unless we're spending a long time in query setup. (This is shown by looking at the query start time for the 0-0 fragment in the query profile screen.) If you share the profile JSON files, we can also take a look and see what is up.

thanks,
Jacques

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Mon, Mar 7, 2016 at 3:02 PM, Eric Pederson wrote:
> [quoted message trimmed]
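For digging into a profile.json without hard-coding its layout (which can shift between Drill versions), one low-effort starting point is a schema-agnostic walk that totals every numeric field whose name ends in "Nanos". The profile snippet below is invented purely for illustration:

```python
import json

# Recursively walk arbitrary JSON and total every numeric "*Nanos" timing
# field, grouped by field name. No assumptions about the profile's layout.
def sum_nanos(node, totals=None):
    if totals is None:
        totals = {}
    if isinstance(node, dict):
        for key, value in node.items():
            if key.endswith("Nanos") and isinstance(value, (int, float)):
                totals[key] = totals.get(key, 0) + value
            else:
                sum_nanos(value, totals)
    elif isinstance(node, list):
        for item in node:
            sum_nanos(item, totals)
    return totals

# Tiny made-up stand-in for a real profile:
profile = json.loads('{"fragments": [{"processNanos": 5, "waitNanos": 2},'
                     ' {"processNanos": 7}]}')
print(sum_nanos(profile))  # {'processNanos': 12, 'waitNanos': 2}
```

Pointing this at a real profile would at least show whether the time sits in setup/wait fields or in operator processing before reading the plan in detail.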
Re: Parallelism / data locality in HDFS/MapRFS
You may want to look at the query plan across the 3 scenarios to see which operators the time is spent on and how well they are parallelized. The expectation would be that Parquet will perform better than JSON.

--Andries

> On Mar 7, 2016, at 3:02 PM, Eric Pederson wrote:
> [quoted message trimmed]