Here is the query plan JSON for the same query against the Parquet file that I CTASed: https://gist.github.com/sourcedelica/b05eeaf5df9e63b29654. It took 1794.618 seconds.
-- Eric

On Tue, Mar 8, 2016 at 1:44 PM, Eric Pederson <eric...@gmail.com> wrote:

> Hi everyone -
>
> Thanks for your feedback. Answers to your questions below.
>
> The query plan JSON for the JSON query (where the performance flattened
> out) is at https://gist.github.com/sourcedelica/e826178a7de7e059fa9a.
> This was the plan with all three Drillbits running.
>
> Some more detail on the structure of the JSON. There are 8 objects at the
> first level. The biggest one has a little over 500 fields, the majority of
> them being arrays of numbers or arrays of strings. The next biggest group
> contains around 300 flat objects. The rest of the groups are fairly small,
> 20-40 fields.
>
> I didn't do any casting in the CTAS from JSON to Parquet due to the sheer
> number of fields. :)
>
> Thanks,
>
> -- Eric
>
> On Mon, Mar 7, 2016 at 6:02 PM, Eric Pederson <eric...@gmail.com> wrote:
>
>> We are using MapR M3 and are querying multiple JSON files - around 250
>> files at 1.5 GB per file. We have a small cluster of three machines
>> running Drill 1.4. The JSON is nested three to four levels deep, in a
>> format like:
>>
>> {
>>   "group1": {
>>     "field1": 42,
>>     "field2": [ "a", "b", "c" ],
>>     ...
>>   },
>>   "group2": {
>>     ....
>>   },
>>   ...
>> }
>>
>> There are about 500 objects like this in each JSON file.
>>
>> I've been testing a set of queries that scan all of the data (we're
>> investigating a partitioning strategy but haven't settled on one that will
>> fit all of our queries, which are fairly ad hoc). These full-scan queries
>> typically take 1 minute, 20 seconds using the default settings. If I limit
>> the query to a single file, the query takes a few seconds.
>>
>> I wanted to see how the number of Drillbits would impact the query time,
>> to try to extrapolate to the number of servers needed to reach a
>> performance number.
>> Here are the numbers that we saw:
>>
>> - 1 Drillbit: 3:45
>> - 2 Drillbits: 1:56
>> - 3 Drillbits: 1:20
>>
>> The performance flattens out between two and three Drillbits. I was
>> surprised to see that, given the single file query performance. I was
>> hoping to throw hardware at the performance a bit more. Is that
>> surprising to you?
>>
>> A somewhat related question. Does Drill take advantage of HDFS
>> locality? That is, will it send certain fragments to certain boxes because
>> it knows those boxes have the data replicated locally? Actually in our
>> setup (3 servers) that might be a moot point assuming every box has all
>> blocks. I'm not sure if MapRFS changes that.
>>
>> I also tried converting the JSON files to Parquet using CTAS. The
>> Parquet queries took much longer than the JSON queries. Is that expected
>> as well?
>>
>> Thanks,
>>
>> --
>> Sent from Gmail Mobile
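
For anyone who wants to put numbers on the scaling question, the Drillbit timings quoted above can be converted into speedup and per-Drillbit efficiency figures. This is only a rough sketch: it assumes the reported times are wall-clock minutes:seconds, and the names REPORTED, to_seconds and scaling_table are made up for illustration, not anything from Drill.

```python
# Timings reported in the thread, keyed by number of Drillbits.
# Assumption: each value is a wall-clock "minutes:seconds" string.
REPORTED = {1: "3:45", 2: "1:56", 3: "1:20"}


def to_seconds(mmss: str) -> int:
    """Convert a 'minutes:seconds' string to whole seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def scaling_table(timings):
    """Return (drillbits, seconds, speedup, efficiency) rows.

    Speedup is relative to the single-Drillbit run; efficiency is
    speedup divided by the number of Drillbits (1.0 = perfectly linear).
    """
    baseline = to_seconds(timings[1])
    rows = []
    for n in sorted(timings):
        elapsed = to_seconds(timings[n])
        speedup = baseline / elapsed
        rows.append((n, elapsed, speedup, speedup / n))
    return rows


if __name__ == "__main__":
    for n, elapsed, speedup, efficiency in scaling_table(REPORTED):
        print(f"{n} Drillbit(s): {elapsed:3d} s, "
              f"speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

With the three timings from the thread this works out to roughly 1.94x at two Drillbits and 2.81x at three, relative to the single-Drillbit run.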