Yousuf, Which version of drill are you running? Can you share the profile of the query?
Thanks, Dechang On Thu, Dec 15, 2016 at 3:27 AM, yousuf <[email protected]> wrote: > Hello experts > > As a POC project, I've built a drill cluster on 5 VMs , each with the > following specs > > 32 GB ram > > 1 TB storage > > 16 Cores > > Zookeeper quorum & apache drill installed on all 5 nodes. My storage > engine is mongo which has 5 million docs. (Our daily collection is close to > 2.5 million tweets) > > *Problem* > > My aggregation queries are slow, not fit for my realtime dashboard. sample > query as follows took 12.+ secs, > > > SELECT count(*) as cnt, actor_preferred_username from tweets where > posted_time >= '2016-08-01T00.00.00.000Z' and posted_time > <='2016-08-10T00.00.00.000Z' group by actor_preferred_username order by > cnt desc limit 10; > > Pls. note that, > > *In Drill* > > 1) I've changed the conf/drill-env.sh on each node with following > > |export DRILL_HEAP=${DRILL_HEAP:-"8G”} export > DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"20G"}| > > 2) changed few setting based on the docs <https://drill.apache.org/docs > /sort-based-and-hash-based-memory-constrained-operators/> > > *In Mongo* > > data is sharded on 3 servers on shardkey tweet_id which distributes data > evenly on all shards and created compound index on fields used by the above > drill query. > > > sample data > > { > "_id" : ObjectId("58524d507e08dae4c0377b9e"), > "rule_list" : [ > "A", > "B", > "C", > "D13" > ], > "actor_friends_count" : 40, > "klout_score" : 28, > "actor_favorites_count" : 1697, > "actor_preferred_username" : "_20_xxxx", > "sentiment" : "neu", > "tweet_id" : "tag:search.twitter.com,2005:000xxxxx", > "object_actor_followers_count" : 573, > "actor_posted_time" : "2016-06-24T00:37:54.000Z", > "actor_id" : "id:twitter.com:xxxxxxxxxx", > "actor_display_name" : "xxxxxx", > "retweet_count" : 57, > "hashtag_list" : [ > "c", > "d", > "E", > "f", > "VCVC", > "XXCXCXC", > "RRRR" > ], > "body" : "some tweet blah blah", > "actor_followers_count" : 21, > "actor_status_count" : 1756, > "verb" : "share", > "posted_time" : "2016-08-01T23:47:43.000Z", > "object_actor_status_count" : 2796, > "lang" : "ar", > "object_actor_preferred_username" : "xxxxxxx", > "original_tweet_id" : "tag:search.twitter.com,2005:xxxxxxxxxxxx", > "gender" : "male", > "object_actor_id" : "id:twitter.com:xxxxxxxxxxxxx", > "favorites_count" : 0, > "object_posted_time" : "2016-08-01T22:54:22.000Z", > "object_actor_friends_count" : 69, > "generator_display_name" : "Twitter Web Client", > "object_actor_display_name" : "xxxxxxxxxxxx", > "actor_listed_count" : 0 > } > > > > *Questions* > > 1) How to improve aggregation query performance? > > 2) Do I also need to do something in mongodb to enhance performance? (I > mean optimize source) > > 3) Does Apache drill capable of handling/aggregating billions of documents > in real-time? > > Your early response is highly appreciated! > > Thank you & Kind Regards > > Yousuf > > > >
