Hello,

I've deployed an HDFS cluster and installed Apache Drill on top of it, but I've found that Drill takes quite a long time to run some queries on large JSON files, such as the full Reddit submission corpus (260 GB). For instance, this query:

SELECT COUNT(*) FROM dfs.reddit.`RS_full_corpus.json` WHERE selftext <> '' AND selftext <> '[deleted]';

took about one hour to run. The other thing I've noticed is that none of my queries get processed in a "fragmented" way: query execution is always handled by the Drillbit acting as the foreman.

In the attachment you can find the topology that I'm using. Any feedback on this would be greatly appreciated.

Thank you very much for your kind attention.

Best regards,

--
Leandro Ordonez-Ante
Department of Information Technology
Internet Based Communication Networks and Services (IBCN)
Ghent University - iMinds
Technologiepark Zwijnaarde 15, B-9052 Gent, Belgium
E: [email protected], [email protected]
W: www.ibcn.intec.UGent.be
