Thank you, Jim,
The attachment was this image: https://i.imgsafe.org/7e98f92.png
So, is it expected for the query I mentioned earlier to take that long?
On 05/17/2016 03:41 PM, Jim Scott wrote:
The mailing lists do not support attachments. You can provide a link to a
git repo or something like that though.
You might want to alter your query to be something like select
count(FIELDX) from....
On Tue, May 17, 2016 at 8:36 AM, Leandro Ordonez <
[email protected]> wrote:
Hello,
I've deployed an HDFS cluster and installed Apache Drill on top of it, but
I've found that it takes quite a long time for Drill to run some queries on
large JSON files, such as the full Reddit submission corpus (260 GB). For
instance, this query:
SELECT COUNT(*) FROM dfs.reddit.`RS_full_corpus.json`
WHERE selftext <> '' AND selftext <> '[deleted]';
took about one hour to run. The other thing I've noticed is that none of my
queries get processed in a "fragmented" way: query execution is always
handled by the Drillbit acting as the foreman.
In the attachment you can find the topology that I'm using. Any feedback
on this would be greatly appreciated.
Thank you very much for your kind attention.
Best regards,
--
Leandro Ordonez-Ante
Department of Information Technology
Internet Based Communication Networks and Services (IBCN)
Ghent University - iMinds
Technologiepark Zwijnaarde 15, B-9052 Gent, Belgium
E: [email protected], [email protected]
W: www.ibcn.intec.UGent.be