That's great, Chris! I'll try with Parquet then. Thank you very much for your help!

Best,

Leandro


On 05/17/2016 03:52 PM, Christopher Matta wrote:
Leandro,
I ran into a similar situation while building this demo:
https://github.com/cjmatta/DrillPandasReddit/blob/master/Reddit%20Drill%20Pandas.ipynb

I don't think Drill splits single JSON files the way it does for delimited
one-record-per-line files, so that would explain why you're seeing
single-threaded processing.

If you look at the notebook, you'll see I ended up extracting the data I
was concerned with by creating Parquet files using a CTAS statement. This
could be helpful for you because Parquet is significantly smaller than JSON
(I've observed a 10x storage savings) and can also be split by Drill.
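
A minimal sketch of that CTAS approach, for illustration only. It assumes
a writable workspace named dfs.tmp and a hypothetical table name,
reddit_selftext; only the selftext column is carried over because that is
the only field the original query touches. The CTAS itself still has to
read the JSON file once, so it will probably not be fast either.

```sql
-- Write CTAS output as Parquet (this is Drill's default output format,
-- set explicitly here for clarity).
ALTER SESSION SET `store.format` = 'parquet';

-- Assumption: dfs.tmp is a writable workspace and reddit_selftext is a
-- hypothetical table name chosen for this example.
CREATE TABLE dfs.tmp.`reddit_selftext` AS
SELECT selftext
FROM dfs.reddit.`RS_full_corpus.json`;

-- The original count, now run against the Parquet copy.
SELECT COUNT(*)
FROM dfs.tmp.`reddit_selftext`
WHERE selftext <> '' AND selftext <> '[deleted]';
```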

--
Chris Matta
215-701-3146
[email protected]

On Tue, May 17, 2016 at 9:41 AM, Jim Scott <[email protected]> wrote:

The mailing lists do not support attachments. You can provide a link to a
git repo or something like that though.

You might want to alter your query to be something like select
count(FIELDX) from....
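
As a rough sketch of that suggestion, with selftext standing in for FIELDX
(an assumption, since it is the only field named in the original query):

```sql
-- COUNT(column) counts only rows where the column is not NULL.
-- The WHERE clause already drops NULL selftext values, so the result
-- should match COUNT(*) for this query.
SELECT COUNT(selftext)
FROM dfs.reddit.`RS_full_corpus.json`
WHERE selftext <> '' AND selftext <> '[deleted]';
```

Whether this is noticeably faster on a single large JSON file is not
something the thread confirms.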

On Tue, May 17, 2016 at 8:36 AM, Leandro Ordonez <[email protected]> wrote:

Hello,

I've deployed an HDFS cluster and installed Apache Drill on top of it, but
I've found that Drill takes quite long to run some queries on large JSON
files, such as the full Reddit submission corpus (260GB). For instance,
this query:

SELECT COUNT(*) FROM dfs.reddit.`RS_full_corpus.json`
WHERE selftext <> '' AND selftext <> '[deleted]';

took about one hour to run. The other thing I've noticed is that none of my
queries get processed in a "fragmented" way: query execution is always
handled solely by the drillbit acting as the foreman.

In the attachment you can find the topology that I'm using. Any feedback
on this would be greatly appreciated.

Thank you very much for your kind attention.

Best regards,

--
Leandro Ordonez-Ante
Department of Information Technology
Internet Based Communication Networks and Services (IBCN)
Ghent University - iMinds
Technologiepark Zwijnaarde 15, B-9052 Gent, Belgium
E: [email protected], [email protected]
W: www.ibcn.intec.UGent.be



--
*Jim Scott*
Director, Enterprise Strategy & Architecture
+1 (347) 746-9281
@kingmesal <https://twitter.com/kingmesal>


