Re: Performance tuning

Christopher Matta Tue, 17 May 2016 06:53:52 -0700

Leandro,
I ran into a similar situation while building this demo:
https://github.com/cjmatta/DrillPandasReddit/blob/master/Reddit%20Drill%20Pandas.ipynb


I don't think Drill splits single JSON files the way it does for delimited
one-record-per-line files, so that would explain why you're seeing
single-threaded processing.

If you look at the notebook, you'll see I ended up extracting the data I was
concerned with by creating Parquet files using a CTAS statement. This could
be helpful for you because Parquet is significantly smaller than JSON (I've
observed a 10x storage savings), and Drill can split Parquet files when
reading them.
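
A minimal sketch of what such a CTAS could look like. The dfs.tmp target
workspace, the table name reddit_selftext, and the id and title columns are
assumptions for illustration; only selftext and the source path come from the
query quoted below. The target workspace has to be writable.

```sql
-- Make sure CTAS output is written as Parquet.
ALTER SESSION SET `store.format` = 'parquet';

-- Hypothetical target table; dfs.tmp is assumed to be a writable workspace.
CREATE TABLE dfs.tmp.`reddit_selftext` AS
SELECT id, title, selftext          -- id and title are assumed column names
FROM dfs.reddit.`RS_full_corpus.json`
WHERE selftext <> '' AND selftext <> '[deleted]';

-- Later queries read the Parquet table instead of the JSON file.
SELECT COUNT(*) FROM dfs.tmp.`reddit_selftext`;
```

The CTAS still has to read the JSON file once, so that first pass will
probably be slow; the benefit shows up in the queries you run afterwards.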

--
Chris Matta
215-701-3146
[email protected]

On Tue, May 17, 2016 at 9:41 AM, Jim Scott <[email protected]> wrote:

> The mailing lists do not support attachments. You can provide a link to a
> git repo or something like that though.
>
> You might want to alter your query to be something like select
> count(FIELDX) from....
>
> On Tue, May 17, 2016 at 8:36 AM, Leandro Ordonez <
> [email protected]> wrote:
>
> > Hello,
> >
> > I've deployed an HDFS cluster and installed Apache Drill on top of it, but
> > found that it takes quite a long time for Drill to run some queries on
> > large JSON files, such as the full Reddit submission corpus (260GB). For
> > instance, this query:
> >
> > SELECT COUNT(*) from dfs.reddit.`RS_full_corpus.json`
> > WHERE selftext <> '' and selftext <> '[deleted]';
> >
> > took about one hour to run. The other thing I've noticed is that none of
> > my queries get processed in a "fragmented" way: query execution is always
> > handled entirely by the Drillbit acting as the Foreman.
> >
> > In the attachment you can find the topology that I'm using. Any feedback
> > on this would be greatly appreciated.
> >
> > Thank you very much for your kind attention.
> >
> > Best regards,
> >
> > --
> > Leandro Ordonez-Ante
> > Department of Information Technology
> > Internet Based Communication Networks and Services (IBCN)
> > Ghent University - iMinds
> > Technologiepark Zwijnaarde 15, B-9052 Gent, Belgium
> > E: [email protected], [email protected]
> > W: www.ibcn.intec.UGent.be
> >
> >
>
>
> --
> *Jim Scott*
> Director, Enterprise Strategy & Architecture
> +1 (347) 746-9281
> @kingmesal <https://twitter.com/kingmesal>
