Leandro, I ran into a similar situation while building this demo: https://github.com/cjmatta/DrillPandasReddit/blob/master/Reddit%20Drill%20Pandas.ipynb
I don't think Drill splits a single JSON file the way it does delimited,
one-record-per-line files, which would explain why you're seeing
single-threaded processing. If you look at the notebook, you'll see I ended
up extracting the data I cared about into Parquet files using a CTAS
statement. That could help you as well: Parquet is significantly smaller
than JSON (I've observed 10x storage savings), and Drill can split it.

--
Chris Matta
215-701-3146
[email protected]

On Tue, May 17, 2016 at 9:41 AM, Jim Scott <[email protected]> wrote:

> The mailing lists do not support attachments. You can provide a link to a
> git repo or something like that though.
>
> You might want to alter your query to be something like select
> count(FIELDX) from....
>
> On Tue, May 17, 2016 at 8:36 AM, Leandro Ordonez <[email protected]> wrote:
>
> > Hello,
> >
> > I've deployed an HDFS cluster and installed Apache Drill on top of it,
> > but found in my case that it takes quite long for Drill to run some
> > queries on large JSON files, such as the full Reddit submission corpus
> > (260GB). For instance, this query:
> >
> > SELECT COUNT(*) from dfs.reddit.`RS_full_corpus.json`
> > WHERE selftext <> '' and selftext <> '[deleted]';
> >
> > took about one hour to run. The other thing I've noticed is that none
> > of my queries get processed in a "fragmented" way: query execution is
> > always handled by the Drillbit acting as the foreman.
> >
> > In the attachment you can find the topology that I'm using. Any feedback
> > on this would be greatly appreciated.
> >
> > Thank you very much for your kind attention.
> >
> > Best regards,
> >
> > --
> > Leandro Ordonez-Ante
> > Department of Information Technology
> > Internet Based Communication Networks and Services (IBCN)
> > Ghent University - iMinds
> > Technologiepark Zwijnaarde 15, B-9052 Gent, Belgium
> > E: [email protected], [email protected]
> > W: www.ibcn.intec.UGent.be
>
> --
> Jim Scott
> Director, Enterprise Strategy & Architecture
> +1 (347) 746-9281
> @kingmesal <https://twitter.com/kingmesal>
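
For reference, the CTAS-to-Parquet approach suggested at the top of this thread might look roughly like the sketch below. It is not taken from the linked notebook. The target workspace `dfs.tmp` and the table name `reddit_selftext` are assumptions for illustration (any writable workspace should do), and only the `selftext` column from the original query is carried over.

```sql
-- Write CTAS output as Parquet (the usual default, set explicitly here).
ALTER SESSION SET `store.format` = 'parquet';

-- One-time conversion: this still reads the single JSON file, so it will
-- likely be slow, but it only has to run once.
-- `dfs.tmp` and `reddit_selftext` are hypothetical names.
CREATE TABLE dfs.tmp.`reddit_selftext` AS
SELECT selftext
FROM dfs.reddit.`RS_full_corpus.json`;

-- Later queries run against the Parquet copy instead of the JSON file.
SELECT COUNT(*)
FROM dfs.tmp.`reddit_selftext`
WHERE selftext <> '' AND selftext <> '[deleted]';
```

Selecting only the columns you actually query keeps the Parquet output small; add more columns to the SELECT list if other queries need them.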
