JSON files are not splittable. There will be exactly one thread reading the file, regardless of how big it is.
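
If more parallelism is wanted, the suggestion further down the thread is to use several input files. A minimal sketch of preparing such inputs follows. It assumes the source holds one JSON record per line (not confirmed in this thread), and the function name, output naming scheme, and part count are all hypothetical. It only produces the files; whether the reads are then spread across nodes is up to the planner.

```python
from pathlib import Path


def split_json_lines(src, out_dir, parts):
    """Distribute the lines of a line-delimited JSON file round-robin
    over `parts` output files and return the output paths.

    Assumption: every line of `src` is one complete JSON record, so
    splitting on line boundaries never cuts a record in half.
    """
    if parts < 1:
        raise ValueError("parts must be at least 1")

    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Hypothetical naming scheme: <stem>_part000.json, <stem>_part001.json, ...
    out_paths = [out_dir / f"{src.stem}_part{i:03d}.json" for i in range(parts)]
    outputs = [p.open("w", encoding="utf-8", newline="") for p in out_paths]
    try:
        with src.open("r", encoding="utf-8", newline="") as reader:
            for n, line in enumerate(reader):
                # Line n goes to file n mod parts, keeping the parts balanced.
                outputs[n % parts].write(line)
    finally:
        for handle in outputs:
            handle.close()
    return out_paths
```

Round-robin keeps the parts within one record of each other in size without a first pass to count lines.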
On Fri, Jan 16, 2015 at 4:15 PM, George Chow <[email protected]> wrote:

> It should be possible to compare your HDFS block size with your file size
> to determine how many blocks (and hence nodes) the file spans.
>
> Is my understanding sound?
>
> George
>
>
> On Fri, Jan 16, 2015 at 11:52 AM, Ted Dunning <[email protected]>
> wrote:
>
> > If you do want to have more parallelism, use several input files.
> >
> >
> > On Fri, Jan 16, 2015 at 9:13 AM, Jason Altekruse <[email protected]>
> > wrote:
> >
> > > I do not think we currently consider JSON files splittable. If we do
> > > treat them as such, it would depend on the file size and the read
> > > locality available on the nodes. Especially with a select * (or a
> > > count(*)) query there is nothing to parallelize except for the read
> > > operation and a simple aggregation. Spreading a small read throughout
> > > the cluster would only guarantee that some of the reads would happen
> > > over the wire, only to have the final aggregation sent later to the
> > > query's head node.
> > >
> > > On Fri, Jan 16, 2015 at 3:19 AM, mufy <[email protected]> wrote:
> > >
> > > > And what would be the best way of ensuring that all the drill-bit
> > > > nodes participated in the query execution?
> > > >
> > > >
> > > > ---
> > > > Mufeed Usman
> > > > My LinkedIn <http://www.linkedin.com/pub/mufeed-usman/28/254/400> |
> > > > My Social Cause <http://www.vision2016.org.in/> | My Blogs :
> > > > LiveJournal <http://mufeed.livejournal.com>
> > > >
> > > >
> > > > On Fri, Jan 16, 2015 at 4:45 PM, Steven Phillips <[email protected]>
> > > > wrote:
> > > >
> > > > > I would guess that for the first run, data had to be read off
> > > > > disk, plus runtime code had to be compiled. Subsequent runs did
> > > > > not need to do this, since the data should then be in cache, as
> > > > > well as the compiled classes, so the subsequent runs are
> > > > > noticeably faster. Runs 1 - 4 have a range of about 1.5 seconds,
> > > > > which seems like an unremarkable amount of noise.
> > > > >
> > > > > On Fri, Jan 16, 2015 at 3:07 AM, mufy <[email protected]> wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I was curious to know the possible reason(s) behind the
> > > > > > difference in timings observed as shown below:
> > > > > >
> > > > > > 0: jdbc:drill:zk=> select count(*) from
> > > > > > dfs.tmp.`yelp_academic_dataset_review.json`;
> > > > > > +------------+
> > > > > > | EXPR$0     |
> > > > > > +------------+
> > > > > > | 1125458    |
> > > > > > +------------+
> > > > > > 1 row selected (15.214 seconds)
> > > > > >
> > > > > > 0: jdbc:drill:zk=> select count(*) from
> > > > > > dfs.tmp.`yelp_academic_dataset_review.json`;
> > > > > > +------------+
> > > > > > | EXPR$0     |
> > > > > > +------------+
> > > > > > | 1125458    |
> > > > > > +------------+
> > > > > > 1 row selected (12.717 seconds)
> > > > > >
> > > > > > 0: jdbc:drill:zk=> select count(*) from
> > > > > > dfs.tmp.`yelp_academic_dataset_review.json`;
> > > > > > +------------+
> > > > > > | EXPR$0     |
> > > > > > +------------+
> > > > > > | 1125458    |
> > > > > > +------------+
> > > > > > 1 row selected (11.833 seconds)
> > > > > >
> > > > > > 0: jdbc:drill:zk=> select count(*) from
> > > > > > dfs.tmp.`yelp_academic_dataset_review.json`;
> > > > > > +------------+
> > > > > > | EXPR$0     |
> > > > > > +------------+
> > > > > > | 1125458    |
> > > > > > +------------+
> > > > > > 1 row selected (13.298 seconds)
> > > > > >
> > > > > > 0: jdbc:drill:zk=> select count(*) from
> > > > > > dfs.tmp.`yelp_academic_dataset_review.json`;
> > > > > > +------------+
> > > > > > | EXPR$0     |
> > > > > > +------------+
> > > > > > | 1125458    |
> > > > > > +------------+
> > > > > > 1 row selected (12.749 seconds)
> > > > > >
> > > > > > This was run using MapR Drill 0.7.0 on a 5 node MapR cluster.
> > > > > >
> > > > > > ---
> > > > > > Mufeed Usman
> > > > >
> > > > > --
> > > > > Steven Phillips
> > > > > Software Engineer
> > > > >
> > > > > mapr.com
>
> --
> "Not everything that can be counted counts, and not everything that counts
> can be counted." Albert Einstein

--
Steven Phillips
Software Engineer
mapr.com
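
On the block-size question raised in the thread: the number of blocks a file spans is the ceiling of its size divided by the block size. The sketch below illustrates only that arithmetic. The function name and the sample sizes are assumptions for illustration, not values taken from this cluster. As noted at the top of the thread, for a JSON file this count says nothing about how many threads read it.

```python
def blocks_spanned(file_size_bytes, block_size_bytes):
    """Return how many fixed-size blocks are needed to hold the file.

    Uses integer ceiling division, so a file that is one byte over a
    block boundary counts one extra block. An empty file spans 0 blocks.
    """
    if block_size_bytes <= 0:
        raise ValueError("block size must be positive")
    if file_size_bytes < 0:
        raise ValueError("file size must not be negative")
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


# Hypothetical sizes, chosen only to show the arithmetic.
MIB = 1024 * 1024
example_blocks = blocks_spanned(1024 * MIB, 128 * MIB)  # 1 GiB file, 128 MiB blocks -> 8
```

The block count is an upper bound on the number of nodes holding distinct parts of the file; several blocks can land on the same node.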
