Re: Varying Execution Times For The Same Query On The Same File

Steven Phillips Fri, 16 Jan 2015 16:51:54 -0800

JSON files are not splittable. There will be exactly one thread reading the
file, regardless of how big it is.
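
Since one file means one reader, the practical way to get more readers is to
break the input into several files, as suggested further down in the thread.
A minimal sketch of that, assuming the input is newline-delimited JSON (one
record per line); the function name, output directory and part naming are
hypothetical and not part of Drill:

```python
import os


def split_json_lines(src_path, out_dir, num_parts):
    """Distribute the lines of src_path round-robin across num_parts files.

    Assumes one JSON record per line, so any line boundary is a safe
    place to cut. Returns the list of part-file paths that were written.
    """
    if num_parts < 1:
        raise ValueError("num_parts must be at least 1")
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.splitext(os.path.basename(src_path))[0]
    paths = [
        os.path.join(out_dir, f"{base}_part{i}.json") for i in range(num_parts)
    ]
    outs = [open(p, "w", encoding="utf-8") for p in paths]
    try:
        with open(src_path, encoding="utf-8") as src:
            for n, line in enumerate(src):
                # Record n goes to part n mod num_parts.
                outs[n % num_parts].write(line)
    finally:
        for f in outs:
            f.close()
    return paths
```

The query would then be pointed at the directory holding the parts rather
than at a single file. Whether every node actually takes part still depends
on how the planner assigns the files.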

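On the block-size comparison asked about in the quoted message below: for a
format that is split, the number of blocks a file spans is a ceiling
division of file size by block size. A small sketch with made-up sizes (the
function name and the numbers are only for illustration); for JSON it does
not change the number of readers, which stays at one:

```python
def blocks_spanned(file_size_bytes, block_size_bytes):
    """Return how many blocks a file of the given size occupies."""
    if block_size_bytes <= 0:
        raise ValueError("block size must be positive")
    if file_size_bytes < 0:
        raise ValueError("file size must not be negative")
    # Ceiling division using integers only.
    return -(-file_size_bytes // block_size_bytes)


MB = 1024 * 1024
# Hypothetical example: a 300 MB file stored with 128 MB blocks.
example_blocks = blocks_spanned(300 * MB, 128 * MB)
```
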

On Fri, Jan 16, 2015 at 4:15 PM, George Chow <[email protected]> wrote:

> It should be possible to compare your HDFS block size with your file size
> to determine how many blocks (and hence nodes) the file spans.
>
> Is my understanding sound?
>
> George
>
>
> On Fri, Jan 16, 2015 at 11:52 AM, Ted Dunning <[email protected]> wrote:
>
> > If you do want to have more parallelism, use several input files.
> >
> >
> > On Fri, Jan 16, 2015 at 9:13 AM, Jason Altekruse <[email protected]> wrote:
> >
> > > I do not think we currently consider JSON files splittable. If we do
> > > treat them as such, how the read is spread would depend on the file size
> > > and the read locality available on the nodes. Especially with a select *
> > > (or a count(*)) query, there is nothing to parallelize except the read
> > > operation and a simple aggregation. Spreading a small read throughout the
> > > cluster would only guarantee that some of the reads happen over the wire,
> > > only to have the final aggregation sent later to the query's head node.
> > >
> > > On Fri, Jan 16, 2015 at 3:19 AM, mufy <[email protected]> wrote:
> > >
> > > > And what would be the best way of ensuring that all the drill-bit nodes
> > > > participated in the query execution?
> > > >
> > > >
> > > > ---
> > > > Mufeed Usman
> > > > My LinkedIn <http://www.linkedin.com/pub/mufeed-usman/28/254/400> |
> > > > My Social Cause <http://www.vision2016.org.in/> |
> > > > My Blogs : LiveJournal <http://mufeed.livejournal.com>
> > > >
> > > >
> > > >
> > > >
> > > > On Fri, Jan 16, 2015 at 4:45 PM, Steven Phillips <[email protected]> wrote:
> > > >
> > > > > > I would guess that for the first run, data had to be read off disk,
> > > > > > plus runtime code had to be compiled. Subsequent runs did not need
> > > > > > to do this, since the data should then be in cache, as well as the
> > > > > > compiled classes, so the subsequent runs are noticeably faster.
> > > > > > Runs 2 - 5 have a range of about 1.5 seconds (11.833 to 13.298),
> > > > > > which seems like an unremarkable amount of noise.
> > > > >
> > > > > > On Fri, Jan 16, 2015 at 3:07 AM, mufy <[email protected]> wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > > I was curious to know the possible reason(s) behind the difference
> > > > > > > in timings observed as shown below:
> > > > > >
> > > > > > 0: jdbc:drill:zk=> select count(*) from
> > > > > > dfs.tmp.`yelp_academic_dataset_review.json`;
> > > > > > +------------+
> > > > > > |   EXPR$0   |
> > > > > > +------------+
> > > > > > | 1125458    |
> > > > > > +------------+
> > > > > > 1 row selected (15.214 seconds)
> > > > > >
> > > > > > 0: jdbc:drill:zk=> select count(*) from
> > > > > > dfs.tmp.`yelp_academic_dataset_review.json`;
> > > > > > +------------+
> > > > > > |   EXPR$0   |
> > > > > > +------------+
> > > > > > | 1125458    |
> > > > > > +------------+
> > > > > > 1 row selected (12.717 seconds)
> > > > > >
> > > > > > 0: jdbc:drill:zk=> select count(*) from
> > > > > > dfs.tmp.`yelp_academic_dataset_review.json`;
> > > > > > +------------+
> > > > > > |   EXPR$0   |
> > > > > > +------------+
> > > > > > | 1125458    |
> > > > > > +------------+
> > > > > > 1 row selected (11.833 seconds)
> > > > > >
> > > > > > 0: jdbc:drill:zk=> select count(*) from
> > > > > > dfs.tmp.`yelp_academic_dataset_review.json`;
> > > > > > +------------+
> > > > > > |   EXPR$0   |
> > > > > > +------------+
> > > > > > | 1125458    |
> > > > > > +------------+
> > > > > > 1 row selected (13.298 seconds)
> > > > > >
> > > > > > 0: jdbc:drill:zk=> select count(*) from
> > > > > > dfs.tmp.`yelp_academic_dataset_review.json`;
> > > > > > +------------+
> > > > > > |   EXPR$0   |
> > > > > > +------------+
> > > > > > | 1125458    |
> > > > > > +------------+
> > > > > > 1 row selected (12.749 seconds)
> > > > > >
> > > > > > This was run using MapR Drill 0.7.0 on a 5 node MapR cluster.
> > > > > >
> > > > > >
> > > > > > ---
> > > > > > Mufeed Usman
> > > > > > > My LinkedIn <http://www.linkedin.com/pub/mufeed-usman/28/254/400> |
> > > > > > > My Social Cause <http://www.vision2016.org.in/> |
> > > > > > > My Blogs : LiveJournal <http://mufeed.livejournal.com>
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > >  Steven Phillips
> > > > >  Software Engineer
> > > > >
> > > > >  mapr.com
> > > > >
> > > >
> > >
> >
>
>
>
> --
> "Not everything that can be counted counts, and not everything that counts
> can be counted." Albert Einstein
>



-- 
 Steven Phillips
 Software Engineer

 mapr.com
