I am not familiar with Spark terminology, but I think you more or less have it correct.
Your example query does not necessarily involve a hash-exchange (roughly equivalent to a shuffle), because it's possible to run the entire execution in a single fragment. In this case, it would probably be a hash-aggregate followed by a sort; there is no need to redistribute the data. The fact that there isn't a hash-exchange implies that the planner determined the dataset is small enough that it doesn't need parallelization.

On Thu, Mar 24, 2016 at 7:27 PM, Todd <[email protected]> wrote:

> Hi, Drillers,
>
> I am pretty new to Drill and I am trying to understand the workflow of
> Drill query execution.
> While reading the fragment section in
> http://drill.apache.org/docs/drill-query-execution/, I have some
> questions:
>
> 1. It looks to me like a major fragment is similar in concept to a Spark
> stage, and there will be a shuffle between major fragments?
> 2. There will be one or multiple minor fragments within each major
> fragment; this looks similar to a Spark pipeline within one stage
> (there can be many operators in one stage if no shuffle is involved).
> 3. When I run Drill locally (started with drill-embedded) and run the
> following query:
>
> select name, max(age) from hdfs.`user`.`/people.json` group by name order
> by name desc limit 2
>
> The above query should involve a shuffle, so there should be at least 2
> major fragments, but I find only one major fragment in the Drill web UI.
> I am not sure whether my understanding above is right.
>
> Thanks!
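One way to check this for yourself is to ask Drill for the plan directly with EXPLAIN PLAN FOR (a sketch using the query from the original message; the exact operator names and fragment numbering in the output depend on your Drill version and data size):

```sql
-- Show the physical plan for the query. Each operator in the output is
-- prefixed with an id like "00-02": the first part is the major fragment
-- number. If every operator starts with "00-", the whole query runs in a
-- single major fragment and no hash-exchange (shuffle) was planned.
explain plan for
select name, max(age)
from hdfs.`user`.`/people.json`
group by name
order by name desc
limit 2;
```

On a small file you would expect to see operators such as a hash-aggregate and a sort all under fragment 00, with no exchange operator between them; on a larger, parallelized dataset the plan would instead show multiple major fragments separated by exchange operators.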
