I am not familiar with the Spark terminology, but I think you more or less
have it correct.

Your example query does not necessarily involve a hash-exchange (roughly
the equivalent of a shuffle), because it's possible to run the entire
execution in a single fragment. In that case, it would probably be a
hash-aggregate followed by a sort, with no need to redistribute the data.
The fact that there isn't a hash-exchange implies that the planner
determined the dataset is small enough not to need parallelization.
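One way to check this yourself (a sketch; the exact operator names and plan text will vary with your data and Drill version) is to ask Drill for the query plan with EXPLAIN:

```sql
-- Show the plan for the query. Fragment boundaries show up as
-- Exchange operators (e.g. HashToRandomExchange) in the plan output.
EXPLAIN PLAN FOR
SELECT name, MAX(age)
FROM hdfs.`user`.`/people.json`
GROUP BY name
ORDER BY name DESC
LIMIT 2;
```

If no Exchange operator appears in the plan, the whole query runs as a
single major fragment, which matches what you saw in the web UI.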

On Thu, Mar 24, 2016 at 7:27 PM, Todd <[email protected]> wrote:

> Hi, Drillers,
>
> I am pretty new to Drill and I am trying to understand the work flow of
> drill query execution.
> When I am reading on the fragment section in
> http://drill.apache.org/docs/drill-query-execution/,  I have some
> questions:
>
> 1. It looks to me that a major fragment is like a Spark stage in concept,
> and there will be a shuffle between major fragments?
> 2. There will be one or multiple minor fragments within each major
> fragment; it looks to me that this is similar to a Spark pipeline in one
> stage (there can be many operators in one stage if there is no shuffle
> involved).
> 3. When I run the Drill locally(start with drill-embedded), and run the
> following query
>
> select name, max(age)  from hdfs.`user`.`/people.json` group by name order
> by name desc limit 2
>
> The above query should involve a shuffle, so there should be at least 2
> major fragments, but I find that there is only one major fragment in the
> Drill web UI. Not sure whether my understanding above is right.
>
> Thanks!
>
>
>

Reply via email to