Just a follow up on this.  After reviewing the physical plans, the
difference appears to be that dir0 is now included as a column in the
parquet-scan and then it is passed to a project for price.

It would be better if even when dir0 is specified as a filter, if it is not
specified in the projection (e.g. select/group/etc.) then it is not
included in the scan.

On Thu, Feb 19, 2015 at 4:09 PM, Adam Gilmore <[email protected]> wrote:

> Hi guys,
>
> I'm trying to understand something about directory partitions and how
> they're implemented.
>
> For sake of basic argument, I have ~3 mil rows in 3 separate Parquet
> files.  Each one has a "groupId" of 1, 2 and 3 respectively.
>
> I then place them in separate directories named 1, 2 and 3.
>
> The following query takes ~250 ms:
>
> SELECT SUM(price) FROM dfs.tmp.test WHERE groupId IN (1, 2, 3)
>
> The following query takes ~500 ms:
>
> SELECT SUM(price) FROM dfs.tmp.test WHERE dir0 IN (1, 2, 3)
>
> Now I imagine the former is grabbing two fields (price and groupId), then
> running the groupId through a filter and then adding price, whereas I would
> imagine the latter would just read price from the 3 Parquet files in those
> matching directories.
>
> I can't understand why this would be twice as slow.  The only thing I can
> imagine is that the dir0..x logic is completely separate and it's doing 3
> very disparate Parquet reads.
>
> To make it more obvious, I moved all Parquet files into the same directory
> and the query takes ~100-150ms.
>
> Any ideas?
>

Reply via email to