Just a follow up on this. After reviewing the physical plans, the difference appears to be that dir0 is now included as a column in the parquet-scan and then it is passed to a project for price.
It would be better if even when dir0 is specified as a filter, if it is not specified in the projection (e.g. select/group/etc.) then it is not included in the scan. On Thu, Feb 19, 2015 at 4:09 PM, Adam Gilmore <[email protected]> wrote: > Hi guys, > > I'm trying to understand something about directory partitions and how > they're implemented. > > For sake of basic argument, I have ~3 mil rows in 3 separate Parquet > files. Each one has a "groupId" of 1, 2 and 3 respectively. > > I then place them in separate directories named 1, 2 and 3. > > The following query takes ~250 ms: > > SELECT SUM(price) FROM dfs.tmp.test WHERE groupId IN (1, 2, 3) > > The following query takes ~500 ms: > > SELECT SUM(price) FROM dfs.tmp.test WHERE dir0 IN (1, 2, 3) > > Now I imagine the former is grabbing two fields (price and groupId), then > running the groupId through a filter and then adding price, whereas I would > imagine the latter would just read price from the 3 Parquet files in those > matching directories. > > I can't understand why this would be twice as slow. The only thing I can > imagine is that the dir0..x logic is completely separate and it's doing 3 > very disparate Parquet reads. > > To make it more obvious, I moved all Parquet files into the same directory > and the query takes ~100-150ms. > > Any ideas? >
