Hi All,
I have data partitioned by year=yyyy/month=mm/day=dd. What is the best way
to get two months of data from a given year (say, June and July)?
Two ways I can think of:
1. use unionAll
df1 = sqc.read.parquet('xxx/year=2015/month=6')
df2 = sqc.read.parquet('xxx/year=2015/month=7')
df = df1.unionAll(df2)
2. use filter after loading the whole year
df = sqc.read.parquet('xxx/year=2015/').filter('month in (6, 7)')
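In case it helps, here is a rough sketch of how option 1 could be generalized to any list of months. The helper names `month_paths` and `union_all` are made up for illustration, and the Spark calls are shown only in comments because they assume an existing SQLContext `sqc` as in the snippets above.

```python
from functools import reduce


def month_paths(base, year, months):
    """Build one partition directory path per requested month."""
    # Follows the layout from the post: <base>/year=<yyyy>/month=<m>
    return ['{}/year={}/month={}'.format(base.rstrip('/'), year, m)
            for m in months]


def union_all(frames, union):
    """Fold a non-empty list of frames into one using a binary union function."""
    return reduce(union, frames)


paths = month_paths('xxx', 2015, [6, 7])
print(paths)

# With a live SQLContext `sqc` (not created here), option 1 would become:
#   dfs = [sqc.read.parquet(p) for p in paths]
#   df = union_all(dfs, lambda a, b: a.unionAll(b))
```

One thing I noticed, as far as I understand partition discovery: when loading from a month-level path, only `day` is picked up as a partition column, so `year` and `month` would not appear in the resulting DataFrame.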
Which of the above is better? Or are there better ways to handle this?
Thank you,
Wei