Hi All,

I have data partitioned by year=yyyy/month=mm/day=dd, what is the best way
to get two months of data from a given year (let's say June and July)?

Two ways I can think of:
1. use unionAll
df1 = sqc.read.parquet('xxx/year=2015/month=6')
df2 = sqc.read.parquet('xxx/year=2015/month=7')
df = df1.unionAll(df2)

2. use filter after load the whole year
df = sqc.read.parquet('xxx/year=2015/').filter('month in (6, 7)')

Which of the above is better? Or are there better ways to handle this?


Thank you,
Wei

Reply via email to