On Wed, Jun 7, 2017 at 7:44 PM, Jason Heo <[email protected]> wrote:
> Hi. > > This is a partition strategy of my table. > > PARTITION BY HASH (...) PARTITIONS 40, > RANGE (ymd) ( > PARTITION VALUES < "2015", > PARTITION "2015" <= VALUES < "2016", > PARTITION "2016" <= VALUES < "2017", > PARTITION "2017" <= VALUES > ) > > My concern is that how to manage RANGE(ymd) partitions for greather than > 2017. > > plan 1) using a cron job, add 2018 partition at the end of 2017, add 2019 > partition at the end of 2018, ... > - pros: no unused partitions > - cons: problems arise if next year's partition is not created by > mistake > plan 2) add all upcoming 10 years' partitions > - pros: can reduce risks > - cons: 400 partitions (40*10 years) are created but they has no data > > I prefer to plan 2) but I'm wondering what many unnecessarily partitions > lead to problems. > The empty partitions will increase heartbeat traffic on your nodes, etc. It's not major, but does add a bit of overhead. They will also participate in any queries you run which are not able to prune partitions based on range. Even though they have no data, there would be unnecessary Spark/impala tasks/fragments running against the empty partitions, etc, which may impact performance and concurrency. I think I'd suggest plan 1, plus also put it on several people's calendars to verify :) Alternatively, something like in 2017 add the partitions for 2018 and 2019, so you always maintain one extra year ahead and you are less likely to "not notice" if the new one is not created in time. -Todd -- Todd Lipcon Software Engineer, Cloudera
