Here is what happened: I ran an MR job on my production cluster (0.98.6) that scans one table during the map phase.
Because of the heavy load from the job, all of my RSs crashed with OOM. After I restarted all of them, I found a problem: all regions had been reopened on a single RS, and the balancer could not run because two regions were in transition. The cluster stayed stuck for a long time until I restarted the master.

1. Why did this happen?
2. If a cluster has a lot of regions and all RSs crash, how should the cluster be restarted? If the RSs are restarted one by one, OOM may happen again because one RS has to hold all the regions, and it takes a long time.
3. Is it possible to give each table a request quota, so that when one table is requested heavily it has no impact on the other tables in the cluster?

Thanks
