Some problems in one accident on my production cluster

Heng Chen Wed, 24 Feb 2016 15:31:44 -0800

I ran a MapReduce job on my production cluster (0.98.6); it needs to scan
one table during the map phase.


Because of the heavy load from the job, all of my RegionServers (RS)
crashed due to OOM.

After I restarted all the RSs, I found a problem.

All regions were reopened on a single RS, and the balancer could not run
because two regions were in transition. The cluster was stuck for a long
time until I restarted the master.

1.  Why did this happen?

2.  If a cluster has a lot of regions, how should it be restarted after
all the RSs crash? If the RSs are restarted one by one, OOM may happen
again because the first RS has to hold all the regions, and the restart
will take a long time.

3.  Is it possible to give each table a request quota, so that when one
table is requested heavily it has no impact on the other tables in the
cluster?


Thanks
