On 8/13/2010 1:58 AM, Jeremy Carroll wrote:
I'm currently importing some files into HBase and am running into a problem with a large number of store files being created. We have some back data stored in very large sequence files (3-5 GB each). When we import this data, the number of store files created does not get out of hand. When we switch to importing smaller sequence files, the number of store files rises quite dramatically. I do not know if this is happening because we are flushing the commits more frequently with smaller files. I'm wondering if anybody has any advice on this issue.

My main concern is that during this process we do not finish flushing to disk (and we set writeToWAL to false). We always hit the 90-second timeout due to heavy write load. As these store files pile up without being committed to disk, we could lose a lot of data if something were to crash.
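
For reference, disabling the WAL happens per Put; roughly like this (a minimal sketch against the 0.20-era client API; the table, row, and column names are just placeholders, not our real schema):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WalOffPut {
        public static void main(String[] args) throws Exception {
            // Placeholder table/column names -- illustrative only.
            HTable table = new HTable(new HBaseConfiguration(), "backdata");
            Put put = new Put(Bytes.toBytes("row-0001"));
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
            // Skipping the write-ahead log speeds up the import, but any
            // edits still in the memstore are lost if a region server
            // crashes before they are flushed to disk.
            put.setWriteToWAL(false);
            table.put(put);
            table.flushCommits();
        }
    }
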
I have created screenshots of our monitoring application for HBase which show the spikes in activity:
http://twitpic.com/photos/jeremy_carroll
We faced a similar problem while doing bulk imports. With a large number of reducers, we got a large number of small files; most probably, each reducer creates at least one file. Choosing an appropriate number of reducers and input file size solved the issue.
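
Setting the reducer count is a one-liner in the new MapReduce API; something like this (a sketch of a job driver, with the job name and count purely illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ImportJob {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "hbase-bulk-import");
            // Each reducer writes roughly one output file, so fewer
            // reducers mean fewer, larger files; 16 is illustrative.
            job.setNumReduceTasks(16);
            // ... mapper/reducer/input/output setup elided ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
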
Lekhnath