We are bulk loading 1 billion rows into HBase. The 1-billion-row dataset was split into 20 files of ~22.5 GB each. Ingesting a file into HDFS takes ~2 minutes. Ingesting the first file into HBase took ~3 hours, the next took ~5 hours, and the time keeps increasing. By the sixth or seventh file the ingestion simply stops (the MapReduce bulk load stalls at 99% of the map phase and around 22% of the reduce phase). We also noticed that as soon as the reducers start, the progress of the job slows down.
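For context, here is a minimal sketch of what we assume the CSV bulk load boils down to: a standard HFileOutputFormat2 driver. The table name, column family, and CSV layout below are placeholders, not our actual schema.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CsvBulkLoadDriver {

  // Placeholder mapper: parses "rowkey,value" lines into Puts.
  static class CsvMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split(",", 2);
      byte[] row = Bytes.toBytes(fields[0]); // salted row key in first field
      Put put = new Put(row);
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("v"),
          Bytes.toBytes(fields[1]));
      ctx.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    TableName name = TableName.valueOf("BIG_TABLE"); // placeholder
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(name);
         RegionLocator locator = conn.getRegionLocator(name)) {

      Job job = Job.getInstance(conf, "csv-bulk-load");
      job.setJarByClass(CsvBulkLoadDriver.class);
      job.setMapperClass(CsvMapper.class);
      job.setMapOutputKeyClass(ImmutableBytesWritable.class);
      job.setMapOutputValueClass(Put.class);

      FileInputFormat.addInputPath(job, new Path(args[0]));   // CSV on HDFS
      FileOutputFormat.setOutputPath(job, new Path(args[1])); // HFile output

      // Installs TotalOrderPartitioner over the table's region boundaries,
      // picks the sorting reducer, and sets one reducer per region. The
      // reduce phase is the cluster-wide sort that writes the HFiles.
      HFileOutputFormat2.configureIncrementalLoad(job, table, locator);

      // After the job succeeds, LoadIncrementalHFiles / completebulkload
      // moves the HFiles under args[1] into the regions.
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }
}

If that assumption holds, configureIncrementalLoad() is what makes the reduce phase exist at all: it sets the number of reducers to the number of regions and partitions by region boundary, so a table with only a few regions funnels the whole sort through a handful of reducers.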
The logs do not show any problems and we do not see any hotspotting (the table is already salted). We are running out of ideas. A few questions to get started:

1- Is the increasing MapReduce time expected? Does MapReduce need to sort the new data against the already-ingested data?
2- Is there a way to speed this up, especially since our data is already sorted? Going from 2 minutes on HDFS to 5 hours on HBase is a big gap; a word-count MapReduce job on 24 GB took only ~7 minutes. Removing the reducers from the existing CSV bulk load will not help, as the mappers will emit the data in random order.

Regards,
Dillon

Dillon Chrimes (PhD)
University of Victoria
Victoria, BC, Canada
