Bryan Beaudreault created HBASE-28440:
-----------------------------------------

             Summary: Add support for using mapreduce sort in HFileOutputFormat2
                 Key: HBASE-28440
                 URL: https://issues.apache.org/jira/browse/HBASE-28440
             Project: HBase
          Issue Type: Improvement
          Components: backup&restore
            Reporter: Bryan Beaudreault


Currently HFileOutputFormat2 uses CellSortReducer, which attempts to sort all 
of the cells of a row in memory using a TreeSet. There is a warning in the 
javadoc "If lots of columns per row, it will use lots of memory sorting." This 
can be problematic for WALPlayer, which uses HFileOutputFormat2. You could have 
reasonably sized row which just gets lots of edits in the time period of WALs 
being replayed, and that would cause an OOM. We are seeing this in some cases 
with incremental backups.

MapReduce has built-in sorting capabilities which are not limited to sorting in 
memory. It can spill to disk as necessary to sort very large datasets. We can 
get this capability in HFileOutputFormat2 with a couple changes:
 # Add support for a KeyOnlyCellComparable type as the map output key
 # When configured, use 
job.setSortComparatorClass(CellWritableComparator.class) and 
job.setReducerClass(PreSortedCellsReducer.class)
 # Update WALPlayer to have a mode which can output this new comparable instead 
of ImmutableBytesWritable

CellWritableComparator exists already for the Import job, so there is some 
prior art. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to