Hi! We have a bunch of rows in HBase that store varying amounts of data (1-50 MB). We use HBase versioning and keep up to 10000 versions per column. Typically each column has only a few versions, but in rare cases it may have thousands.
The MapReduce algorithm uses a full scan and requires all versions to produce its result, so we call scan.setMaxVersions(). In the worst case the Region Server returns only one row, but a huge one. Its size is unpredictable and cannot be controlled, because the available parameters let us control only the row count. As a result, the MR task can throw an OOME even with a 50 GB heap.

Is it possible to handle this situation? For example, the RS could refuse to send the row to the client if the client does not have enough memory to handle it. In that case the client could handle the error and fetch each of the row's versions in a separate get request.

Best wishes,
--
Andrejs Dubovskis