1% of 1B rows is 10M. 10M random reads is doable if:
a. the cluster is sufficiently large,
b. the region servers are equipped with SSDs, and
c. you run multiple clients in parallel to retrieve these rows.
You need to know the min/max row keys of the table in advance. Then generate
a random start row, open a scanner at that start row, and read just the
first KV.
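A minimal sketch of that idea, using a sorted map as a stand-in for the table
(in real HBase client code this would be `Scan.withStartRow(...)` plus
`Scan.setLimit(1)`; the table contents and seed here are made up for
illustration):

```java
import java.util.*;

public class RandomStartRowSample {
    // Simulate "open a scanner at a random start row, read the first KV":
    // a NavigableMap's ceilingEntry() mirrors a scanner seeking to the
    // first row >= the start row.
    static List<Long> sample(NavigableMap<Long, String> table, int n, long seed) {
        Random rnd = new Random(seed);
        long min = table.firstKey(), max = table.lastKey();
        List<Long> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            // random start row in [min, max]
            long start = min + (long) (rnd.nextDouble() * (max - min + 1));
            Map.Entry<Long, String> e = table.ceilingEntry(start);
            if (e == null) e = table.lastEntry(); // start landed past the last row
            out.add(e.getKey());
        }
        return out;
    }

    public static void main(String[] args) {
        TreeMap<Long, String> table = new TreeMap<>();
        for (long r = 0; r < 1000; r += 7) table.put(r, "v" + r);
        List<Long> rows = sample(table, 10, 42L);
        System.out.println(rows.size() + " rows sampled, all present: "
                + table.keySet().containsAll(rows));
    }
}
```

Note the sampling is only uniform if row keys are evenly spread over the key
space: a row sitting after a large gap is more likely to be picked, since
every start row falling in the gap maps to it.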
Or, say, split the min/max row range into N consecutive sub-ranges (N is up
to you) and open N scanners with RandomRowFilter.
Again, you have to run N clients (or threads) to do this in parallel.
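A small sketch of the range-splitting step, assuming numeric row keys for
simplicity (real HBase keys are byte arrays; `Bytes.split` in the HBase
utilities does the byte-array equivalent). Each adjacent pair of boundaries
would become the start/stop row of one parallel scanner, to which a
`RandomRowFilter` with the desired fraction could be attached:

```java
import java.util.*;

public class SplitKeyRange {
    // Split [min, max] into n consecutive sub-ranges.
    // bounds[i]..bounds[i+1] is the i-th scanner's start/stop row.
    static long[] splitPoints(long min, long max, int n) {
        long[] bounds = new long[n + 1];
        long step = (max - min) / n;
        for (int i = 0; i <= n; i++) {
            bounds[i] = min + step * i;
        }
        bounds[n] = max; // absorb integer-division rounding at the top end
        return bounds;
    }

    public static void main(String[] args) {
        // e.g. four parallel scanners over a 0..1e9 key space
        long[] b = splitPoints(0, 1_000_000_000L, 4);
        System.out.println(Arrays.toString(b));
        // -> [0, 250000000, 500000000, 750000000, 1000000000]
    }
}
```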
On Thu, Apr 12, 2018 at 9:16 AM, Liu, Ming (Ming) <ming....@esgyn.cn> wrote:
> Hi, all,
> We have an HBase table which has 1 billion rows, and we want to randomly
> get 1M rows from that table. We are now trying the RandomRowFilter, but it
> is still very slow. If I understand it correctly, on the server side,
> RandomRowFilter still needs to read all 1 billion rows and return a random
> 1% of them. But reading 1 billion rows is very slow. Is this true?
> So is there any other, better way to randomly get 1% of the rows from a
> given table? Any idea will be very appreciated.
> We don't know the distribution of the 1 billion rows in advance.