This looks like a sampling problem. The short answer would be to use Spark's
RDD.sample.

If RDD.sample is still too slow for your requirement, then reservoir sampling
may be the direction to investigate, but I am not aware of any existing
implementation yet.

Reservoir sampling - Wikipedia
Reservoir sampling is a family of randomized algorithms for randomly choosing a 
sample of k items from a list S containing n items, where n is either a very 
large or unknown number.
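
To make the idea concrete, here is a minimal sketch of the simplest member of
that family (often called Algorithm R) in plain Python. It is not tied to Spark
or HBase: the function name reservoir_sample and its parameters are
hypothetical, and rows stands for any iterable, for example the results of a
scan. Note that it still reads every row once, so on its own it does not avoid
the full scan; what it gives is exactly k rows in a single pass without knowing
n up front.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return k items chosen uniformly at random from rows in one pass.

    If rows yields fewer than k items, all of them are returned.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Pick a slot in [0, i]; the new row replaces an existing one
            # only when the slot falls inside the reservoir, which happens
            # with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = row
    return reservoir


# Example: pick 1,000 row keys out of 1,000,000 without knowing the count up front.
sample = reservoir_sample(range(1_000_000), 1000, seed=42)
```

Memory use is proportional to k only, so a 1M-row sample has to fit on the
client (or be split per partition and merged, which is a separate design
question not covered by this sketch).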

From: Liu, Ming (Ming)
Sent: Friday, April 13, 2018 12:16:07 AM
Subject: how to get random rows from a big hbase table faster

Hi, all,

We have an HBase table with 1 billion rows, and we want to randomly get 1M
rows from it. We are now trying RandomRowFilter, but it is still very slow.
If I understand it correctly, on the server side RandomRowFilter still needs
to read all 1 billion rows and then randomly returns 1% of them. But reading
1 billion rows is very slow. Is this true?

So is there any better way to randomly get 1% of the rows from a given table?
Any ideas would be much appreciated.
We don't know the distribution of the 1 billion rows in advance.


