Efficient sampling from a Hive table

Thomas Dudziak Wed, 26 Aug 2015 08:55:17 -0700

I have a sizeable table (2.5 TB, 1 billion rows) that I want to get ~100
million rows from, and I don't particularly care which rows. Doing a LIMIT unfortunately
results in two stages where the first stage reads the whole table, and the
second then performs the limit with a single worker, which is not very
efficient.
Is there a better way to sample a subset of rows in Spark, ideally in a
single stage and without reading all partitions?


cheers,
Tom
