Have you tried TABLESAMPLE? You'll find the exact syntax in the documentation, but it does exactly what you want.
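As a rough sketch (table name is made up; Spark SQL supports Hive-style percentage sampling when using a HiveContext), sampling ~100m of 1b rows would be about 10 percent:

```sql
-- Hypothetical table name; samples roughly 10% of the rows
-- without a full-table shuffle or a single-worker LIMIT stage.
SELECT * FROM my_big_table TABLESAMPLE (10 PERCENT)
```

If you're working with a DataFrame/RDD rather than SQL, `sample(withReplacement = false, fraction = 0.1)` gives a similar fraction-based sample as a narrow transformation, though it still scans the partitions it samples from.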
On Wed, Aug 26, 2015 at 6:12 PM, Thomas Dudziak <tom...@gmail.com> wrote:

> Sorry, I meant without reading from all splits. This is a single partition
> in the table.
>
> On Wed, Aug 26, 2015 at 8:53 AM, Thomas Dudziak <tom...@gmail.com> wrote:
>
>> I have a sizeable table (2.5T, 1b rows) that I want to get ~100m rows
>> from, and I don't particularly care which rows. Doing a LIMIT unfortunately
>> results in two stages, where the first stage reads the whole table and the
>> second then performs the limit with a single worker, which is not very
>> efficient.
>> Is there a better way to sample a subset of rows in Spark, ideally in a
>> single stage without reading all partitions?
>>
>> cheers,
>> Tom