Using TABLESAMPLE(0.1) is actually far worse. Spark first spends 12 minutes
discovering all split files on all hosts (for some reason) before it even
starts the job, and then it creates 3.5 million tasks (the partition has
only ~32k split files).
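
(A sketch of a possible workaround, not something verified here: since any
~100m rows will do, one can list the partition's split files directly and
read only a subset, so the remaining splits are never opened. This assumes
Spark 1.4+, Parquet files, and the spark-shell's sc/sqlContext; the path and
fraction below are placeholders.)

  import org.apache.hadoop.fs.{FileSystem, Path}
  import scala.util.Random

  val fs = FileSystem.get(sc.hadoopConfiguration)
  // placeholder path to the one partition in question (~32k split files)
  val partDir = new Path("/warehouse/db/my_table/ds=2015-08-26")
  val allFiles = fs.listStatus(partDir).map(_.getPath.toString)
  // grab ~10% of the splits at random; which rows they contain is
  // arbitrary, which is fine since we don't care which rows we get
  val someFiles = Random.shuffle(allFiles.toSeq).take(allFiles.length / 10)
  // Spark 1.4+ DataFrameReader accepts multiple paths
  val sample = sqlContext.read.parquet(someFiles: _*)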

On Wed, Aug 26, 2015 at 9:36 AM, Jörn Franke <jornfra...@gmail.com> wrote:

>
> Have you tried tablesample? You will find the exact syntax in the
> documentation, but it does exactly what you want.
>
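> For reference, a sketch of the syntax meant here (the table name is a
> placeholder; this is Hive's block-sampling form, assuming a HiveContext
> that parses it):
>
>   sqlContext.sql("SELECT * FROM my_table TABLESAMPLE(0.1 PERCENT)")
>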
> On Wed, Aug 26, 2015 at 18:12, Thomas Dudziak <tom...@gmail.com> wrote:
>
>> Sorry, I meant without reading from all splits. This is a single
>> partition in the table.
>>
>> On Wed, Aug 26, 2015 at 8:53 AM, Thomas Dudziak <tom...@gmail.com> wrote:
>>
>>> I have a sizeable table (2.5T, 1b rows) that I want to get ~100m rows
>>> from and I don't particularly care which rows. Doing a LIMIT unfortunately
>>> results in two stages where the first stage reads the whole table, and the
>>> second then performs the limit with a single worker, which is not very
>>> efficient.
>>> Is there a better way to sample a subset of rows in Spark, ideally in a
>>> single stage and without reading all partitions?
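>>>
>>> For context, a sketch of the two behaviors described above (names are
>>> placeholders; Spark 1.3+ DataFrame API):
>>>
>>>   val df = sqlContext.table("my_table")
>>>   // two stages: a full scan, then the limit runs in a single task
>>>   df.limit(100000000)
>>>   // per-row sampling adds no shuffle stage, but still scans every split
>>>   df.sample(false, 0.1)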
>>>
>>> cheers,
>>> Tom
>>>
>>
>>
