Hi,

Thanks, but that approach uses a random UUID. Collisions are very unlikely,
but uniqueness isn't guaranteed.

I'd prefer something like monotonically_increasing_id or RDD's
zipWithUniqueId, but with more predictable behavior, so that people aren't
surprised when two or more outputs derived from the same original table end
up with different IDs for the same rows.
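
To illustrate the surprise: as documented, monotonically_increasing_id puts
the partition index in the upper 31 bits and the row's position within its
partition in the lower 33 bits. Below is a plain-Python imitation of that
layout, not Spark code; mono_ids is a made-up name and lists stand in for
partitions. If the plan is re-evaluated with a different partition layout,
the same rows can come out with different IDs:

```python
def mono_ids(partitions):
    """Imitate the documented ID layout: (partition index << 33) + row position.

    Hypothetical stand-in for illustration; this is not Spark code.
    """
    return {
        row: (pidx << 33) + pos
        for pidx, part in enumerate(partitions)
        for pos, row in enumerate(part)
    }


# The same three rows, laid out in two different partitionings.
ids_first = mono_ids([["a", "b"], ["c"]])
ids_second = mono_ids([["a"], ["b", "c"]])

# Row "b" is second in partition 0 in one layout and first in partition 1
# in the other, so its ID changes between the two evaluations.
print(ids_first["b"], ids_second["b"])
```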

It seems like this would be possible under the covers, but it would carry
the performance penalty of needing perhaps a count() followed by a
checkpoint.
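
Here's a rough sketch of the count-then-offset idea, in plain Python rather
than Spark so it's easy to run. Partitions are modeled as lists, and
assign_stable_ids is a made-up name, not a Spark API. As far as I understand
it, this is roughly the shape of what RDD.zipWithIndex does: one pass to
count each partition, then a second pass that adds the partition's starting
offset to each row's local position.

```python
from itertools import accumulate


def assign_stable_ids(partitions):
    """Two-pass ID assignment over partitions modeled as lists of rows.

    Hypothetical helper for illustration only, not a Spark API.
    """
    # Pass 1: count the rows in each partition (the count() step).
    counts = [len(p) for p in partitions]
    # Starting offset of each partition = total rows in all earlier partitions.
    offsets = [0] + list(accumulate(counts))[:-1]
    # Pass 2: ID = partition offset + position of the row within its partition.
    return [
        [(offset + i, row) for i, row in enumerate(part)]
        for offset, part in zip(offsets, partitions)
    ]


partitions = [["a", "b"], [], ["c", "d", "e"]]
with_ids = assign_stable_ids(partitions)
# Materialize once (the checkpoint step) so every derived output reuses the
# same (id, row) pairs instead of recomputing them.
materialized = [pair for part in with_ids for pair in part]
print(materialized)
```

The IDs are only stable across derived tables if the materialized result is
what gets reused, which is where the extra cost comes from.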

I was hoping there was a better way.


On Fri, Apr 7, 2017 at 4:24 PM, Tim Smith <[email protected]> wrote:

> http://stackoverflow.com/questions/37231616/add-a-new-column-to-a-dataframe-new-column-i-want-it-to-be-a-uuid-generator
>
>
> On Fri, Apr 7, 2017 at 3:56 PM, Everett Anderson <[email protected]> wrote:
>
>> Hi,
>>
>> What's the best way to assign a truly unique row ID (rather than a hash)
>> to a DataFrame/Dataset?
>>
>> I originally thought that functions.monotonically_increasing_id would do
>> this, but it seems to have a rather unfortunate property that if you add it
>> as a column to table A and then derive tables X, Y, Z and save those, the
>> row ID values in X, Y, and Z may end up different. I assume this is because
>> it delays the actual computation to the point where each of those tables is
>> computed.
>>
>>
>
>
> --
> Thanks,
>
> Tim
>
