On Fri, Apr 7, 2017 at 8:04 PM, Subhash Sriram <[email protected]>
wrote:

> Hi,
>
> We use monotonically_increasing_id() as well, but just cache the table
> first like Ankur suggested. With that method, we get the same keys in all
> derived tables.
>

Ah, okay, awesome. Let me give that a go.
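
Something like this, presumably (PySpark; the row_id column name is just
illustrative):

    from pyspark.sql import functions as F

    # Assign the IDs once, then cache and materialize so derived tables
    # reuse the same values instead of recomputing them per query.
    with_id = df.withColumn("row_id", F.monotonically_increasing_id()).cache()
    with_id.count()  # force materialization

    x = with_id.where(F.col("value") > 0)   # hypothetical derived tables
    y = with_id.select("row_id", "value")

Though I guess cache() is only best-effort -- if a cached partition gets
evicted and recomputed from a nondeterministic upstream, the IDs could
still change.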



>
> Thanks,
> Subhash
>
> Sent from my iPhone
>
> On Apr 7, 2017, at 7:32 PM, Everett Anderson <[email protected]>
> wrote:
>
> Hi,
>
> Thanks, but that's using a random UUID. Collisions are certainly
> unlikely, but uniqueness isn't guaranteed.
>
> I'd prefer something like monotonically_increasing_id or RDD's
> zipWithUniqueId, but with better behavioral guarantees -- so people aren't
> surprised when two or more outputs derived from the same original table
> end up with different IDs for the same rows.
>
> It seems like this would be possible under the covers, but it would carry
> the performance penalty of needing something like a count() followed by a
> checkpoint.
>
> I was hoping there's a better way.
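>
> Something like this, maybe, going through the RDD API (a sketch only --
> assuming a SparkSession named spark and Spark 2.1+ for checkpoint; the
> path and column name are illustrative). zipWithIndex costs an extra job
> to compute per-partition counts, but is deterministic for a fixed input,
> and the checkpoint then freezes the result:
>
>     from pyspark.sql.types import LongType, StructField
>
>     # (Row, index) pairs with consecutive indexes across partitions
>     pairs = df.rdd.zipWithIndex()
>     schema = df.schema.add(StructField("row_id", LongType(), False))
>     with_id = pairs.map(lambda p: p[0] + (p[1],)).toDF(schema)
>
>     spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
>     with_id = with_id.checkpoint()  # cut the lineage so the IDs can't change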
>
>
> On Fri, Apr 7, 2017 at 4:24 PM, Tim Smith <[email protected]> wrote:
>
>> http://stackoverflow.com/questions/37231616/add-a-new-column-to-a-dataframe-new-column-i-want-it-to-be-a-uuid-generator
>>
>>
>> On Fri, Apr 7, 2017 at 3:56 PM, Everett Anderson <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> What's the best way to assign a truly unique row ID (rather than a hash)
>>> to a DataFrame/Dataset?
>>>
>>> I originally thought that functions.monotonically_increasing_id would do
>>> this, but it has a rather unfortunate property: if you add it as a column
>>> to table A and then derive tables X, Y, and Z and save those, the row ID
>>> values for the same original row may end up different in X, Y, and Z. I
>>> assume this is because the ID assignment is deferred until each of those
>>> tables is actually computed.
>>>
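>>> Concretely, something like this (a sketch; the paths and column names
>>> are illustrative):
>>>
>>>     from pyspark.sql import functions as F
>>>
>>>     a = df.withColumn("id", F.monotonically_increasing_id())
>>>     a.where(F.col("value") > 0).write.parquet("/tmp/x")
>>>     a.select("id", "value").write.parquet("/tmp/y")
>>>     # Each save re-executes the plan, so the id assigned to the same
>>>     # original row can differ between the saved outputs.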
>>>
>>
>> --
>> Thanks,
>>
>> Tim
>>
>
