Hi,

We use monotonically_increasing_id() as well, but just cache the table first,
like Ankur suggested. With that method, we get the same keys in all derived
tables.
Thanks,
Subhash

> On Apr 7, 2017, at 7:32 PM, Everett Anderson <[email protected]> wrote:
>
> Hi,
>
> Thanks, but that's using a random UUID. Certainly unlikely to have
> collisions, but not guaranteed.
>
> I'd prefer something like monotonically_increasing_id or RDD's
> zipWithUniqueId, but with better behavioral characteristics -- so they
> don't surprise people when 2+ outputs derived from an original table end
> up not having the same IDs for the same rows.
>
> It seems like this would be possible under the covers, but would have the
> performance penalty of needing to do perhaps a count() and then also a
> checkpoint.
>
> I was hoping there's a better way.
>
>> On Fri, Apr 7, 2017 at 4:24 PM, Tim Smith <[email protected]> wrote:
>> http://stackoverflow.com/questions/37231616/add-a-new-column-to-a-dataframe-new-column-i-want-it-to-be-a-uuid-generator
>>
>>> On Fri, Apr 7, 2017 at 3:56 PM, Everett Anderson <[email protected]> wrote:
>>> Hi,
>>>
>>> What's the best way to assign a truly unique row ID (rather than a
>>> hash) to a DataFrame/Dataset?
>>>
>>> I originally thought that functions.monotonically_increasing_id would
>>> do this, but it seems to have a rather unfortunate property: if you add
>>> it as a column to table A and then derive tables X, Y, and Z and save
>>> those, the row ID values in X, Y, and Z may end up different. I assume
>>> this is because it delays the actual computation to the point where
>>> each of those tables is computed.
>>
>> --
>> Thanks,
>>
>> Tim
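For context on the surprising behavior discussed in the thread: per the Spark API documentation, monotonically_increasing_id() composes each 64-bit ID from the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. Because the column is computed lazily, recomputing a derived table under a different partitioning can place the same row in a different partition, yielding a different ID. The bit layout can be illustrated in plain Python (no Spark needed); mono_id is a hypothetical helper, not a Spark API:

```python
def mono_id(partition_id: int, record_number: int) -> int:
    """Compose an ID the way monotonically_increasing_id() does:
    partition ID in the upper 31 bits, per-partition record number
    in the lower 33 bits."""
    assert 0 <= partition_id < 2**31
    assert 0 <= record_number < 2**33
    return (partition_id << 33) | record_number

# The first row of partition 0 gets 0; the first row of partition 1
# gets 2**33. If a recomputation moves a row to another partition,
# its ID changes, which is the instability described above.
```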
