On Fri, Apr 7, 2017 at 8:04 PM, Subhash Sriram <[email protected]> wrote:
> Hi,
>
> We use monotonically_increasing_id() as well, but just cache the table
> first like Ankur suggested. With that method, we get the same keys in all
> derived tables.

Ah, okay, awesome. Let me give that a go.

> Thanks,
> Subhash
>
> Sent from my iPhone
>
> On Apr 7, 2017, at 7:32 PM, Everett Anderson <[email protected]> wrote:
>
> Hi,
>
> Thanks, but that's using a random UUID. Certainly unlikely to have
> collisions, but not guaranteed.
>
> I'd prefer something like monotonically_increasing_id or RDD's
> zipWithUniqueId, but with better behavioral characteristics -- so they don't
> surprise people when 2+ outputs derived from an original table end up not
> having the same IDs for the same rows anymore.
>
> It seems like this would be possible under the covers, but would have the
> performance penalty of needing to do perhaps a count() and then also a
> checkpoint.
>
> I was hoping there's a better way.
>
> On Fri, Apr 7, 2017 at 4:24 PM, Tim Smith <[email protected]> wrote:
>
>> http://stackoverflow.com/questions/37231616/add-a-new-column-to-a-dataframe-new-column-i-want-it-to-be-a-uuid-generator
>>
>> On Fri, Apr 7, 2017 at 3:56 PM, Everett Anderson <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> What's the best way to assign a truly unique row ID (rather than a hash)
>>> to a DataFrame/Dataset?
>>>
>>> I originally thought that functions.monotonically_increasing_id would
>>> do this, but it seems to have a rather unfortunate property that if you add
>>> it as a column to table A and then derive tables X, Y, Z and save those,
>>> the row ID values in X, Y, and Z may end up different. I assume this is
>>> because it delays the actual computation to the point where each of those
>>> tables is computed.
>>
>> --
>> Thanks,
>>
>> Tim
