Re: Generating unique id for a column in Row without breaking into RDD and joining back

Mich Talebzadeh Fri, 05 Aug 2016 09:38:59 -0700

By the same token, can one generate a UUID in Hive as shown below?

hive> select reflect("java.util.UUID", "randomUUID");
OK
587b1665-b578-4124-8bf9-8b17ccb01fe7
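
For comparison outside Hive, here is a minimal Python sketch of the same idea. The Hive call above reflects into Java's UUID.randomUUID(), which returns a random (version 4) UUID; Python's uuid.uuid4() is the closest analogue. The helper name random_uuid_string is hypothetical and only for illustration. Note that a UUID is a 128-bit value rendered as a 36-character string, so it will not fit into a 64-bit long or an int column.

```python
import uuid


def random_uuid_string() -> str:
    """Return a random (version 4) UUID in its canonical 36-character form."""
    return str(uuid.uuid4())


if __name__ == "__main__":
    value = random_uuid_string()
    print(value)                      # a different value on every call
    print(len(value))                 # canonical form: 32 hex digits plus 4 hyphens
    print(uuid.UUID(value).int.bit_length() <= 128)  # the value fits in 128 bits
```

This gives uniqueness in practice through randomness rather than through a counter, which is a different trade-off from the integer ids discussed in the quoted replies.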


thx

Dr Mich Talebzadeh








On 5 August 2016 at 17:34, Mike Metzger <m...@flexiblecreations.com> wrote:

> Tony -
>
>    From my testing, this is built with performance in mind.  It's a 64-bit
> value split between the partition id (upper 31 bits, which Spark documents as
> supporting fewer than 1 billion partitions) and the record counter within a
> partition (lower 33 bits, fewer than 8 billion records per partition).  Each
> partition numbers its own rows, so there shouldn't be any added communication
> between the executors and the driver.
>
> I've been toying with an implementation that allows you to specify the
> split for better control along with a start value.
>
> Thanks
>
> Mike
>
> On Aug 5, 2016, at 11:07 AM, Tony Lane <tonylane....@gmail.com> wrote:
>
> Mike.
>
> I have figured out how to do this.  Thanks for the suggestion. It works
> great.  I am now trying to figure out the performance impact.
>
> thanks again
>
>
> On Fri, Aug 5, 2016 at 9:25 PM, Tony Lane <tonylane....@gmail.com> wrote:
>
>> @mike  - this looks great. How can I do this in Java?  What is the
>> performance implication on a large dataset?
>>
>> @sonal  - I can't have a collision in the values.
>>
>> On Fri, Aug 5, 2016 at 9:15 PM, Mike Metzger <m...@flexiblecreations.com>
>> wrote:
>>
>>> You can use the monotonically_increasing_id function to generate
>>> guaranteed unique (but not necessarily consecutive) IDs.  Call something
>>> like:
>>>
>>> df.withColumn("id", monotonically_increasing_id())
>>>
>>> You don't mention which language you're using, but you'll need to import
>>> it from the sql.functions library (org.apache.spark.sql.functions in Scala
>>> or Java, pyspark.sql.functions in Python).
>>>
>>> Mike
>>>
>>> On Aug 5, 2016, at 9:11 AM, Tony Lane <tonylane....@gmail.com> wrote:
>>>
>>> Ayan - basically I have a dataset with the structure below, where the bid
>>> values are unique strings:
>>>
>>> bid: String
>>> val: integer
>>>
>>> I need unique int values for these string bids to do some processing
>>> in the dataset
>>>
>>> like
>>>
>>> id:int   (unique integer id for each bid)
>>> bid:String
>>> val:integer
>>>
>>>
>>>
>>> -Tony
>>>
>>> On Fri, Aug 5, 2016 at 6:35 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>
>>>> Hi
>>>>
>>>> Can you explain a little further?
>>>>
>>>> best
>>>> Ayan
>>>>
>>>> On Fri, Aug 5, 2016 at 10:14 PM, Tony Lane <tonylane....@gmail.com>
>>>> wrote:
>>>>
>>>>> I have a row with structure like
>>>>>
>>>>> identifier: String
>>>>> value: int
>>>>>
>>>>> All identifiers are unique, and I want to generate a unique long id for
>>>>> each row and get a row object back for further processing.
>>>>>
>>>>> I understand I could use the zipWithUniqueId function on an RDD, but that
>>>>> would mean first converting to an RDD and then joining the RDD back to the
>>>>> dataset.
>>>>>
>>>>> What is the best way to do this?
>>>>>
>>>>> -Tony
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Ayan Guha
>>>>
>>>
>>>
>>
>
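
The 64-bit layout Mike describes in the quoted reply (partition id in the upper 31 bits, per-partition record counter in the lower 33 bits) can be sketched in plain Python. This is only an illustration of the bit arithmetic under that stated layout, not Spark's actual source; the names compose_id and split_id are hypothetical.

```python
from typing import Tuple

PARTITION_BITS = 31  # upper bits: partition id (layout as described in the thread)
RECORD_BITS = 33     # lower bits: record counter within the partition


def compose_id(partition_id: int, record_number: int) -> int:
    """Pack a partition id and a per-partition counter into one 64-bit value."""
    if not 0 <= partition_id < (1 << PARTITION_BITS):
        raise ValueError("partition_id does not fit in 31 bits")
    if not 0 <= record_number < (1 << RECORD_BITS):
        raise ValueError("record_number does not fit in 33 bits")
    return (partition_id << RECORD_BITS) | record_number


def split_id(value: int) -> Tuple[int, int]:
    """Recover (partition_id, record_number) from a packed id."""
    return value >> RECORD_BITS, value & ((1 << RECORD_BITS) - 1)


if __name__ == "__main__":
    # Two partitions with three rows each: ids are unique and increasing,
    # but there is a large gap between partitions.
    ids = [compose_id(p, n) for p in range(2) for n in range(3)]
    print(ids)  # [0, 1, 2, 8589934592, 8589934593, 8589934594]
    print(split_id(ids[-1]))  # (1, 2)
```

Because each partition only needs its own index and a local counter, no coordination between partitions is required, which matches the point above about there being no extra executor-to-driver communication. It also shows why the ids are unique but not consecutive.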
