On Saturday, December 3, 2016, Jonathan Haddad <j...@jonhaddad.com> wrote:

> That isn't what the original thread is about. The thread is about the
> timestamp portion of the UUID being different.
>
> Having UUID() return the same thing for all rows in a batch would be the
> unexpected thing virtually every time.
> On Sat, Dec 3, 2016 at 7:09 AM Edward Capriolo <edlinuxg...@gmail.com> wrote:
>
>>
>>
>> On Friday, December 2, 2016, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>
>>> This isn't about using the same UUID though. It's about the timestamp
>>> bits in the UUID.
>>>
>>> What is the use case for generating multiple UUIDs in a single row? Why
>>> do you need to extract the timestamp out of both?
>>> On Fri, Dec 2, 2016 at 10:24 AM Edward Capriolo <edlinuxg...@gmail.com>
>>> wrote:
>>>
>>>>
>>>> On Thu, Dec 1, 2016 at 11:09 AM, Sylvain Lebresne <sylv...@datastax.com
>>>> > wrote:
>>>>
>>>>> On Thu, Dec 1, 2016 at 4:44 PM, Edward Capriolo <edlinuxg...@gmail.com
>>>>> > wrote:
>>>>>
>>>>>>
>>>>>> I am not sure you saw my reply on the thread, but I believe
>>>>>> everyone's needs can be met. I will copy that here:
>>>>>>
>>>>>
>>>>> I saw it, but the real problem that was raised initially was not that
>>>>> of UDFs and of allowing both behaviors. It's a matter of people being
>>>>> confused by the behavior of a non-UDF function, now(), and suggesting it
>>>>> should be changed.
>>>>>
>>>>> The Hive idea is interesting I guess, and we can switch to discussing
>>>>> that, but it's a different problem really and I'm not fond of derailing
>>>>> threads. I will just note though that if we're not talking about a
>>>>> confusion issue but rather how to get a timeuuid to be fixed within a
>>>>> statement, then there is a much, much more trivial solution: generate it
>>>>> client side. The `now()` function is a small convenience but there is
>>>>> nothing you cannot do without it client side, and that actually basically
>>>>> stands for almost any use of a (non-aggregate) function in Cassandra
>>>>> currently.
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> "Food for thought: Hive's UDFs introduced an annotation
>>>>>> @UDFType(deterministic = false)
>>>>>>
>>>>>> http://dmtolpeko.com/2014/10/15/invoking-stateful-udf-at-map-and-reduce-side-in-hive/
>>>>>>
>>>>>> The effect is the query planner can see when such a UDF is in use and
>>>>>> determine the value once at the start of a very long query."
>>>>>>
>>>>>> Essentially, Hive had a similar if not identical problem: during a
>>>>>> long-running distributed process like map/reduce, some users wanted the
>>>>>> semantics of:
>>>>>>
>>>>>> 1) Each call should have a new timestamp
>>>>>>
>>>>>> While other users wanted the semantics of:
>>>>>>
>>>>>> 2) Each call should generate the same timestamp
>>>>>>
>>>>>> The solution implemented was to add an annotation to udf such that
>>>>>> the query planner would pick up the annotation and act accordingly.
>>>>>>
>>>>>> (Here is a related issue: https://issues.apache.org/jira/browse/HIVE-1986)
>>>>>>
>>>>>> As a result you can essentially implement two UDFs
>>>>>>
>>>>>> @UDFType(deterministic = false)
>>>>>> public class UDFNow
>>>>>>
>>>>>> and for the other people
>>>>>>
>>>>>> @UDFType(deterministic = true)
>>>>>> public class UDFNowOnce extends UDFNow
>>>>>>
>>>>>> Both use cases are met in a sensible way.
>>>>>>
>>>>>
>>>>>
>>>> The `now()` function is a small convenience but there is nothing you
>>>> cannot do without it client side, and that actually basically stands for
>>>> almost any use of (non aggregate) function in Cassandra currently.
>>>>
>>>> Cassandra's changing philosophy over which entity should create such
>>>> information (client, server, or driver) does not make this problem easy.
>>>>
>>>> If you take into account that you have users who do not understand all
>>>> the intricacies of UUIDs, the problem is compounded. I.e., how does one
>>>> generate a UUID in each of C#, Python, Java, etc., with the 47 random
>>>> bits of bla bla? That is not super easy information to find. Maybe you
>>>> find a Stack Overflow post that actually gives bad advice, etc.
>>>>
>>>> Many times in Cassandra you are using a UUID because you do not have a
>>>> unique key in the insert and you wish to create one. If you are inserting
>>>> more than a single record using that same UUID and you do not want the
>>>> burden of generating it yourself, you would have to do
>>>> write -> read -> write, which is an anti-pattern.
>>>>
>>>
>> Not multiple ids for a single row. The same id for multiple inserts in a
>> batch.
>>
>> For example, let's say I have an application where my data has no unique
>> key.
>>
>> Table poke
>> Poker, pokee, time
>>
>> Suppose I consume pokes from Kafka, build a batch of 30k, and insert them.
>> You probably want to denormalize into two tables:
>> Primary key (poker, time)
>> Primary key (pokee,time)
>>
>> It makes sense that they all have the same uuid if you want it to be the
>> uuid of the batch. This would make it easy to correlate all the events.
>> Easy to delete them all as well.
>>
>> The "do it client side" argument is totally valid, but it has been a
>> justification for not adding features, many of which are eventually added
>> anyway.
>>
>>
>>
>>
>> --
>> Sorry this was sent from mobile. Will do less grammar and spell check
>> than usual.
>>
>
Debatable.

Cassandra, for example, always said batch mutations happen all at once, but
it was not until SnapTree that you could see effects of half a batch. Even
now a multi-partition batch does not happen all at once.

What people expect does not always align with reality. Point me to a
unit test that documents said behavior and proves it does not change.

Maybe people expect a query planner to fold constants; many people might
think a smart query engine could memoize calls to the same function with
no args; many expect that things happen in isolation.
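
To make the two expectations concrete, here is a minimal sketch in Python
of "evaluate per row" versus "memoize once per statement" for a no-arg
function. Everything in it is hypothetical and only for illustration: the
`FUNCTIONS` registry, `run_statement`, and the `now_once` name are not
Cassandra or Hive APIs, and `uuid.uuid1()` merely stands in for a
server-side timeuuid generator.

```python
import uuid
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical registry: name -> (function, evaluate_once_per_statement).
# The boolean plays the role of a planner hint such as Hive's @UDFType.
FUNCTIONS: Dict[str, Tuple[Callable[[], Any], bool]] = {
    "now": (uuid.uuid1, False),      # fresh version-1 UUID for every row
    "now_once": (uuid.uuid1, True),  # one version-1 UUID per statement
}


def run_statement(rows: List[dict], func_name: str) -> List[dict]:
    """Attach an 'id' column to every row of one statement (or batch)."""
    func, once = FUNCTIONS[func_name]
    cached: Dict[str, Any] = {}  # memoized values live for one statement only
    result = []
    for row in rows:
        if once:
            # Evaluate on first use, then reuse for the rest of the statement.
            if func_name not in cached:
                cached[func_name] = func()
            value = cached[func_name]
        else:
            value = func()
        result.append({**row, "id": value})
    return result


if __name__ == "__main__":
    pokes = [{"poker": "a", "pokee": "b"}, {"poker": "c", "pokee": "d"}]
    print(run_statement(pokes, "now"))       # ids differ per row
    print(run_statement(pokes, "now_once"))  # ids are identical in the batch
```

With `now_once`, every row in the batch carries the same id, which is the
"uuid of the batch" use case; with `now`, each row gets its own. Which one
a user expects by default is exactly the debatable part.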


-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.
