On Saturday, December 3, 2016, Jonathan Haddad <j...@jonhaddad.com> wrote:
> That isn't what the original thread is about. The thread is about the
> timestamp portion of the UUID being different.
>
> Having UUID() return the same thing for all rows in a batch would be the
> unexpected thing virtually every time.
> On Sat, Dec 3, 2016 at 7:09 AM Edward Capriolo <edlinuxg...@gmail.com> wrote:
>
>> On Friday, December 2, 2016, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>
>>> This isn't about using the same UUID though. It's about the timestamp
>>> bits in the UUID.
>>>
>>> What is the use case for generating multiple UUIDs in a single row? Why
>>> do you need to extract the timestamp out of both?
>>> On Fri, Dec 2, 2016 at 10:24 AM Edward Capriolo <edlinuxg...@gmail.com>
>>> wrote:
>>>
>>>> On Thu, Dec 1, 2016 at 11:09 AM, Sylvain Lebresne <sylv...@datastax.com>
>>>> wrote:
>>>>
>>>>> On Thu, Dec 1, 2016 at 4:44 PM, Edward Capriolo <edlinuxg...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I am not sure you saw my reply on thread but I believe everyone's
>>>>>> needs can be met. I will copy that here:
>>>>>>
>>>>>
>>>>> I saw it, but the real problem that was raised initially was not that
>>>>> of UDF and of allowing both behaviors. It's a matter of people being
>>>>> confused by the behavior of a non-UDF function, now(), and suggesting it
>>>>> should be changed.
>>>>>
>>>>> The Hive idea is interesting I guess, and we can switch to discussing
>>>>> that, but it's a different problem really and I'm not fond of derailing
>>>>> threads. I will just note though that if we're not talking about a
>>>>> confusion issue but rather how to get a timeuuid to be fixed within a
>>>>> statement, then there is a much, much more trivial solution: generate it
>>>>> client side.
>>>>> The `now()` function is a small convenience but there is
>>>>> nothing you cannot do without it client side, and that actually basically
>>>>> stands for almost any use of (non aggregate) function in Cassandra
>>>>> currently.
>>>>>
>>>>>>
>>>>>> "Food for thought: Hive's UDFs introduced an annotation
>>>>>> @UDFType(deterministic = false)
>>>>>>
>>>>>> http://dmtolpeko.com/2014/10/15/invoking-stateful-udf-at-map-and-reduce-side-in-hive/
>>>>>>
>>>>>> The effect is the query planner can see when such a UDF is in use and
>>>>>> determine the value once at the start of a very long query."
>>>>>>
>>>>>> Essentially Hive had a similar if not identical problem: during a
>>>>>> long running distributed process like map/reduce some users wanted the
>>>>>> semantics of:
>>>>>>
>>>>>> 1) Each call should have a new timestamp
>>>>>>
>>>>>> While other users wanted the semantics of:
>>>>>>
>>>>>> 2) Each call should generate the same timestamp
>>>>>>
>>>>>> The solution implemented was to add an annotation to the UDF such that
>>>>>> the query planner would pick up the annotation and act accordingly.
>>>>>>
>>>>>> (Here is a related issue: https://issues.apache.org/jira/browse/HIVE-1986)
>>>>>>
>>>>>> As a result you can essentially implement two UDFs
>>>>>>
>>>>>> @UDFType(deterministic = false)
>>>>>> public class UDFNow
>>>>>>
>>>>>> and for the other people
>>>>>>
>>>>>> @UDFType(deterministic = true)
>>>>>> public class UDFNowOnce extends UDFNow
>>>>>>
>>>>>> Both use cases are met in a sensible way.
>>>>>>
>>>>>
>>>> The `now()` function is a small convenience but there is nothing you
>>>> cannot do without it client side, and that actually basically stands for
>>>> almost any use of (non aggregate) function in Cassandra currently.
>>>>
>>>> Cassandra's changing philosophy over which entity should create such
>>>> information (client/server/driver) does not make this problem easy.
>>>>
>>>> If you take into account that you have users who do not understand all
>>>> the intricacy of uuid the problem is compounded. I.e., how does one
>>>> generate a UUID in each of C#, Python, Java, etc., with the 47 random
>>>> bits of bla bla? That is not super easy information to find. Maybe you
>>>> find a Stack Overflow post that actually gives bad advice etc.
>>>>
>>>> Many times in Cassandra you are using a uuid because you do not have a
>>>> unique key in the insert and you wish to create one. If you are inserting
>>>> more than a single record using that same UUID and you do not want the
>>>> burden of doing it yourself you would have to do write>>read>>write
>>>> which is an anti-pattern.
>>>>
>>>
>> Not multiple ids for a single row. The same id for multiple inserts in a
>> batch.
>>
>> For example let's say I have an application where my data has no unique
>> key.
>>
>> Table poke
>> Poker, pokee, time
>>
>> Suppose I consume pokes from Kafka, build a batch of 30k and insert them.
>> You probably want to denormalize into two tables:
>> Primary key (poker, time)
>> Primary key (pokee, time)
>>
>> It makes sense that they all have the same uuid if you want it to be the
>> uuid of the batch. This would make it easy to correlate all the events.
>> Easy to delete them all as well.
>>
>> The do-it-client-side argument is totally valid, but has been a
>> justification for not adding features, many of which are eventually added
>> anyway.
>>
>> --
>> Sorry this was sent from mobile. Will do less grammar and spell check
>> than usual.
>>
>

Debatable. Cassandra, for example, always said batch mutations happen all at once, but it was not until SnapTree that you could see effects of half a batch. Even now a multi-partition batch does not happen all at once. What people expect does not always align with reality. Point me to a unit test that documents said behavior and proves it does not change.
Maybe people expect a query planner to fold constants. Many people might think a smart query engine could memoize calls to the same function with no args. Many expect that things happen in isolation.

--
Sorry this was sent from mobile. Will do less grammar and spell check than usual.
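
To make the two expectations discussed in this thread concrete, here is a minimal Python sketch. It is not Cassandra or Hive code: `Statement`, `now_per_call`, `now_once`, and `build_rows` are hypothetical names made up for illustration, and the standard library's `uuid.uuid1()` (a time-based, version 1 UUID) merely stands in for a server-side `now()`. It contrasts a zero-argument function evaluated on every call with one memoized for the lifetime of a single statement.

```python
import uuid


class Statement:
    """Hypothetical stand-in for one query or batch; not a real Cassandra/Hive API."""

    def __init__(self):
        self._cached_now = None  # memo slot, scoped to this statement only

    def now_per_call(self):
        # Semantics 1: every call produces a fresh time-based (version 1) UUID.
        return uuid.uuid1()

    def now_once(self):
        # Semantics 2: the first call fixes the value; later calls within the
        # same statement reuse it (memoization of a zero-argument function).
        if self._cached_now is None:
            self._cached_now = uuid.uuid1()
        return self._cached_now


def build_rows(pokes, now_fn):
    # Attach an id to each (poker, pokee) pair using the chosen semantics.
    return [(poker, pokee, now_fn()) for poker, pokee in pokes]


# Made-up sample data in the shape of the "poke" example above.
pokes = [("alice", "bob"), ("bob", "carol"), ("carol", "alice")]

stmt = Statement()
fresh = build_rows(pokes, stmt.now_per_call)  # one id per row
shared = build_rows(pokes, stmt.now_once)     # one id for the whole batch

print("distinct ids, per call:     ", len({row[2] for row in fresh}))
print("distinct ids, per statement:", len({row[2] for row in shared}))
```

Which of the two a user "expects" is exactly the disagreement here; the sketch only shows that both are easy to express once the scope of the memo (per call versus per statement) is made explicit.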