woops, was obviously tired; what I said clearly doesn't make sense.

On 10 June 2016 at 14:52, kurt Greaves <k...@instaclustr.com> wrote:
> Sorry, I did mean a larger number of rows per partition.
>
> On 9 June 2016 at 10:12, John Thomas <jthom...@gmail.com> wrote:
>
>> The example I gave was for when N=1; if we need to save more values I
>> planned to just add more columns.
>>
>> On Thu, Jun 9, 2016 at 12:51 AM, kurt Greaves <k...@instaclustr.com>
>> wrote:
>>
>>> I would say it's probably due to a significantly larger number of
>>> partitions when using the overwrite method - but really you should be
>>> seeing similar performance unless one of the schemas ends up generating
>>> a lot more disk IO.
>>> If you're planning to read the last N values for an event at the same
>>> time, the widerow schema would be better; otherwise, reading N events
>>> using the overwrite schema will result in you hitting N partitions. You
>>> really need to take into account how you're going to read the data when
>>> you design a schema, not only how many writes you can push through.
>>>
>>> On 8 June 2016 at 19:02, John Thomas <jthom...@gmail.com> wrote:
>>>
>>>> We have a use case where we are storing event data for a given system
>>>> and only want to retain the last N values. Storing extra values for
>>>> some time, as long as it isn't too long, is fine, but never fewer
>>>> than N. We can't use TTLs to delete the data because we can't be sure
>>>> how frequently events will arrive and could end up losing everything.
>>>> Is there any built-in mechanism to accomplish this, or a known
>>>> pattern that we can follow? The events will be read and written at a
>>>> pretty high frequency, so the solution would have to be performant
>>>> and not fragile under stress.
>>>>
>>>> We've played with a schema that just has N distinct columns with one
>>>> value in each, but have found overwrites seem to perform much more
>>>> poorly than wide rows.
>>>> The use case we tested only required that we store the most recent
>>>> value:
>>>>
>>>> CREATE TABLE eventvalue_overwrite (
>>>>     system_name text,
>>>>     event_name text,
>>>>     event_time timestamp,
>>>>     event_value blob,
>>>>     PRIMARY KEY (system_name, event_name))
>>>>
>>>> CREATE TABLE eventvalue_widerow (
>>>>     system_name text,
>>>>     event_name text,
>>>>     event_time timestamp,
>>>>     event_value blob,
>>>>     PRIMARY KEY ((system_name, event_name), event_time))
>>>> WITH CLUSTERING ORDER BY (event_time DESC)
>>>>
>>>> We tested against the DataStax AMI on EC2 with 6 nodes, replication
>>>> factor 3, write consistency 2, and default settings, using a
>>>> write-only workload, and got 190K writes/s for the wide row schema
>>>> and 150K/s for the overwrite schema. Thinking through the write path,
>>>> it seems the performance should be pretty similar, with probably
>>>> smaller SSTables for the overwrite schema; can anyone explain the big
>>>> difference?
>>>>
>>>> The wide row solution is more complex in that it requires a separate
>>>> cleanup thread to handle deleting the extra values. If that's the
>>>> path we have to follow, we're thinking we'd add a bucket of some sort
>>>> so that we can delete an entire partition at a time after copying
>>>> some values forward, on the assumption that deleting the whole
>>>> partition is much better than deleting some slice of the partition.
>>>> Is that true? Also, is there any difference between setting a really
>>>> short TTL and doing a delete?
>>>>
>>>> I know there are a lot of questions in there, but we've been going
>>>> back and forth on this for a while and I'd really appreciate any help
>>>> you could give.
>>>> Thanks,
>>>> John

--
Kurt Greaves
k...@instaclustr.com
www.instaclustr.com
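[Editorial note: the bucketed wide-row pattern discussed above can be sketched in plain Python. This is a hypothetical simulation, not code from the thread: the names (`EventStore`, `bucket_for`, `BUCKET_SECONDS`) and the in-memory dict standing in for Cassandra partitions are all illustrative. The idea it demonstrates: each (system, event, bucket) would be one partition; reads scan the newest buckets until N values are found; cleanup drops whole old buckets (i.e. whole partitions, one tombstone each) once the newer buckets alone hold at least N values. This variant keeps whole old buckets until they are no longer needed rather than copying values forward, which avoids the copy step entirely.]

```python
# Hypothetical sketch of the bucketed wide-row retention pattern.
# In Cassandra terms the partition key would be (system_name, event_name,
# bucket), so cleanup deletes an entire partition at once instead of
# slicing rows out of a single ever-growing partition.

BUCKET_SECONDS = 3600  # assumed bucket width; tune to your event frequency


def bucket_for(event_time):
    """Map an event timestamp (epoch seconds) to its bucket id."""
    return event_time // BUCKET_SECONDS


class EventStore:
    def __init__(self, n_keep):
        self.n_keep = n_keep   # retain at least the last N values
        self.buckets = {}      # bucket_id -> list of (event_time, value)

    def write(self, event_time, value):
        self.buckets.setdefault(bucket_for(event_time), []).append(
            (event_time, value))

    def read_last_n(self):
        """Scan buckets newest-first until N values are collected."""
        out = []
        for b in sorted(self.buckets, reverse=True):
            out.extend(sorted(self.buckets[b], reverse=True))
            if len(out) >= self.n_keep:
                break
        return out[:self.n_keep]

    def cleanup(self):
        """Drop whole old buckets once newer buckets hold >= N values."""
        kept, count = set(), 0
        for b in sorted(self.buckets, reverse=True):
            kept.add(b)
            count += len(self.buckets[b])
            if count >= self.n_keep:
                break
        for b in list(self.buckets):
            if b not in kept:
                del self.buckets[b]  # one whole-partition delete per bucket
```

In Cassandra the `cleanup` step would issue one `DELETE ... WHERE system_name = ? AND event_name = ? AND bucket = ?` per dropped bucket, producing a single partition-level tombstone rather than many row tombstones, which is the intuition behind preferring whole-partition deletes.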