Re: [EXTERNAL] Howto avoid tombstones when inserting NULL values

Jonathan Haddad Fri, 04 Jan 2019 10:06:49 -0800

If you're overwriting values, it really doesn't matter much if it's a
tombstone or any other value, they still need to be compacted and have the
same overhead at read time.


Tombstones are problematic when you try to use Cassandra as a queue (or
something like a queue) and you need to scan over thousands of tombstones
in order to get to the real data.  You're simply overwriting a row and
trying to avoid a single tombstone.

Maybe I'm missing something here.  Why do you think overwriting a single
cell with a tombstone is any worse than overwriting a single cell with a
value?

Jon


On Fri, Jan 4, 2019 at 9:57 AM Tomas Bartalos <tomas.barta...@gmail.com>
wrote:

> Hello,
>
> I beleive your approach is the same as using spark with "
> spark.cassandra.output.ignoreNulls=true"
> This will not cover the situation when a value have to be overwriten with
> null.
>
> I found one possible solution - change the schema to keep only primary key
> fields and move all other fields to frozen UDT.
> create table (year, month, day, id, frozen<Event>, primary key((year,
> month, day), id) )
> In this way anything that is null inside event doesn't create tombstone,
> since event is serialized to BLOB.
> The penalty is in need of deserializing the whole Event when selecting
> only few columns.
> Can anyone confirm if this is good solution performance wise?
>
> Thank you,
>
> On Fri, 4 Jan 2019, 2:20 pm DuyHai Doan <doanduy...@gmail.com wrote:
>
>> "The problem is I can't know the combination of set/unset values" -->
>> Just for this requirement, Achilles has a working solution for many years
>> using INSERT_NOT_NULL_FIELDS strategy:
>>
>> https://github.com/doanduyhai/Achilles/wiki/Insert-Strategy
>>
>> Or you can use the Update API that by design only perform update on not
>> null fields:
>> https://github.com/doanduyhai/Achilles/wiki/Quick-Reference#updating-all-non-null-fields-for-an-entity
>>
>>
>> Behind the scene, for each new combination of INSERT INTO table(x,y,z)
>> statement, Achilles will check its prepared statement cache and if the
>> statement does not exist yet, create a new prepared statement and put it
>> into the cache for later re-use for you
>>
>> Disclaiment: I'm the creator of Achilles
>>
>>
>>
>> On Thu, Dec 27, 2018 at 10:21 PM Tomas Bartalos <tomas.barta...@gmail.com>
>> wrote:
>>
>>> Hello,
>>>
>>> The problem is I can't know the combination of set/unset values. From my
>>> perspective every value should be set. The event from Kafka represents the
>>> complete state of the happening at certain point in time. In my table I
>>> want to store the latest event so the most recent state of the happening
>>> (in this table I don't care about the history). Actually I used wrong
>>> expression since its just the opposite of "incremental update", every event
>>> carries all data (state) for specific point of time.
>>>
>>> The event is represented with nested json structure. Top level elements
>>> of the json are table fields with type like text, boolean, timestamp, list
>>> and the nested elements are UDT fields.
>>>
>>> Simplified example:
>>> There is a new purchase for the happening, event:
>>> {total_amount: 50, items : [A, B, C, new_item], purchase_time :
>>> '2018-12-27 13:30', specials: null, customer : {... }, fare_amount,...}
>>> I don't know what actually happened for this event, maybe there is a new
>>> item purchased, maybe some customer info have been changed, maybe the
>>> specials have been revoked and I have to reset them. I just need to store
>>> the state as it artived from Kafka, there might already be an event for
>>> this happening saved before, or maybe this is the first one.
>>>
>>> BR,
>>> Tomas
>>>
>>>
>>> On Thu, 27 Dec 2018, 9:36 pm Eric Stevens <migh...@gmail.com wrote:
>>>
>>>> Depending on the use case, creating separate prepared statements for
>>>> each combination of set / unset values in large INSERT/UPDATE statements
>>>> may be prohibitive.
>>>>
>>>> Instead, you can look into driver level support for UNSET values.
>>>> Requires Cassandra 2.2 or later IIRC.
>>>>
>>>> See:
>>>> Java Driver:
>>>> https://docs.datastax.com/en/developer/java-driver/3.0/manual/statements/prepared/#parameters-and-binding
>>>> Python Driver:
>>>> https://www.datastax.com/dev/blog/python-driver-2-6-0-rc1-with-cassandra-2-2-features#distinguishing_between_null_and_unset_values
>>>> Node Driver:
>>>> https://docs.datastax.com/en/developer/nodejs-driver/3.5/features/datatypes/nulls/#unset
>>>>
>>>> On Thu, Dec 27, 2018 at 3:21 PM Durity, Sean R <
>>>> sean_r_dur...@homedepot.com> wrote:
>>>>
>>>>> You say the events are incremental updates. I am interpreting this to
>>>>> mean only some columns are updated. Others should keep their original
>>>>> values.
>>>>>
>>>>> You are correct that inserting null creates a tombstone.
>>>>>
>>>>> Can you only insert the columns that actually have new values? Just
>>>>> skip the columns with no information. (Make the insert generator a bit
>>>>> smarter.)
>>>>>
>>>>> Create table happening (id text primary key, event text, a text, b
>>>>> text, c text);
>>>>> Insert into table happening (id, event, a, b, c) values
>>>>> ("MainEvent","The most complete info we have right now","Priceless","10
>>>>> pm","Grand Ballroom");
>>>>> -- b changes
>>>>> Insert into happening (id, b) values ("MainEvent","9:30 pm");
>>>>>
>>>>>
>>>>> Sean Durity
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Tomas Bartalos <tomas.barta...@gmail.com>
>>>>> Sent: Thursday, December 27, 2018 9:27 AM
>>>>> To: user@cassandra.apache.org
>>>>> Subject: [EXTERNAL] Howto avoid tombstones when inserting NULL values
>>>>>
>>>>> Hello,
>>>>>
>>>>> I’d start with describing my use case and how I’d like to use
>>>>> Cassandra to solve my storage needs.
>>>>> We're processing a stream of events for various happenings. Every
>>>>> event have a unique happening_id.
>>>>> One happening may have many events, usually ~ 20-100 events. I’d like
>>>>> to store only the latest event for the same happening (Event is an
>>>>> incremental update and it contains all up-to date data about happening).
>>>>> Technically the events are streamed from Kafka, processed with Spark
>>>>> an saved to Cassandra.
>>>>> In Cassandra we use upserts (insert with same primary key).  So far so
>>>>> good, however there comes the tombstone...
>>>>>
>>>>> When I’m inserting field with NULL value, Cassandra creates tombstone
>>>>> for this field. As I understood this is due to space efficiency, Cassandra
>>>>> doesn’t have to remember there is a NULL value, she just deletes the
>>>>> respective column and a delete creates a ... tombstone.
>>>>> I was hoping there could be an option to tell Cassandra not to be so
>>>>> space effective and store “unset" info without generating tombstones.
>>>>> Something similar to inserting empty strings instead of null values:
>>>>>
>>>>> CREATE TABLE happening (id text PRIMARY KEY, event text); insert into
>>>>> happening (‘1’, ‘event1’); — tombstone is generated insert into happening
>>>>> (‘1’, null); — tombstone is not generated insert into happening (‘1’, '’);
>>>>>
>>>>> Possible solutions:
>>>>> 1. Disable tombstones with gc_grace_seconds = 0 or set to reasonable
>>>>> low value (1 hour ?) . Not good, since phantom data may re-appear 2. 
>>>>> ignore
>>>>> NULLs on spark side with “spark.cassandra.output.ignoreNulls=true”. Not
>>>>> good since this will never overwrite previously inserted event field with
>>>>> “empty” one.
>>>>> 3. On inserts with spark, find all NULL values and replace them with
>>>>> “empty” equivalent (empty string for text, 0 for integer). Very 
>>>>> inefficient
>>>>> and problematic to find “empty” equivalent for some data types.
>>>>>
>>>>> Until tombstones appeared Cassandra was the right fit for our use
>>>>> case, however now I’m not sure if we’re heading the right direction.
>>>>> Could you please give me some advice how to solve this problem ?
>>>>>
>>>>> Thank you,
>>>>> Tomas
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>>>>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>>>>
>>>>>
>>>>> ________________________________
>>>>>
>>>>> The information in this Internet Email is confidential and may be
>>>>> legally privileged. It is intended solely for the addressee. Access to 
>>>>> this
>>>>> Email by anyone else is unauthorized. If you are not the intended
>>>>> recipient, any disclosure, copying, distribution or any action taken or
>>>>> omitted to be taken in reliance on it, is prohibited and may be unlawful.
>>>>> When addressed to our clients any opinions or advice contained in this
>>>>> Email are subject to the terms and conditions expressed in any applicable
>>>>> governing The Home Depot terms of business or client engagement letter. 
>>>>> The
>>>>> Home Depot disclaims all responsibility and liability for the accuracy and
>>>>> content of this attachment and for any damages or losses arising from any
>>>>> inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or other
>>>>> items of a destructive nature, which may be contained in this attachment
>>>>> and shall not be liable for direct, indirect, consequential or special
>>>>> damages in connection with this e-mail message or its attachment.
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>>>>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>>>>
>>>>

-- 
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade

Re: [EXTERNAL] Howto avoid tombstones when inserting NULL values

Reply via email to