If you're overwriting values, it really doesn't matter much whether it's a tombstone or any other value; they still need to be compacted and have the same overhead at read time.
Tombstones are problematic when you try to use Cassandra as a queue (or something like a queue) and you need to scan over thousands of tombstones in order to get to the real data. You're simply overwriting a row and trying to avoid a single tombstone.

Maybe I'm missing something here. Why do you think overwriting a single cell with a tombstone is any worse than overwriting a single cell with a value?

Jon

On Fri, Jan 4, 2019 at 9:57 AM Tomas Bartalos <tomas.barta...@gmail.com> wrote:

> Hello,
>
> I believe your approach is the same as using Spark with
> "spark.cassandra.output.ignoreNulls=true".
> This will not cover the situation when a value has to be overwritten with
> null.
>
> I found one possible solution: change the schema to keep only primary key
> fields and move all other fields to a frozen UDT.
> create table (year, month, day, id, frozen<Event>, primary key((year,
> month, day), id) )
> In this way anything that is null inside event doesn't create a tombstone,
> since event is serialized to a BLOB.
> The penalty is the need to deserialize the whole Event when selecting
> only a few columns.
> Can anyone confirm whether this is a good solution performance-wise?
>
> Thank you,
>
> On Fri, 4 Jan 2019, 2:20 pm DuyHai Doan <doanduy...@gmail.com> wrote:
>
>> "The problem is I can't know the combination of set/unset values" -->
>> Just for this requirement, Achilles has had a working solution for many
>> years using the INSERT_NOT_NULL_FIELDS strategy:
>>
>> https://github.com/doanduyhai/Achilles/wiki/Insert-Strategy
>>
>> Or you can use the Update API, which by design only performs updates on
>> non-null fields:
>> https://github.com/doanduyhai/Achilles/wiki/Quick-Reference#updating-all-non-null-fields-for-an-entity
>>
>> Behind the scenes, for each new combination of INSERT INTO table(x,y,z)
>> statement, Achilles will check its prepared statement cache and, if the
>> statement does not exist yet, create a new prepared statement and put it
>> into the cache for later re-use.
>>
>> Disclaimer: I'm the creator of Achilles
>>
>> On Thu, Dec 27, 2018 at 10:21 PM Tomas Bartalos <tomas.barta...@gmail.com>
>> wrote:
>>
>>> Hello,
>>>
>>> The problem is I can't know the combination of set/unset values. From my
>>> perspective every value should be set. The event from Kafka represents the
>>> complete state of the happening at a certain point in time. In my table I
>>> want to store the latest event, that is, the most recent state of the
>>> happening (in this table I don't care about the history). Actually I used
>>> the wrong expression, since it's just the opposite of an "incremental
>>> update": every event carries all data (state) for a specific point in time.
>>>
>>> The event is represented as a nested JSON structure. Top-level elements
>>> of the JSON are table fields with types like text, boolean, timestamp, list,
>>> and the nested elements are UDT fields.
>>>
>>> Simplified example:
>>> There is a new purchase for the happening, event:
>>> {total_amount: 50, items : [A, B, C, new_item], purchase_time :
>>> '2018-12-27 13:30', specials: null, customer : {...
>>> }, fare_amount,...}
>>> I don't know what actually happened for this event: maybe there is a new
>>> item purchased, maybe some customer info has been changed, maybe the
>>> specials have been revoked and I have to reset them. I just need to store
>>> the state as it arrived from Kafka; there might already be an event for
>>> this happening saved before, or maybe this is the first one.
>>>
>>> BR,
>>> Tomas
>>>
>>>
>>> On Thu, 27 Dec 2018, 9:36 pm Eric Stevens <migh...@gmail.com> wrote:
>>>
>>>> Depending on the use case, creating separate prepared statements for
>>>> each combination of set / unset values in large INSERT/UPDATE statements
>>>> may be prohibitive.
>>>>
>>>> Instead, you can look into driver-level support for UNSET values.
>>>> Requires Cassandra 2.2 or later IIRC.
>>>>
>>>> See:
>>>> Java Driver:
>>>> https://docs.datastax.com/en/developer/java-driver/3.0/manual/statements/prepared/#parameters-and-binding
>>>> Python Driver:
>>>> https://www.datastax.com/dev/blog/python-driver-2-6-0-rc1-with-cassandra-2-2-features#distinguishing_between_null_and_unset_values
>>>> Node Driver:
>>>> https://docs.datastax.com/en/developer/nodejs-driver/3.5/features/datatypes/nulls/#unset
>>>>
>>>> On Thu, Dec 27, 2018 at 3:21 PM Durity, Sean R
>>>> <sean_r_dur...@homedepot.com> wrote:
>>>>
>>>>> You say the events are incremental updates. I am interpreting this to
>>>>> mean only some columns are updated. Others should keep their original
>>>>> values.
>>>>>
>>>>> You are correct that inserting null creates a tombstone.
>>>>>
>>>>> Can you only insert the columns that actually have new values? Just
>>>>> skip the columns with no information. (Make the insert generator a bit
>>>>> smarter.)
>>>>>
>>>>> Create table happening (id text primary key, event text, a text, b
>>>>> text, c text);
>>>>> Insert into happening (id, event, a, b, c) values
>>>>> ('MainEvent', 'The most complete info we have right now', 'Priceless',
>>>>> '10 pm', 'Grand Ballroom');
>>>>> -- b changes
>>>>> Insert into happening (id, b) values ('MainEvent', '9:30 pm');
>>>>>
>>>>>
>>>>> Sean Durity
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Tomas Bartalos <tomas.barta...@gmail.com>
>>>>> Sent: Thursday, December 27, 2018 9:27 AM
>>>>> To: user@cassandra.apache.org
>>>>> Subject: [EXTERNAL] Howto avoid tombstones when inserting NULL values
>>>>>
>>>>> Hello,
>>>>>
>>>>> I’d start with describing my use case and how I’d like to use
>>>>> Cassandra to solve my storage needs.
>>>>> We're processing a stream of events for various happenings. Every
>>>>> event has a unique happening_id.
>>>>> One happening may have many events, usually ~20-100 events. I’d like
>>>>> to store only the latest event for the same happening (an event is an
>>>>> incremental update and it contains all up-to-date data about the
>>>>> happening).
>>>>> Technically the events are streamed from Kafka, processed with Spark
>>>>> and saved to Cassandra.
>>>>> In Cassandra we use upserts (insert with the same primary key). So far so
>>>>> good, however there comes the tombstone...
>>>>>
>>>>> When I’m inserting a field with a NULL value, Cassandra creates a
>>>>> tombstone for this field. As I understand it, this is due to space
>>>>> efficiency: Cassandra doesn’t have to remember there is a NULL value, it
>>>>> just deletes the respective column, and a delete creates a ... tombstone.
>>>>> I was hoping there could be an option to tell Cassandra not to be so
>>>>> space efficient and store “unset” info without generating tombstones.
>>>>> Something similar to inserting empty strings instead of null values:
>>>>>
>>>>> CREATE TABLE happening (id text PRIMARY KEY, event text);
>>>>> insert into happening (id, event) values ('1', 'event1');
>>>>> -- tombstone is generated
>>>>> insert into happening (id, event) values ('1', null);
>>>>> -- tombstone is not generated
>>>>> insert into happening (id, event) values ('1', '');
>>>>>
>>>>> Possible solutions:
>>>>> 1. Disable tombstones with gc_grace_seconds = 0, or set it to a
>>>>> reasonably low value (1 hour?). Not good, since phantom data may
>>>>> re-appear.
>>>>> 2. Ignore NULLs on the Spark side with
>>>>> “spark.cassandra.output.ignoreNulls=true”. Not good, since this will
>>>>> never overwrite a previously inserted event field with an “empty” one.
>>>>> 3. On inserts with Spark, find all NULL values and replace them with an
>>>>> “empty” equivalent (empty string for text, 0 for integer). Very
>>>>> inefficient, and it is problematic to find an “empty” equivalent for some
>>>>> data types.
>>>>>
>>>>> Until tombstones appeared, Cassandra was the right fit for our use
>>>>> case; however, now I’m not sure we’re heading in the right direction.
>>>>> Could you please give me some advice on how to solve this problem?
>>>>>
>>>>> Thank you,
>>>>> Tomas
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>>>>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>>>>
>>>>>
>>>>> ________________________________
>>>>>
>>>>> The information in this Internet Email is confidential and may be
>>>>> legally privileged. It is intended solely for the addressee. Access to
>>>>> this Email by anyone else is unauthorized. If you are not the intended
>>>>> recipient, any disclosure, copying, distribution or any action taken or
>>>>> omitted to be taken in reliance on it, is prohibited and may be unlawful.
>>>>> When addressed to our clients any opinions or advice contained in this
>>>>> Email are subject to the terms and conditions expressed in any applicable
>>>>> governing The Home Depot terms of business or client engagement letter.
>>>>> The Home Depot disclaims all responsibility and liability for the accuracy
>>>>> and content of this attachment and for any damages or losses arising from
>>>>> any inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or
>>>>> other items of a destructive nature, which may be contained in this
>>>>> attachment and shall not be liable for direct, indirect, consequential or
>>>>> special damages in connection with this e-mail message or its attachment.
>>>>>
>>>>
--
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade
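
For reference, two suggestions from this thread fit together: inserting only the columns that actually carry a value, and keeping one prepared statement per combination of columns. They can be sketched roughly as below. This is a minimal Python sketch under stated assumptions, not Achilles' actual implementation: `build_insert`, `StatementCache` and `fake_prepare` are hypothetical names, and the `prepare` callback stands in for whatever the driver offers (with the DataStax Python driver that would be `Session.prepare`). Keep in mind the limitation already raised above: skipping nulls never clears a value that was written earlier.

```python
from typing import Any, Callable, Dict, Tuple


def build_insert(table: str, row: Dict[str, Any]) -> Tuple[str, Tuple[Any, ...]]:
    """Build a CQL INSERT that names only the columns whose value is not None."""
    # Sort the column names so the same set of columns always yields the same text.
    columns = sorted(name for name, value in row.items() if value is not None)
    if not columns:
        raise ValueError("row has no non-null columns")
    placeholders = ", ".join("?" for _ in columns)
    cql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    return cql, tuple(row[name] for name in columns)


class StatementCache:
    """Keep one prepared statement per distinct statement text (hypothetical helper)."""

    def __init__(self, prepare: Callable[[str], Any]) -> None:
        self._prepare = prepare  # assumption: e.g. session.prepare with a real driver
        self._cache: Dict[str, Any] = {}

    def get(self, cql: str) -> Any:
        # Prepare only the first time a given column combination is seen.
        if cql not in self._cache:
            self._cache[cql] = self._prepare(cql)
        return self._cache[cql]


# Stand-in for a driver's prepare call so the sketch runs without a cluster.
prepared_log = []


def fake_prepare(cql: str) -> str:
    prepared_log.append(cql)
    return cql


cache = StatementCache(fake_prepare)

full = {"id": "MainEvent", "event": "The most complete info we have right now",
        "a": "Priceless", "b": "10 pm", "c": "Grand Ballroom"}
partial = {"id": "MainEvent", "event": None, "a": None, "b": "9:30 pm", "c": None}

for row in (full, partial, partial):
    cql, params = build_insert("happening", row)
    statement = cache.get(cql)
    # With a real driver this would be session.execute(statement, params).
    print(statement, params)
```

The third row reuses the statement prepared for the second one, so only two statements are ever prepared here. How many distinct combinations show up in practice depends on the data, which is the concern Eric raises about this approach for wide tables.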