I need help clearing something up. So I read this: http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
And in it he says: “Likewise, writing bad data has a clear path to recovery: delete the bad data and precompute the queries again. Since data is immutable and the master dataset is append-only, writing bad data does not override or otherwise destroy good data.”

That sentence makes no sense to me. Data is immutable -> the master dataset is append-only -> delete the bad data? What?

He gives an example where, in the batch layer, you store raw files in HDFS. My understanding is that you can't do row-level deletes on files in HDFS (because it's append-only). What am I missing here? (I've put a rough sketch of what I mean by a row-level delete in the P.S. below.)

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData
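P.S. To make my confusion concrete, here is roughly what I picture a "row-level delete" would have to look like against an immutable, append-only store: read the whole file, skip the records you want gone, write a brand-new file, and swap it in. This is only a toy sketch against the local filesystem; the path and the is_bad() predicate are made up, and on HDFS the equivalent would presumably be a job that rewrites the affected files rather than an in-place edit.

import os

def is_bad(record: str) -> bool:
    # Hypothetical predicate: however you identify the bad data.
    return record.startswith("corrupt|")

def delete_bad_records(path: str) -> None:
    # "Deleting" a row here means copying every good record into a
    # brand-new file and then swapping the new file in for the old one.
    tmp_path = path + ".rewrite"
    with open(path) as src, open(tmp_path, "w") as dst:
        for record in src:
            if not is_bad(record):
                dst.write(record)
    os.replace(tmp_path, path)

# e.g. delete_bad_records("/data/master/2014-08-13.log")  # made-up path

If that is really the model, then "delete the bad data" would seem to mean "rewrite the files minus the bad records and recompute the views," which is what I'm trying to confirm.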
