Deletion is typically done by running a job that copies the master dataset into a new folder, filtering out the bad data along the way. This is expensive, but that's OK since it's only done in rare circumstances. When I've done this in the past, I've been extra careful before deleting the corrupted master dataset: I collect stats before and after to make sure I've filtered out only the bad stuff.
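To make that concrete, here's a minimal sketch of that kind of filtering job, assuming Spark over HDFS; the paths and the is_bad_record predicate are hypothetical and would depend on how the bad writes can be identified:

```python
# Sketch of a "rewrite the master dataset minus bad records" job.
# Assumes Spark; paths and the is_bad_record predicate are hypothetical.
from pyspark import SparkContext

sc = SparkContext(appName="filter-bad-records")

MASTER_PATH = "hdfs:///data/master"             # existing, corrupted master dataset
FILTERED_PATH = "hdfs:///data/master-filtered"  # new folder written alongside it

def is_bad_record(line):
    # Hypothetical predicate identifying the bad writes,
    # e.g. records produced by a known-bad client.
    return "bad-client-id" in line

master = sc.textFile(MASTER_PATH)
filtered = master.filter(lambda line: not is_bad_record(line))

# Before/after stats, so the filtering can be verified
# before the old folder is ever deleted.
before = master.count()
after = filtered.count()
print("records before: %d, after: %d, removed: %d" % (before, after, before - after))

filtered.saveAsTextFile(FILTERED_PATH)
```

Only once the counts check out do you point the batch jobs at the new folder and delete the old one. The original master dataset is never mutated in place, which is why immutability and "delete the bad data" aren't in conflict.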
On Tue, Aug 19, 2014 at 10:33 PM, Adaryl "Bob" Wakefield, MBA <[email protected]> wrote:

> I need help clearing something up. So I read this:
> http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
>
> And in it he says:
> “Likewise, writing bad data has a clear path to recovery: delete the bad
> data and precompute the queries again. Since data is immutable and the
> master dataset is append-only, writing bad data does not override or
> otherwise destroy good data.”
>
> That sentence makes no sense to me.
>
> Data is immutable –
> master dataset is append-only –
> delete the bad data
>
> What? He gives an example of in the batch layer you store raw files in
> HDFS. My understanding is that you can’t do row level deletes on files in
> HDFS (because it’s append-only). What am I missing here?
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData

--
Twitter: @nathanmarz
http://nathanmarz.com
