Deletion is typically done by running a job that copies the master dataset into a new folder, filtering out the bad data along the way. This is expensive, but that's OK since it's only done in rare circumstances. When I've done this in the past, I've been extra careful before deleting the corrupted master dataset: I collect stats before and after to make sure I've filtered out only the bad stuff.
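To make that concrete, here's a minimal sketch of that kind of filtering job, assuming Spark over HDFS; the paths and the is_bad_record predicate are hypothetical and would depend on how the bad writes can be identified:

```python
# Sketch of a "rewrite the master dataset minus bad records" job.
# Assumes Spark; paths and the is_bad_record predicate are hypothetical.
from pyspark import SparkContext

sc = SparkContext(appName="filter-bad-records")

MASTER_PATH = "hdfs:///data/master"             # existing, corrupted master dataset
FILTERED_PATH = "hdfs:///data/master-filtered"  # new folder written alongside it

def is_bad_record(line):
    # Hypothetical predicate identifying the bad writes,
    # e.g. records produced by a known-bad client.
    return "bad-client-id" in line

master = sc.textFile(MASTER_PATH)
filtered = master.filter(lambda line: not is_bad_record(line))

# Before/after stats, so the filtering can be verified
# before the old folder is ever deleted.
before = master.count()
after = filtered.count()
print("records before: %d, after: %d, removed: %d" % (before, after, before - after))

filtered.saveAsTextFile(FILTERED_PATH)
```

Only once the counts check out do you point the batch jobs at the new folder and delete the old one. The original master dataset is never mutated in place, which is why immutability and "delete the bad data" aren't in conflict.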
On Tue, Aug 19, 2014 at 10:33 PM, Adaryl "Bob" Wakefield, MBA <[email protected]> wrote:

> I need help clearing something up. So I read this:
> http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
>
> And in it he says:
> “Likewise, writing bad data has a clear path to recovery: delete the bad
> data and precompute the queries again. Since data is immutable and the
> master dataset is append-only, writing bad data does not override or
> otherwise destroy good data.”
>
> That sentence makes no sense to me.
>
> Data is immutable –
> master dataset is append-only –
> delete the bad data
>
> What? He gives an example of in the batch layer you store raw files in
> HDFS. My understanding is that you can’t do row level deletes on files in
> HDFS (because it’s append-only). What am I missing here?
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData

--
Twitter: @nathanmarz
http://nathanmarz.com
