If your data is partitioned in HDFS, you can simply use tools like Hive or Pig to read the data in a given partition, filter out the bad records, and overwrite the partition. This kind of data cleansing is common practice; I'm not sure why there is such back and forth on this topic. Of course the HBase approach works too, but that makes more sense if you get a large number of bad records frequently. Otherwise, running a weekly or nightly scan over your data and rewriting it, typically with MapReduce, is the conventional way to do it in HDFS.
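As a rough sketch of what that nightly rewrite could look like: a map-only MapReduce job that copies everything except the records matching your "bad data" rule, writing a cleaned copy that you then swap in for the old partition. The paths and the isBad() rule below are placeholders, not anything specific to your data:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class PartitionCleanse {

      public static class FilterMapper
          extends Mapper<LongWritable, Text, NullWritable, Text> {

        // Placeholder rule: replace with whatever identifies the bad
        // records (e.g. the double-entered invoice ids).
        private boolean isBad(String line) {
          return line.contains("DUPLICATE_MARKER");
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          // Pass through every record that is not flagged as bad.
          if (!isBad(value.toString())) {
            context.write(NullWritable.get(), value);
          }
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "partition cleanse");
        job.setJarByClass(PartitionCleanse.class);
        job.setMapperClass(FilterMapper.class);
        job.setNumReduceTasks(0);                    // map-only job
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // dirty partition
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // cleaned copy
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Once the job finishes you would verify the cleaned output and then move it over the old partition directory (or repoint the Hive partition at it), which is the same effect as the INSERT OVERWRITE you'd get from doing it in Hive directly.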
On Mon, Aug 18, 2014 at 3:06 PM, Adaryl "Bob" Wakefield, MBA <[email protected]> wrote:

> Exception files would only work in the case where a known error is
> thrown. The specific case I was trying to find a solution for is when
> data is the result of bugs in the transactional system or some other
> system that generates data based on human interaction. Here is an
> example:
>
> Customer Service Reps record interactions with clients through a web
> application.
> There is a bug in the web application such that invoices get double
> entered.
> This double entering goes on for days until it’s discovered by someone
> in accounting.
> We now have to go in and remove those double entries because it’s
> messing up every SUM() function result.
>
> In the old world, it was simply a matter of going into the warehouse
> and blowing away those records. I think the solution we came up with is,
> instead of dropping that data into a file, drop it into HBase where you
> can do row level deletes.
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>
> *From:* Jens Scheidtmann <[email protected]>
> *Sent:* Monday, August 18, 2014 12:53 PM
> *To:* [email protected]
> *Subject:* Re: Data cleansing in modern data architecture
>
> Hi Bob,
>
> the answer to your original question depends entirely on the procedures
> and conventions set forth for your data warehouse. So only you can
> answer it.
>
> If you're asking for best practices, it still depends:
> - How large are your files?
> - Have you enough free space for recoding?
> - Are you better off writing an "exception" file?
> - How do you make sure it is always respected?
> - etc.
>
> Best regards,
>
> Jens
>
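For what it's worth, the row-level delete Bob describes is a single call in the HBase client API. A minimal sketch (the "invoices" table name and the row key passed in on the command line are assumptions about your schema, not anything prescribed by HBase):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RemoveDoubleEntry {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table invoices = conn.getTable(TableName.valueOf("invoices"))) {
          // args[0] is the row key of the double-entered invoice record
          Delete delete = new Delete(Bytes.toBytes(args[0]));
          invoices.delete(delete);
        }
      }
    }

So the trade-off is really about write/serving characteristics, not whether deletes are possible; in plain HDFS you rewrite the partition, in HBase you delete the row.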
