Exception files would only work in cases where a known error is thrown. The specific case I was trying to find a solution for is when bad data is the result of bugs in the transactional system, or in some other system that generates data based on human interaction. Here is an example:
Customer Service Reps record interactions with clients through a web application. There is a bug in the web application such that invoices get double entered. This double entering goes on for days until it is discovered by someone in accounting. We now have to go in and remove those double entries because they are messing up every SUM() function result. In the old world, it was simply a matter of going into the warehouse and blowing away those records. I think the solution we came up with is, instead of dropping that data into a file, to drop it into HBase, where you can do row-level deletes (see the rough sketch after the quoted message below).

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: Jens Scheidtmann
Sent: Monday, August 18, 2014 12:53 PM
To: [email protected]
Subject: Re: Data cleansing in modern data architecture

Hi Bob,

the answer to your original question depends entirely on the procedures and conventions set forth for your data warehouse. So only you can answer it.

If you're asking for best practices, it still depends:
- How large are your files?
- Have you enough free space for recoding?
- Are you better off writing an "exception" file?
- How do you make sure it is always respected?
- etc.

Best regards,
Jens
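
For anyone curious what the HBase route looks like in practice, here is a minimal sketch using the standard HBase Java client. It assumes a hypothetical table named "invoices" keyed by invoice ID, and that the duplicate row keys have already been identified; none of those names come from this thread.

// Sketch only: table name, row keys, and class name are hypothetical.
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RemoveDuplicateInvoices {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table invoices = conn.getTable(TableName.valueOf("invoices"))) {

            // Row keys of the double-entered invoices (hypothetical values),
            // e.g. produced by whatever query accounting used to find them.
            List<Delete> deletes = new ArrayList<>();
            for (String rowKey : new String[] {"INV-10001-dup", "INV-10002-dup"}) {
                deletes.add(new Delete(Bytes.toBytes(rowKey)));
            }

            // Row-level delete: HBase writes tombstones immediately, and the
            // data is physically removed at the next major compaction.
            invoices.delete(deletes);
        }
    }
}

The point of the sketch is the contrast with plain HDFS files: since those files are immutable, pulling a handful of bad rows means rewriting whole files, whereas HBase lets you delete just the offending rows in place.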
