Exception files would only work in the case where a known error is thrown. The 
specific case I was trying to find a solution for is when bad data is the 
result of bugs in the transactional system, or in some other system that 
generates data based on human interaction. Here is an example:

- Customer Service Reps record interactions with clients through a web 
application.
- There is a bug in the web application such that invoices get double entered.
- The double entering goes on for days until someone in accounting discovers 
it.
- We now have to go in and remove those double entries because they're messing 
up every SUM() result.

In the old world, it was simply a matter of going into the warehouse and 
blowing away those records. I think the solution we came up with is, instead 
of dropping that data into a file, drop it into HBase, where you can do 
row-level deletes.
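
Just to make that concrete, a rough sketch with the HBase Java client (1.x-style 
API) might look like the below. The "invoices" table name and passing the 
duplicate row keys in as arguments are assumptions for illustration, not part 
of any actual schema:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RemoveDoubleEntries {
    public static void main(String[] args) throws IOException {
        // Standard client setup; picks up hbase-site.xml from the classpath.
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             // "invoices" is a hypothetical table name.
             Table invoices = conn.getTable(TableName.valueOf("invoices"))) {
            // Row keys of the double-entered invoices, identified upstream
            // (e.g., by whoever in accounting found the duplicates).
            List<Delete> deletes = new ArrayList<>();
            for (String rowKey : args) {
                // Deletes the whole row (all column families) for that key.
                deletes.add(new Delete(Bytes.toBytes(rowKey)));
            }
            invoices.delete(deletes); // batch of row-level deletes
        }
    }
}

Once the bad row keys are identified, the cleanup is a batch of keyed deletes 
instead of rewriting immutable files. (Under the hood HBase writes tombstone 
markers and reclaims the space at the next major compaction, but the rows 
disappear from reads right away.)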

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: Jens Scheidtmann 
Sent: Monday, August 18, 2014 12:53 PM
To: [email protected] 
Subject: Re: Data cleansing in modern data architecture

Hi Bob, 


The answer to your original question depends entirely on the procedures and 
conventions set forth for your data warehouse. So only you can answer it.


If you're asking for best practices, it still depends:

- How large are your files?

- Have you enough free space for recoding?

- Are you better off writing an "exception" file?

- How do you make sure it is always respected?

- etc.


Best regards,

Jens
