In the old world, data cleaning was a large part of the data warehouse load. Now that we’re working in a schemaless environment, I’m not sure where data cleaning is supposed to take place. NoSQL sounds fun because theoretically you just drop everything in, but the transactional systems that generate the data are still full of bugs and still create junk data.
My question is: where does data cleaning/master data management/CDI belong in a modern data architecture? Before it hits Hadoop? After? B.
