I quickly went over the Wikipedia page for temporal databases. It just sounds like a slowly changing dimension. Data being valid at different points in time isn’t weird to me; most WHERE clauses against a data warehouse are going to use the date dimension. What we’re talking about here is data that was never correct at any point in time. After the application has been debugged, there is zero reason to retain invoices (or any other object) that should never have existed in the first place. If you were to just create a view that didn’t expose the bad records, it seems to me that the WHERE clause of that view would grow to unmanageable proportions as various data bugs popped up over the life of the application.

I should put an asterisk by all of this and say that I haven’t mastered all of these various applications yet. My current understanding of Hive is that, while it’s a “warehouse solution,” it isn’t actually separate data: it just stores table metadata, and the data itself is stored on HDFS, correct? That’s confusing, because the documentation talks about loading data into tables, yet there is also no row-level delete functionality.

Side note: There are a lot of books out there on the mechanics of Hadoop and 
the individual projects, but there is very little on how to actually manage 
data in Hadoop. 

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: Bertrand Dechoux 
Sent: Sunday, August 10, 2014 10:04 AM
To: [email protected] 
Subject: Re: Data cleansing in modern data architecture

Well, keeping bad data has its uses too. I assume you know about temporal databases.

Back to your use case: if you only need to remove a few records from HDFS files, the easiest approach might be to do it while reading. It is a view, of course, but that doesn't mean you need to write it back to HDFS. All your data analysis can be represented as a dataflow. Whether you need to save the data at a given point of the flow is your call. The tradeoff is, of course, between the time gained for later analysis and the time required to write the state durably (to HDFS).
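
For what it's worth, here is a minimal sketch of that read-time filtering as a plain MapReduce mapper. The field layout and the list of bad invoice ids are made up; the same filter could just as well be expressed in Hive, Pig or Cascading.

// Reads raw invoice lines and silently drops the known-bad records,
// so downstream jobs never see them; nothing is rewritten on HDFS.
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DropBadInvoicesMapper
    extends Mapper<LongWritable, Text, NullWritable, Text> {

  // Hypothetical list of offending invoice ids; it could also be read
  // from a side file or the distributed cache.
  private static final Set<String> BAD_INVOICE_IDS =
      new HashSet<String>(Arrays.asList("INV-1001", "INV-1002"));

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Assume a tab-delimited line whose first field is the invoice id.
    String invoiceId = line.toString().split("\t", -1)[0];
    if (!BAD_INVOICE_IDS.contains(invoiceId)) {
      context.write(NullWritable.get(), line); // keep only the good records
    }
  }
}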

Bertrand Dechoux



On Sun, Aug 10, 2014 at 4:32 PM, Sriram Ramachandrasekaran 
<[email protected]> wrote:

  Ok. If you think the noise level in the data is going to be that low, then creating the view is probably costly and not worth it.
  HDFS files are append-only, so there's no point in writing the transactions out as raw HDFS files and trying to perform analytics directly on top of them.

  Instead, you could go with HBase, with the rowkeys being the keys by which you identify and resolve these transactional errors. That is, if a faulty transaction were logged, it would have a transaction ID along with its associated data. If you then found out that a particular invoice is offending, you could simply remove it from HBase using the transaction ID (the rowkey).
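
  Roughly something like this with the plain HBase Java client (the table name and invoice id are placeholders):

// Deleting one offending invoice by rowkey; nothing else in the table
// is touched, and no HDFS files have to be rewritten by hand.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class RemoveBadInvoice {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "transactions"); // rowkey = transaction ID
    try {
      table.delete(new Delete(Bytes.toBytes("INV-1001"))); // the offending invoice
    } finally {
      table.close();
    }
  }
}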

  But if you want to use the same table for running different reports, it might not work out, because most HBase operations depend on the rowkey, and with the transaction ID as the rowkey there's no way your reports can leverage it. So you need to decide which is the costlier operation, removing noise or running _more_ ad hoc reports, and whether to use a different table (which is a view again) for reports, etc.





  On Sun, Aug 10, 2014 at 2:02 PM, Adaryl "Bob" Wakefield, MBA 
<[email protected]> wrote:

    It’s a lot of theory right now, so let me give you the full background and see if we can refine the answer a bit more.

    I’ve had a lot of clients with data warehouses that just weren’t functional for various reasons. I’m researching Hadoop to try to figure out a way to totally eliminate traditional data warehouses. I know all the arguments for keeping them around, and I’m not impressed with any of them. I’ve noticed for a while that traditional data storage methods just aren’t up to the tasks we’re asking data to handle these days.

    I’ve got MOST of it figured out. I know how to store and deliver analytics using all the various tools within the Apache project (and some NOT in the Apache project). What I haven’t figured out is how to do data cleansing or master data management, both of which are hard to do if you can’t change anything.

    So let’s say there is a transactional system. It’s a web application that is the business’s main source of revenue. All the activity of the user on the website is easily structured (so basically we’re not dealing with unstructured data). The nature of the data is financial.

    The pipeline is fairly straightforward. The data is extracted from the transactional system and placed into a Hadoop environment. From there, it’s exposed through Hive so that non-technical business analysts with SQL skills can do what they need to do. Pretty typical, right?

    The problem is that the web app is not perfect and occasionally produces junk data. Nothing obvious; it may be a few days before the error is noticed. An example would be phantom invoices. Those invoices get into Hadoop, and a few days later an analyst notices that the invoice figures for some period are inflated.

    Once we identify the offending records, there is NO reason for them to remain in the system; they’re meaningless junk data. Those records are of zero value. I encounter this scenario in the real world quite often. In the old world, we would just blow away the offending records. Writing a view just to skip over a couple of records or exclude a few dozen doesn’t make much sense. It’s better to just blow these records away; I’m just not certain what the best way to accomplish that is in the new world.

    Adaryl "Bob" Wakefield, MBA
    Principal
    Mass Street Analytics
    913.938.6685
    www.linkedin.com/in/bobwakefieldmba
    Twitter: @BobLovesData

    From: Sriram Ramachandrasekaran 
    Sent: Saturday, August 09, 2014 11:55 PM
    To: [email protected] 
    Subject: Re: Data cleansing in modern data architecture

    While I may not have enough context on your entire processing pipeline, here are my thoughts.

    1. It's always useful to keep the raw data, irrespective of whether it was right or wrong. The way to look at it is that it's the source of truth at timestamp t.
    2. Note that you only know the data at timestamp t for an id X was wrong because subsequent information about X conflicts with what was recorded at t, or because some manual debugging finds it out.

    All systems that do reporting/analytics are better off not meddling with the raw data. There should be processed or computed views of this data that massage it, get rid of noisy data, merge duplicate entries, etc., and finally produce an output that's suitable for your reports/analytics. So your idea of writing the transaction logs to HDFS is fine (unless you are twisting your systems to get it that way); you just need to introduce one more layer of indirection, which holds the business logic to handle noise/errors like this.

    For your specific case, you could have a transaction-processing job which produces a view that takes care of squashing transactions based on an id (something that makes sense in your system) and then applies the business logic for handling the bugs/discrepancies in them. Your views could be loaded into a nice columnar store for faster query retrieval (if you have pointed queries based on a key); otherwise a different store would be needed. Yes, this has the overhead of running the view-creation job, but I think the ability to go back to the raw data and investigate what happened there is worth it.
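
    A bare-bones sketch of what that squashing step could look like as a reduce phase; the record layout and the position of the timestamp field are assumptions on my part (a companion mapper would emit (transactionId, record) pairs):

// Groups the raw records by transaction id and keeps only the latest
// version of each; per-bug business rules would slot in here as well.
import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SquashByTransactionIdReducer
    extends Reducer<Text, Text, NullWritable, Text> {

  @Override
  protected void reduce(Text txnId, Iterable<Text> records, Context context)
      throws IOException, InterruptedException {
    String latest = null;
    long latestTs = Long.MIN_VALUE;
    for (Text record : records) {
      // Assume a tab-delimited record whose second field is an epoch timestamp.
      String[] fields = record.toString().split("\t", -1);
      long ts = Long.parseLong(fields[1]);
      if (ts > latestTs) {
        latestTs = ts;
        latest = record.toString();
      }
    }
    if (latest != null) {
      context.write(NullWritable.get(), new Text(latest));
    }
  }
}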

    Your approach of structuring it and storing it in HBase is also fine, as long as you keep the concerns separate (if your write/read workloads are poles apart).

    Hope this helps.






    On Sun, Aug 10, 2014 at 9:36 AM, Adaryl "Bob" Wakefield, MBA 
<[email protected]> wrote:

      Or, as an alternative: since HBase uses HDFS to store its data, can we get around the no-editing-files rule by dropping structured data into HBase? That way, we have data in HDFS that can be deleted. Any real problem with that idea?

      Adaryl "Bob" Wakefield, MBA
      Principal
      Mass Street Analytics
      913.938.6685
      www.linkedin.com/in/bobwakefieldmba
      Twitter: @BobLovesData

      From: Adaryl "Bob" Wakefield, MBA 
      Sent: Saturday, August 09, 2014 8:55 PM
      To: [email protected] 
      Subject: Re: Data cleansing in modern data architecture

      Answer: No, we can’t get rid of bad records; we have to go back and rebuild the entire file. We can’t edit records, but we can get rid of entire files, right? This would suggest that appending data to files isn’t that great an idea. It sounds like it would be more appropriate to cut a Hadoop data load up into periodic files (days, months, etc.) that can easily be rebuilt should errors occur....
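
      Something like this is what I have in mind, assuming I understand the mechanics correctly (the path layout is just an example):

// Each day's extract lands in its own HDFS directory, so a day that
// contained phantom invoices can be dropped and re-loaded wholesale.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RebuildDailyLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Drop only the bad day's files...
    Path badDay = new Path("/warehouse/invoices/dt=2014-08-05");
    fs.delete(badDay, true); // recursive

    // ...then re-run the extract for that day from the transactional
    // system and write a corrected copy back to the same directory.
  }
}

      If Hive sits on top of a layout like that, each directory can map to a partition, so rebuilding a bad day amounts to dropping and re-adding a single partition.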

      Adaryl "Bob" Wakefield, MBA
      Principal
      Mass Street Analytics
      913.938.6685
      www.linkedin.com/in/bobwakefieldmba
      Twitter: @BobLovesData

      From: Adaryl "Bob" Wakefield, MBA 
      Sent: Saturday, August 09, 2014 4:01 AM
      To: [email protected] 
      Subject: Re: Data cleansing in modern data architecture

      I’m sorry but I have to revisit this again. Going through the reply below 
I realized that I didn’t quite get my question answered. Let me be more 
explicit with the scenario.

      There is a bug in the transactional system.
      The data gets written to HDFS where it winds up in Hive.
      Somebody notices that their report is off/the numbers don’t look right.
      We investigate and find the bug in the transactional system.

      Question: Can we then go back into HDFS and rid ourselves of the bad 
records? If not, what is the recommended course of action?

      Adaryl "Bob" Wakefield, MBA
      Principal
      Mass Street Analytics
      913.938.6685
      www.linkedin.com/in/bobwakefieldmba

      From: Shahab Yunus 
      Sent: Sunday, July 20, 2014 4:20 PM
      To: [email protected] 
      Subject: Re: Data cleansing in modern data architecture

      I am assuming you mean the batch jobs that are/were used in the old world for data cleansing.

      As far as I understand, there is no hard and fast rule for it; it depends on the functional and system requirements of the use case.

      It is also dependent on the technology being used and how it manages 
'deletion'.

      E.g., in HBase or Cassandra you can write batch jobs which clean, correct, or remove unwanted or incorrect data, and then the underlying stores usually have a concept of compaction, which not only defragments data files but also, at that point, removes from disk all the entries marked as deleted.
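
      For example, a rough sketch of such a cleanup batch job with the HBase Java client (the table name and row keys are made up; in Cassandra the equivalent would be deletes followed by a compaction):

// Batch cleanup: mark the bad rows as deleted, then request a major
// compaction so the tombstoned entries are eventually removed from disk.
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class CleanupBatchJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    HTable table = new HTable(conf, "invoices");
    List<Delete> deletes = new ArrayList<Delete>();
    for (String badId : new String[] {"INV-1001", "INV-1002"}) {
      deletes.add(new Delete(Bytes.toBytes(badId)));
    }
    table.delete(deletes); // writes tombstones only
    table.close();

    HBaseAdmin admin = new HBaseAdmin(conf);
    admin.majorCompact("invoices"); // asynchronous request
    admin.close();
  }
}

      Keep in mind that majorCompact only requests the compaction; it runs asynchronously and, as noted below, is itself a heavy operation.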

      But there are considerations to be aware of, given that compaction is a heavy process, and in some cases (e.g. Cassandra) there can be problems when there is too much data to be removed. Not only that: in some cases, the marked-for-deletion data can slow down normal operations of the data store until it is actually deleted/compacted.

      In HBase's case one can also leverage the versioning mechanism: the afore-mentioned batch job can simply overwrite the same row key, and the previous version would no longer be the latest. If the max-versions parameter is configured as 1, then no previous version would be maintained (physically it would still be there and would be removed at compaction time, but it would not be queryable).
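
      A small sketch of that versioning approach with the (0.98-era) HBase Java client; the table, column family, and values are all made up:

// One version per cell: overwriting a row replaces the bad value, and the
// older cell disappears for good at the next major compaction.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class OverwriteWithSingleVersion {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // Create the table with max versions = 1 on the column family.
    HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("invoices"));
    HColumnDescriptor family = new HColumnDescriptor("d");
    family.setMaxVersions(1);
    desc.addFamily(family);
    HBaseAdmin admin = new HBaseAdmin(conf);
    admin.createTable(desc);
    admin.close();

    // Later, the batch job simply overwrites the same row key with the
    // corrected value; the previous cell is no longer visible.
    HTable table = new HTable(conf, "invoices");
    Put put = new Put(Bytes.toBytes("INV-1001"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("amount"), Bytes.toBytes("0.00"));
    table.put(put);
    table.close();
  }
}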

      In the end, cleansing can basically be done before or after loading, but given the append-only and no-hard-delete design approaches of most NoSQL stores, I would say it is easier to do the cleaning before the data is loaded into the NoSQL store. Of course, it bears repeating that it depends on the use case.

      Having said that, on a side note and a bit off-topic, this reminds me of the Lambda Architecture, which combines batch and real-time computation for big data using various technologies. It uses the idea of constant periodic refreshes to reload the data, and within each periodic refresh the expectation is that any invalid older data will be corrected and overwritten by the new refresh load. Thus the 'batch part' of the LA basically takes care of data cleansing by reloading everything. But the LA is mostly for those systems which are OK with eventually consistent behavior, and it might not be suitable for some systems.

      Regards,
      Shahab



      On Sun, Jul 20, 2014 at 2:36 PM, Adaryl "Bob" Wakefield, MBA 
<[email protected]> wrote:

        In the old world, data cleaning used to be a large part of the data warehouse load. Now that we’re working in a schemaless environment, I’m not sure where data cleansing is supposed to take place. NoSQL sounds fun because, theoretically, you just drop everything in, but the transactional systems that generate the data are still full of bugs and create junk data.

        My question is: where does data cleaning/master data management/CDI belong in a modern data architecture? Before it hits Hadoop? After?

        B.





    -- 
    It's just about how deep your longing is!





  -- 
  It's just about how deep your longing is!
