Spark or Tachyon: capture data lineage

Jerry Lam Fri, 02 Jan 2015 12:27:06 -0800

Hi Spark developers,

I think it would be useful to extract data lineage information from a data
processing pipeline, and I assume that Spark/Tachyon keeps this information
somewhere. For instance, suppose a pipeline uses data sources A and B to
produce C, and C is then used by another process to produce D and E.
Assuming A, B, C, D, and E are all stored on disk, it would be very useful
to capture this information while using Spark/Tachyon and then query it
later. For example, "give me the datasets that produce E" should return a
graph like (A and B)->C->E.
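
To make the kind of query I have in mind concrete, here is a minimal sketch
in plain Python. It does not use any Spark or Tachyon API; the LineageGraph
class and its record/upstream/edges_to methods are hypothetical names made up
for illustration, and the sketch assumes each pipeline step reports its
inputs and output by name.

```python
from collections import defaultdict


class LineageGraph:
    """Hypothetical in-memory lineage store: maps each dataset to its direct inputs."""

    def __init__(self):
        self._parents = defaultdict(set)

    def record(self, inputs, output):
        """Record that one pipeline step read `inputs` and wrote `output`."""
        self._parents[output].update(inputs)

    def upstream(self, dataset):
        """Return every dataset that directly or indirectly produces `dataset`."""
        seen = set()
        stack = [dataset]
        while stack:
            current = stack.pop()
            for parent in self._parents.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def edges_to(self, dataset):
        """Return the (input, output) edges of the subgraph that leads to `dataset`."""
        nodes = self.upstream(dataset) | {dataset}
        return sorted((p, c) for c in nodes for p in self._parents.get(c, ()))


# The pipeline from the example: A and B produce C; C produces D and E.
lineage = LineageGraph()
lineage.record(["A", "B"], "C")
lineage.record(["C"], "D")
lineage.record(["C"], "E")

print(sorted(lineage.upstream("E")))  # ['A', 'B', 'C']
print(lineage.edges_to("E"))          # [('A', 'C'), ('B', 'C'), ('C', 'E')]
```

D does not appear in the result for E because it is a sibling output of C,
not an ancestor of E.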


Is this already possible with Spark/Tachyon? If not, do you think it could
be done? Would anyone mind sharing their experience capturing data lineage
in a data processing pipeline?

Best Regards,

Jerry
