Hi Spark developers,

I was thinking it would be nice to extract data lineage information from a data processing pipeline. I assume that Spark/Tachyon keeps this information somewhere. For instance, a data processing pipeline uses data sources A and B to produce C. C is then used by another process to produce D and E. Assuming A, B, C, D, and E are all stored on disk, it would be very useful to have a way to capture this information while using Spark/Tachyon and to query it afterwards. For example, asking for the datasets that produce E should give me a graph like (A and B)->C->E.
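To make the kind of query concrete, here is a minimal sketch in plain Python. It does not use any Spark or Tachyon API; the `produced_from` mapping and the `upstream` function are hypothetical names made up for illustration, and the mapping is assumed to have been recorded by the pipeline somehow.

```python
from collections import deque

# Hypothetical lineage record (an assumption, not something Spark/Tachyon exposes):
# each dataset maps to the datasets it was directly derived from.
produced_from = {
    "C": ["A", "B"],
    "D": ["C"],
    "E": ["C"],
}

def upstream(dataset, lineage):
    """Return the (source, target) edges of the subgraph that leads to `dataset`."""
    edges = []
    seen = {dataset}
    queue = deque([dataset])
    while queue:
        target = queue.popleft()
        # Datasets with no entry are treated as raw inputs with no parents.
        for source in lineage.get(target, []):
            edges.append((source, target))
            if source not in seen:
                seen.add(source)
                queue.append(source)
    return edges

# Edges leading to E: C->E, then A->C and B->C, i.e. (A and B)->C->E.
print(upstream("E", produced_from))
```

Note that D does not show up in the result, since it is produced from C but does not contribute to E.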
Is this already possible with Spark/Tachyon? If not, do you think it would be feasible? Would anyone mind sharing their experience in capturing data lineage in a data processing pipeline?

Best Regards,
Jerry