Agreed with Jerry. Aside from Tachyon, seeing this for general debugging
would be very helpful.

Haoyuan, is that feature you are referring to related to

In the interim, I've found the "toDebugString()" method useful, although it
renders the lineage as a tree rather than as a more general DAG, so it
doesn't always capture the flow in the way I'd like to review it. Example:

>>> a = sc.parallelize(range(1,1000)).map(lambda x: (x, x*x)).filter(lambda x: x[1]>1000)
>>> b = a.join(a)
>>> print b.toDebugString()
(16) PythonRDD[19] at RDD at PythonRDD.scala:43
 |   MappedRDD[17] at values at
 |   ShuffledRDD[16] at partitionBy at
 +-(16) PairwiseRDD[15] at RDD at PythonRDD.scala:261
    |   PythonRDD[14] at RDD at PythonRDD.scala:43
    |   UnionRDD[13] at union at
    |   PythonRDD[11] at RDD at PythonRDD.scala:43
    |   ParallelCollectionRDD[10] at parallelize at PythonRDD.scala:315
    |   PythonRDD[12] at RDD at PythonRDD.scala:43
    |   ParallelCollectionRDD[10] at parallelize at PythonRDD.scala:315
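
For the dataset-level question Jerry raised (which inputs produce E?), here is
a minimal sketch, assuming the pipeline records its own edges by hand. Nothing
below is a Spark or Tachyon API; the names LINEAGE and upstream are
hypothetical and only illustrate the kind of query being asked for:

```python
# Hypothetical, hand-maintained lineage: each dataset maps to the datasets
# it was derived from. Neither Spark nor Tachyon produces this structure.
LINEAGE = {
    "C": ["A", "B"],
    "D": ["C"],
    "E": ["C"],
}


def upstream(lineage, target):
    """Return the (parent, child) edges of the DAG that produces `target`."""
    edges = set()
    seen = set()
    stack = [target]
    while stack:
        child = stack.pop()
        if child in seen:
            # A dataset reachable by several paths is visited once,
            # which is what makes the result a DAG rather than a tree.
            continue
        seen.add(child)
        for parent in lineage.get(child, []):
            edges.add((parent, child))
            stack.append(parent)
    return edges


print(sorted(upstream(LINEAGE, "E")))
# [('A', 'C'), ('B', 'C'), ('C', 'E')]
```

The result corresponds to the (A and B)->C->E graph from the original
question; D is left out because it does not feed into E.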


On Fri, Jan 2, 2015 at 12:32 PM, Haoyuan Li wrote:

> Jerry,
> Great question. Spark and Tachyon capture lineage information at different
> granularities. We are working on an integration between Spark/Tachyon about
> this. Hope to get it ready to be released soon.
> Best,
> Haoyuan
> On Fri, Jan 2, 2015 at 12:24 PM, Jerry Lam wrote:
>> Hi spark developers,
>> I was thinking it would be nice to extract the data lineage information
>> from a data processing pipeline. I assume that spark/tachyon keeps this
>> information somewhere. For instance, a data processing pipeline uses
>> datasource A and B to produce C. C is then used by another process to
>> produce D and E. Assuming A, B, C, D, E are stored on disk, it would be so
>> useful if there is a way to capture this information when we are using
>> spark/tachyon to query this data lineage information. For example, give me
>> datasets that produce E. It should give me a graph like (A and B)->C->E.
>> Is this something already possible with spark/tachyon? If not, do you
>> think it is possible? Would anyone mind sharing their experience in
>> capturing the data lineage in a data processing pipeline?
>> Best Regards,
>> Jerry
> --
> Haoyuan Li
> AMPLab, EECS, UC Berkeley

