Agreed with Jerry. Aside from Tachyon, seeing this for general debugging would be very helpful.
Haoyuan, is the feature you are referring to related to https://issues.apache.org/jira/browse/SPARK-975?

In the interim, I've found the "toDebugString()" method useful (but it renders execution as a tree rather than as a more general DAG, so it doesn't always capture the flow in the way I'd like to review it). Example:

>>> a = sc.parallelize(range(1,1000)).map(lambda x: (x, x*x)).filter(lambda x: x[1]>1000)
>>> b = a.join(a)
>>> print b.toDebugString()
(16) PythonRDD[19] at RDD at PythonRDD.scala:43
 |   MappedRDD[17] at values at NativeMethodAccessorImpl.java:-2
 |   ShuffledRDD[16] at partitionBy at NativeMethodAccessorImpl.java:-2
 +-(16) PairwiseRDD[15] at RDD at PythonRDD.scala:261
    |   PythonRDD[14] at RDD at PythonRDD.scala:43
    |   UnionRDD[13] at union at NativeMethodAccessorImpl.java:-2
    |   PythonRDD[11] at RDD at PythonRDD.scala:43
    |   ParallelCollectionRDD[10] at parallelize at PythonRDD.scala:315
    |   PythonRDD[12] at RDD at PythonRDD.scala:43
    |   ParallelCollectionRDD[10] at parallelize at PythonRDD.scala:315

Best,
-Sven

On Fri, Jan 2, 2015 at 12:32 PM, Haoyuan Li <haoyuan...@gmail.com> wrote:

> Jerry,
>
> Great question. Spark and Tachyon capture lineage information at different
> granularities. We are working on an integration between Spark/Tachyon about
> this. Hope to get it ready to be released soon.
>
> Best,
>
> Haoyuan
>
> On Fri, Jan 2, 2015 at 12:24 PM, Jerry Lam <chiling...@gmail.com> wrote:
>
>> Hi spark developers,
>>
>> I was thinking it would be nice to extract the data lineage information
>> from a data processing pipeline. I assume that spark/tachyon keeps this
>> information somewhere. For instance, a data processing pipeline uses
>> datasource A and B to produce C. C is then used by another process to
>> produce D and E. Assuming A, B, C, D, E are stored on disk, it would be so
>> useful if there is a way to capture this information when we are using
>> spark/tachyon to query this data lineage information. For example, give me
>> datasets that produce E. It should give me a graph like (A and B)->C->E.
>>
>> Is this something already possible with spark/tachyon? If not, do you
>> think it is possible? Does anyone mind to share their experience in
>> capturing the data lineage in a data processing pipeline?
>>
>> Best Regards,
>>
>> Jerry
>>
>
>
>
> --
> Haoyuan Li
> AMPLab, EECS, UC Berkeley
> http://www.cs.berkeley.edu/~haoyuan/
>

--
http://sites.google.com/site/krasser/?utm_source=sig
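
For reference, the "give me datasets that produce E" query from Jerry's message can be sketched in plain Python. This is only an illustration under stated assumptions: the LINEAGE dictionary and the upstream() and edges_into() helpers are hypothetical names made up here, the dataset names are the A..E from the thread, and none of it is a Spark or Tachyon API. It assumes the pipeline records, for each dataset, which datasets it was derived from.

```python
from collections import deque

# Hypothetical lineage record (not a Spark/Tachyon structure): each dataset
# maps to the datasets it was directly derived from, as in the thread's
# example where A and B produce C, and C produces D and E.
LINEAGE = {
    "C": ["A", "B"],
    "D": ["C"],
    "E": ["C"],
}


def upstream(dataset, lineage):
    """Return the set of datasets that directly or indirectly produce `dataset`."""
    seen = set()
    queue = deque(lineage.get(dataset, []))
    while queue:
        parent = queue.popleft()
        if parent in seen:
            continue  # shared parents in a DAG are visited only once
        seen.add(parent)
        queue.extend(lineage.get(parent, []))
    return seen


def edges_into(dataset, lineage):
    """Return the sorted (parent, child) edges of the sub-DAG that leads to `dataset`."""
    nodes = upstream(dataset, lineage) | {dataset}
    return sorted(
        (parent, child)
        for child in nodes
        for parent in lineage.get(child, [])
    )


if __name__ == "__main__":
    print(sorted(upstream("E", LINEAGE)))
    print(edges_into("E", LINEAGE))
```

Because the record is kept as an edge list rather than a nested tree, a parent shared by several children appears once, which is the DAG view that the tree rendering of toDebugString() does not give.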