I agree with Jerry. Even aside from Tachyon, seeing this lineage information
would be very helpful for general debugging.

Haoyuan, is the feature you are referring to related to
https://issues.apache.org/jira/browse/SPARK-975?

In the interim, I've found the "toDebugString()" method useful. It renders
execution as a tree rather than as a more general DAG, though, so it doesn't
always capture the flow the way I'd like to review it. Example:

>>> a = sc.parallelize(range(1,1000)).map(lambda x: (x, x*x)).filter(lambda x: x[1]>1000)
>>> b = a.join(a)
>>> print b.toDebugString()
(16) PythonRDD[19] at RDD at PythonRDD.scala:43
 |   MappedRDD[17] at values at NativeMethodAccessorImpl.java:-2
 |   ShuffledRDD[16] at partitionBy at NativeMethodAccessorImpl.java:-2
 +-(16) PairwiseRDD[15] at RDD at PythonRDD.scala:261
    |   PythonRDD[14] at RDD at PythonRDD.scala:43
    |   UnionRDD[13] at union at NativeMethodAccessorImpl.java:-2
    |   PythonRDD[11] at RDD at PythonRDD.scala:43
    |   ParallelCollectionRDD[10] at parallelize at PythonRDD.scala:315
    |   PythonRDD[12] at RDD at PythonRDD.scala:43
    |   ParallelCollectionRDD[10] at parallelize at PythonRDD.scala:315
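
To illustrate the DAG view I have in mind, here is a minimal sketch in plain Python (no Spark needed). The parent map is transcribed by hand from the output above and reflects my reading of the tree, in particular the assumption that UnionRDD[13] has both PythonRDD[11] and PythonRDD[12] as parents. The names PARENTS, dag_edges and sources are hypothetical helpers, not part of any Spark API.

```python
# Parent map transcribed by hand from the toDebugString() output above.
# Assumption: UnionRDD[13] has two parents (PythonRDD[11] and PythonRDD[12]),
# which the tree rendering lists one after the other.
PARENTS = {
    "PythonRDD[19]": ["MappedRDD[17]"],
    "MappedRDD[17]": ["ShuffledRDD[16]"],
    "ShuffledRDD[16]": ["PairwiseRDD[15]"],
    "PairwiseRDD[15]": ["PythonRDD[14]"],
    "PythonRDD[14]": ["UnionRDD[13]"],
    "UnionRDD[13]": ["PythonRDD[11]", "PythonRDD[12]"],
    "PythonRDD[11]": ["ParallelCollectionRDD[10]"],
    "PythonRDD[12]": ["ParallelCollectionRDD[10]"],
    "ParallelCollectionRDD[10]": [],
}


def dag_edges(parents, root):
    """Return every (parent, child) edge reachable from root, each exactly once."""
    edges, seen, stack = [], set(), [root]
    while stack:
        node = stack.pop()
        if node in seen:
            # A shared ancestor (e.g. ParallelCollectionRDD[10]) is expanded once.
            continue
        seen.add(node)
        for parent in parents.get(node, []):
            edges.append((parent, node))
            stack.append(parent)
    return edges


def sources(parents, root):
    """Return the RDDs without parents that root ultimately depends on."""
    nodes = {root} | {parent for parent, _ in dag_edges(parents, root)}
    return {node for node in nodes if not parents.get(node)}
```

Calling dag_edges(PARENTS, "PythonRDD[19]") lists ParallelCollectionRDD[10] as a single node with two outgoing edges instead of repeating it the way the tree does. The same kind of upstream traversal would answer a question like Jerry's "give me datasets that produce E", given a parent map over datasets instead of RDDs.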

Best,
-Sven

On Fri, Jan 2, 2015 at 12:32 PM, Haoyuan Li <haoyuan...@gmail.com> wrote:

> Jerry,
>
> Great question. Spark and Tachyon capture lineage information at different
> granularities. We are working on an integration between Spark and Tachyon
> for this, and hope to have it ready for release soon.
>
> Best,
>
> Haoyuan
>
> On Fri, Jan 2, 2015 at 12:24 PM, Jerry Lam <chiling...@gmail.com> wrote:
>
>> Hi spark developers,
>>
>> I was thinking it would be nice to extract the data lineage information
>> from a data processing pipeline. I assume that spark/tachyon keeps this
>> information somewhere. For instance, a data processing pipeline uses
>> datasource A and B to produce C. C is then used by another process to
>> produce D and E. Assuming A, B, C, D, and E are stored on disk, it would be
>> so useful if there were a way to capture this information when we are using
>> spark/tachyon, and to query this data lineage information. For example, give
>> me the datasets that produce E. It should give me a graph like (A and B)->C->E.
>>
>> Is this something already possible with spark/tachyon? If not, do you
>> think it is possible? Would anyone mind sharing their experience in
>> capturing the data lineage in a data processing pipeline?
>>
>> Best Regards,
>>
>> Jerry
>>
>
>
>
> --
> Haoyuan Li
> AMPLab, EECS, UC Berkeley
> http://www.cs.berkeley.edu/~haoyuan/
>



-- 
http://sites.google.com/site/krasser/?utm_source=sig
