Thanks a lot, but I have already tried the second way. The problem with it is how to identify a particular RDD from source to sink (as we can do by passing a msg ID in Storm). For that, I updated the RDD and added a msgID (as a static variable), but while dumping them to a file, some of the tuples of the RDD fail or go missing (approx. 3000, at a data rate of approx. 1500 tuples/sec).
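A minimal sketch of the alternative to a static msgID, in plain Python rather than Spark code: each record carries its own ID and ingest timestamp, so it can be followed from source to sink and its latency computed at the last step. The names `TaggedRecord`, `tag`, `transform` and `latency_at_sink` are hypothetical and only illustrate the idea; the timestamps are comparable only if the clocks of the machines involved are reasonably in sync.

```python
import time
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedRecord:
    msg_id: str       # unique per record, travels with the payload
    ingest_ts: float  # wall-clock seconds when the record entered the system
    payload: str


def tag(payload, now=time.time):
    """Attach a fresh ID and ingest timestamp at the source (hypothetical helper)."""
    return TaggedRecord(uuid.uuid4().hex, now(), payload)


def transform(record):
    """Stand-in for an intermediate step: changes the payload, keeps the tags."""
    return TaggedRecord(record.msg_id, record.ingest_ts, record.payload.upper())


def latency_at_sink(record, now=time.time):
    """Seconds elapsed between ingest and the final step for this record."""
    return now() - record.ingest_ts
```

Because the ID and timestamp are fields of the record itself and not shared state, every intermediate step only has to pass them through unchanged.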
On Fri, Jun 19, 2015 at 2:50 AM, Tathagata Das <t...@databricks.com> wrote:

> Couple of ways.
>
> 1. Easy but approx way: Find scheduling delay and processing time using
> StreamingListener interface, and then calculate "end-to-end delay = 0.5 *
> batch interval + scheduling delay + processing time". The 0.5 * batch
> interval is the approx average batching delay across all the records in the
> batch.
>
> 2. Hard but precise way: You could build a custom receiver that embeds the
> current timestamp in the records, and then compare them with the timestamp
> at the final step of the records. Assuming the executor and driver clocks
> are reasonably in sync, this will measure the latency between the time the
> record is received by the system and the time the result from the record is
> available.
>
> On Thu, Jun 18, 2015 at 2:12 PM, anshu shukla <anshushuk...@gmail.com>
> wrote:
>
>> Sorry, I missed the LATENCY word. For a large streaming query, how to
>> find the time taken by a particular RDD to travel from the initial
>> D-STREAM to the final/last D-STREAM?
>> Help please!!
>>
>> On Fri, Jun 19, 2015 at 12:40 AM, Tathagata Das <t...@databricks.com>
>> wrote:
>>
>>> It's not clear what you are asking. Find "what" among RDD?
>>>
>>> On Thu, Jun 18, 2015 at 11:24 AM, anshu shukla <anshushuk...@gmail.com>
>>> wrote:
>>>
>>>> Is there any fixed way to find among RDD in stream processing systems,
>>>> in the distributed set-up?
>>>>
>>>> --
>>>> Thanks & Regards,
>>>> Anshu Shukla
>>
>> --
>> Thanks & Regards,
>> Anshu Shukla

--
Thanks & Regards,
Anshu Shukla
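
For reference, the approximate formula from the quoted reply (the first way) can be written out as a small worked example. This is only a sketch: the function name and the sample numbers are made up for illustration, and the scheduling delay and processing time are assumed to be per-batch values in milliseconds read through the StreamingListener interface.

```python
def end_to_end_delay_ms(batch_interval_ms, scheduling_delay_ms, processing_time_ms):
    """Approximate end-to-end delay for one batch, in milliseconds.

    0.5 * batch interval is the average time a record waits for its
    batch to close, assuming records arrive evenly within the interval.
    """
    return 0.5 * batch_interval_ms + scheduling_delay_ms + processing_time_ms


# Illustrative numbers only: 2 s batches, 150 ms scheduling delay,
# 800 ms processing time.
delay = end_to_end_delay_ms(2000, 150, 800)
print(delay)  # 1950.0
```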