If you just want an arbitrary unique id attached to each record in a DStream
(no ordering etc.), then why not generate and attach a UUID to each record?
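
For what it's worth, a minimal sketch of that (assuming a DStream[String]
named lines; the name is illustrative, not from this thread):

    import java.util.UUID

    // Pair every record with a freshly generated, globally unique id.
    // UUIDs are random, so they carry no ordering information.
    val withIds = lines.map(record => (UUID.randomUUID().toString, record))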



On Wed, Aug 27, 2014 at 4:18 PM, Soumitra Kumar <kumar.soumi...@gmail.com>
wrote:

> I see an issue here.
>
> If rdd.id is 1000, then rdd.id * 1e9.toLong would be BIG.
>
> I wish there were a DStream.mapPartitionsWithIndex.
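>
> For reference, transform does expose each batch's RDD, where
> mapPartitionsWithIndex is available. A rough, untested sketch, assuming a
> DStream named dstream:
>
>     dstream.transform { rdd =>
>       // The partition index restarts at 0 in every batch RDD, so on
>       // its own it is not unique across the stream.
>       rdd.mapPartitionsWithIndex { (partitionIndex, iter) =>
>         iter.map(record => (partitionIndex, record))
>       }
>     }
>
> The partition index would still need to be combined with something
> batch-specific, e.g. rdd.id, to be globally unique.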
>
>
> On Wed, Aug 27, 2014 at 3:04 PM, Xiangrui Meng <men...@gmail.com> wrote:
>
>> You can use the RDD id as an offset, which is unique within the same
>> SparkContext. Suppose none of the RDDs would contain more than 1 billion
>> records. Then you can use
>>
>> val rddId = rdd.id  // capture the id on the driver, outside the closure
>> rdd.zipWithUniqueId().mapValues(uid => rddId * 1e9.toLong + uid)
>>
>> Just a hack ..
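>>
>> In a streaming job that might look like the following (a rough, untested
>> sketch):
>>
>>     dstream.transform { rdd =>
>>       val offset = rdd.id * 1e9.toLong  // unique per RDD in this context
>>       rdd.zipWithUniqueId().mapValues(uid => offset + uid)
>>     }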
>>
>> On Wed, Aug 27, 2014 at 2:59 PM, Soumitra Kumar
>> <kumar.soumi...@gmail.com> wrote:
>> > So, I guess zipWithUniqueId will be similar.
>> >
>> > Is there a way to get a unique index?
>> >
>> >
>> > On Wed, Aug 27, 2014 at 2:39 PM, Xiangrui Meng <men...@gmail.com>
>> wrote:
>> >>
>> >> No. The indices start at 0 for every RDD. -Xiangrui
>> >>
>> >> On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar
>> >> <kumar.soumi...@gmail.com> wrote:
>> >> > Hello,
>> >> >
>> >> > If I do:
>> >> >
>> >> > dstream.transform { rdd =>
>> >> >     rdd.zipWithIndex.map { ... }
>> >> > }
>> >> >
>> >> > is the index guaranteed to be unique across all RDDs here?
>> >> >
>> >> > Thanks,
>> >> > -Soumitra.
>> >
>> >
>>
>
>
