Hi Jeff,

Yes, you are absolutely right. The duplicates appear because the RecordReader reuses the same Writable instance for every record. I did not anticipate this, as it worked for text files.
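For anyone who finds this thread later, here is a minimal sketch of the effect in plain Java, with no Spark or Hadoop involved. The names `ReuseDemo`, `MutableText`, `readAliased`, and `readCopied` are hypothetical and exist only for illustration; `MutableText` is a stand-in for a mutable Writable such as Text, not the real class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReuseDemo {
    // Hypothetical stand-in for a mutable Hadoop Writable such as Text.
    static final class MutableText {
        private String value = "";
        void set(String v) { value = v; }
        @Override public String toString() { return value; }
    }

    // Mimics a reader that reuses one holder object for every record:
    // the list ends up with many references to the same object.
    static List<MutableText> readAliased(List<String> records) {
        MutableText shared = new MutableText();
        List<MutableText> out = new ArrayList<>();
        for (String r : records) {
            shared.set(r);   // overwrite the shared holder in place
            out.add(shared); // store a reference, not a copy
        }
        return out;
    }

    // Same loop, but each record is copied into an immutable String
    // before it is stored.
    static List<String> readCopied(List<String> records) {
        MutableText shared = new MutableText();
        List<String> out = new ArrayList<>();
        for (String r : records) {
            shared.set(r);
            out.add(shared.toString()); // copy the current contents
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("a", "b", "c");
        System.out.println(readAliased(records)); // prints [c, c, c]
        System.out.println(readCopied(records));  // prints [a, b, c]
    }
}
```

The aliased list shows the last record repeated, which matches what I saw with the cached RDD. Copying each record first (in Spark, a map to Text.toString() before caching, as you suggested) avoids it.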
Thank you so much for doing this. Your answer is accepted!

Best,
Thamme

--
Thamme Gowda N.
Grad Student at usc.edu
Twitter: @thammegowda
Website: http://scf.usc.edu/~tnarayan/

On Tue, Mar 22, 2016 at 9:00 PM, Jeff Zhang <[email protected]> wrote:

> Zhan's reply on stackoverflow is correct.
>
> Please refer to the comments in sequenceFile.
>
> /** Get an RDD for a Hadoop SequenceFile with given key and value types.
>  *
>  * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each
>  * record, directly caching the returned RDD or directly passing it to an aggregation or shuffle
>  * operation will create many references to the same object.
>  * If you plan to directly cache, sort, or aggregate Hadoop writable objects, you should first
>  * copy them using a map function.
>  */
>
> On Wed, Mar 23, 2016 at 11:58 AM, Jeff Zhang <[email protected]> wrote:
>
>> I think I got the root cause, you can use Text.toString() to solve this
>> issue. Because the Text is shared so the last record display multiple
>> times.
>>
>> On Wed, Mar 23, 2016 at 11:37 AM, Jeff Zhang <[email protected]> wrote:
>>
>>> Looks like a spark bug. I can reproduce it for sequence file, but it
>>> works for text file.
>>>
>>> On Wed, Mar 23, 2016 at 10:56 AM, Thamme Gowda N. <[email protected]>
>>> wrote:
>>>
>>>> Hi spark experts,
>>>>
>>>> I am facing issues with cached RDDs. I noticed that few entries
>>>> get duplicated for n times when the RDD is cached.
>>>>
>>>> I asked a question on Stackoverflow with my code snippet to reproduce
>>>> it.
>>>>
>>>> I really appreciate if you can visit
>>>> http://stackoverflow.com/q/36168827/1506477
>>>> and answer my question / give your comments.
>>>>
>>>> Or at the least confirm that it is a bug.
>>>>
>>>> Thanks in advance for your help!
>>>>
>>>> --
>>>> Thamme
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>
> --
> Best Regards
>
> Jeff Zhang
