---------- Forwarded message ----------
From: Steve Lewis <lordjoe2...@gmail.com>
Date: Wed, Mar 11, 2015 at 9:13 AM
Subject: Re: Numbering RDD members Sequentially
To: "Daniel, Ronald (ELS-SDG)" <r.dan...@elsevier.com>

Perfect - exactly what I was looking for. I'm not quite sure why it is called zipWithIndex, since zipping is not involved.

My code does something like this, where IMeasuredSpectrum is a large class we want to set an index for:

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;
    import scala.Tuple2;

    public static JavaRDD<IMeasuredSpectrum> indexSpectra(JavaRDD<IMeasuredSpectrum> pSpectraToScore) {
        // Pair every spectrum with its 0-based position in the RDD
        JavaPairRDD<IMeasuredSpectrum, Long> indexed = pSpectraToScore.zipWithIndex();
        // Copy the position into the spectrum and drop the pair wrapper
        return indexed.map(new AddIndexToSpectrum());
    }

    public class AddIndexToSpectrum implements Function<Tuple2<IMeasuredSpectrum, java.lang.Long>, IMeasuredSpectrum> {
        // Function's single abstract method is call(T)
        @Override
        public IMeasuredSpectrum call(final Tuple2<IMeasuredSpectrum, java.lang.Long> v1) throws Exception {
            IMeasuredSpectrum spec = v1._1();
            long index = v1._2();
            spec.setIndex(index + 1); // zipWithIndex starts at 0; we want to start at 1
            return spec;
        }
    }

On Wed, Mar 11, 2015 at 6:57 AM, Daniel, Ronald (ELS-SDG) <r.dan...@elsevier.com> wrote:

> Have you looked at zipWithIndex?
>
> *From:* Steve Lewis [mailto:lordjoe2...@gmail.com]
> *Sent:* Tuesday, March 10, 2015 5:31 PM
> *To:* user@spark.apache.org
> *Subject:* Numbering RDD members Sequentially
>
> I have a Hadoop InputFormat which reads records and produces
>
> JavaPairRDD<String,String> locatedData where
> _1() is a formatted version of the file location, like
> "000012690", "000024386", "000027523" ...
> _2() is the data to be processed
>
> For historical reasons I want to convert _1() into an integer
> representing the record number,
> so keys become "00000001", "00000002" ...
>
> (Yes, I know this cannot be done in parallel.) The PairRDD may be too
> large to collect into memory and work on in one machine, but small
> enough to be handled by a single machine.
> I could use toLocalIterator to guarantee execution on one machine, but
> the last time I tried this, all kinds of jobs were launched to get the
> next element of the iterator, and I was not convinced this approach was
> efficient.
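
On the "cannot be done in parallel" remark in the original question: as far as I understand it, zipWithIndex avoids a single-machine pass by first counting the elements in each partition and then letting every partition number its own elements from a precomputed starting offset. Below is a minimal plain-Java sketch of that two-pass idea. It is not Spark code; the class and method names (ZipWithIndexSketch, startOffsets, numberPartition, numberAll) are made up for illustration, the nested lists stand in for RDD partitions, and the 8-digit zero padding just follows the "00000001" example above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Plain-Java model of sequential numbering across partitions (hypothetical names, not Spark API). */
final class ZipWithIndexSketch {

    /** Pass 1: the start offset of a partition is the total size of all partitions before it. */
    static long[] startOffsets(List<List<String>> partitions) {
        long[] offsets = new long[partitions.size()];
        long running = 0L;
        for (int p = 0; p < partitions.size(); p++) {
            offsets[p] = running;
            running += partitions.get(p).size();
        }
        return offsets;
    }

    /** Pass 2: number one partition using only its own offset, so partitions are independent. */
    static List<String> numberPartition(List<String> partition, long offset) {
        List<String> out = new ArrayList<>();
        long index = offset;
        for (String value : partition) {
            index++; // 1-based record number, like index + 1 in the message above
            out.add(String.format(Locale.ROOT, "%08d", index) + "\t" + value);
        }
        return out;
    }

    /** Runs both passes; the loop here is sequential, but each iteration needs no shared state. */
    static List<String> numberAll(List<List<String>> partitions) {
        long[] offsets = startOffsets(partitions);
        List<String> out = new ArrayList<>();
        for (int p = 0; p < partitions.size(); p++) {
            out.addAll(numberPartition(partitions.get(p), offsets[p]));
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("d"),
                Arrays.asList("e", "f"));
        for (String line : numberAll(partitions)) {
            System.out.println(line);
        }
    }
}
```

Only the per-partition counts have to be gathered in one place; the records themselves never need to be collected, which is presumably why this scales better than toLocalIterator for this use.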