> > not quite sure why it is called zipWithIndex since zipping is not involved
It isn't? http://stackoverflow.com/questions/1115563/what-is-zip-functional-programming

On Wed, Mar 11, 2015 at 5:18 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:

> ---------- Forwarded message ----------
> From: Steve Lewis <lordjoe2...@gmail.com>
> Date: Wed, Mar 11, 2015 at 9:13 AM
> Subject: Re: Numbering RDD members Sequentially
> To: "Daniel, Ronald (ELS-SDG)" <r.dan...@elsevier.com>
>
> Perfect - exactly what I was looking for. Not quite sure why it is called
> zipWithIndex, since zipping is not involved.
> My code does something like this, where IMeasuredSpectrum is a large class
> we want to set an index for:
>
> public static JavaRDD<IMeasuredSpectrum> indexSpectra(JavaRDD<IMeasuredSpectrum> pSpectraToScore) {
>     // pair every spectrum with its position in the RDD
>     JavaPairRDD<IMeasuredSpectrum, Long> indexed = pSpectraToScore.zipWithIndex();
>     // write the index back into each spectrum
>     pSpectraToScore = indexed.map(new AddIndexToSpectrum());
>     return pSpectraToScore;
> }
>
> public class AddIndexToSpectrum
>         implements Function<Tuple2<IMeasuredSpectrum, Long>, IMeasuredSpectrum> {
>     @Override
>     public IMeasuredSpectrum call(final Tuple2<IMeasuredSpectrum, Long> v1) throws Exception {
>         IMeasuredSpectrum spec = v1._1();
>         long index = v1._2();
>         spec.setIndex(index + 1);  // 1-based numbering
>         return spec;
>     }
> }
>
> On Wed, Mar 11, 2015 at 6:57 AM, Daniel, Ronald (ELS-SDG) <r.dan...@elsevier.com> wrote:
>
>> Have you looked at zipWithIndex?
>>
>> From: Steve Lewis [mailto:lordjoe2...@gmail.com]
>> Sent: Tuesday, March 10, 2015 5:31 PM
>> To: user@spark.apache.org
>> Subject: Numbering RDD members Sequentially
>>
>> I have a Hadoop InputFormat which reads records and produces
>>
>> JavaPairRDD<String, String> locatedData where
>> _1() is a formatted version of the file location - like
>> "000012690", "000024386", "000027523" ...
>> _2() is the data to be processed
>>
>> For historical reasons I want to convert _1() into an integer
>> representing the record number,
>> so keys become "00000001", "00000002" ...
>>
>> (Yes, I know this cannot be done in parallel.) The PairRDD may be too
>> large to collect() into memory on one machine, but it is small enough to
>> step through sequentially on a single machine.
>> I could use toLocalIterator to guarantee execution on one machine, but
>> last time I tried this all kinds of jobs were launched to get the next
>> element of the iterator and I was not convinced this approach was efficient.
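A note on the naming question at the top of the thread: zipWithIndex is named that way because it behaves as if the RDD were zipped, element by element, with the index sequence 0, 1, 2, ... . Below is a minimal sketch of that equivalence; the class and variable names are illustrative and not taken from the code above, and it assumes only spark-core on the classpath and a local master.

    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class ZipWithIndexDemo {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext("local[2]", "zipWithIndexDemo");

            // Use a single partition for both RDDs so that zip() lines the elements up exactly.
            JavaRDD<String> words = sc.parallelize(Arrays.asList("a", "b", "c"), 1);
            JavaRDD<Long> indices = sc.parallelize(Arrays.asList(0L, 1L, 2L), 1);

            // zipWithIndex pairs each element with its position ...
            List<Tuple2<String, Long>> viaZipWithIndex = words.zipWithIndex().collect();

            // ... which gives the same result as explicitly zipping with the index sequence.
            List<Tuple2<String, Long>> viaZip = words.zip(indices).collect();

            System.out.println(viaZipWithIndex); // [(a,0), (b,1), (c,2)]
            System.out.println(viaZip);          // [(a,0), (b,1), (c,2)]

            sc.stop();
        }
    }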
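To connect zipWithIndex back to the original question about zero-padded sequential keys, a rough sketch along the following lines should work. The names locatedData and renumber are hypothetical, and the 8-digit, 1-based format simply mirrors the "00000001", "00000002" example above. Note that zipWithIndex itself triggers a Spark job when the RDD has more than one partition, because it first has to count the elements in each partition.

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.function.PairFunction;
    import scala.Tuple2;

    public class RenumberKeys {
        public static JavaPairRDD<String, String> renumber(JavaPairRDD<String, String> locatedData) {
            // Pair each (location, data) tuple with its position in the RDD ...
            return locatedData
                    .zipWithIndex()
                    // ... then drop the old location key and zero-pad the index as the new key.
                    .mapToPair(new PairFunction<Tuple2<Tuple2<String, String>, Long>, String, String>() {
                        @Override
                        public Tuple2<String, String> call(Tuple2<Tuple2<String, String>, Long> t) {
                            String newKey = String.format("%08d", t._2() + 1); // 1-based, 8 digits
                            String data = t._1()._2();                         // original _2() payload
                            return new Tuple2<String, String>(newKey, data);
                        }
                    });
        }
    }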