> > not quite sure why it is called zipWithIndex since zipping is not involved
It isn't? http://stackoverflow.com/questions/1115563/what-is-zip-functional-programming

On Wed, Mar 11, 2015 at 5:18 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:

> ---------- Forwarded message ----------
> From: Steve Lewis <lordjoe2...@gmail.com>
> Date: Wed, Mar 11, 2015 at 9:13 AM
> Subject: Re: Numbering RDD members Sequentially
> To: "Daniel, Ronald (ELS-SDG)" <r.dan...@elsevier.com>
>
> Perfect - exactly what I was looking for. Not quite sure why it is called
> zipWithIndex, since zipping is not involved.
> My code does something like this, where IMeasuredSpectrum is a large class
> we want to set an index for:
>
> public static JavaRDD<IMeasuredSpectrum> indexSpectra(JavaRDD<IMeasuredSpectrum> pSpectraToScore) {
>     // pair every spectrum with its position in the RDD
>     JavaPairRDD<IMeasuredSpectrum, Long> indexed = pSpectraToScore.zipWithIndex();
>     // write the index back into each spectrum
>     pSpectraToScore = indexed.map(new AddIndexToSpectrum());
>     return pSpectraToScore;
> }
>
> public class AddIndexToSpectrum
>         implements Function<Tuple2<IMeasuredSpectrum, Long>, IMeasuredSpectrum> {
>     @Override
>     public IMeasuredSpectrum call(final Tuple2<IMeasuredSpectrum, Long> v1) throws Exception {
>         IMeasuredSpectrum spec = v1._1();
>         long index = v1._2();
>         spec.setIndex(index + 1);  // 1-based numbering
>         return spec;
>     }
> }
>
> On Wed, Mar 11, 2015 at 6:57 AM, Daniel, Ronald (ELS-SDG) <r.dan...@elsevier.com> wrote:
>
>> Have you looked at zipWithIndex?
>>
>> From: Steve Lewis [mailto:lordjoe2...@gmail.com]
>> Sent: Tuesday, March 10, 2015 5:31 PM
>> To: user@spark.apache.org
>> Subject: Numbering RDD members Sequentially
>>
>> I have a Hadoop InputFormat which reads records and produces
>>
>> JavaPairRDD<String, String> locatedData where
>> _1() is a formatted version of the file location - like
>> "000012690", "000024386", "000027523" ...
>> _2() is the data to be processed
>>
>> For historical reasons I want to convert _1() into an integer
>> representing the record number,
>> so keys become "00000001", "00000002" ...
>>
>> (Yes, I know this cannot be done in parallel.) The PairRDD may be too
>> large to collect() into memory on one machine, but it is small enough to
>> step through sequentially on a single machine.
>> I could use toLocalIterator to guarantee execution on one machine, but
>> last time I tried this all kinds of jobs were launched to get the next
>> element of the iterator and I was not convinced this approach was efficient.
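A note on the naming question at the top of the thread: zipWithIndex is named that way because it behaves as if the RDD were zipped, element by element, with the index sequence 0, 1, 2, ... . Below is a minimal sketch of that equivalence; the class and variable names are illustrative and not taken from the code above, and it assumes only spark-core on the classpath and a local master.

    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class ZipWithIndexDemo {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext("local[2]", "zipWithIndexDemo");

            // Use a single partition for both RDDs so that zip() lines the elements up exactly.
            JavaRDD<String> words = sc.parallelize(Arrays.asList("a", "b", "c"), 1);
            JavaRDD<Long> indices = sc.parallelize(Arrays.asList(0L, 1L, 2L), 1);

            // zipWithIndex pairs each element with its position ...
            List<Tuple2<String, Long>> viaZipWithIndex = words.zipWithIndex().collect();

            // ... which gives the same result as explicitly zipping with the index sequence.
            List<Tuple2<String, Long>> viaZip = words.zip(indices).collect();

            System.out.println(viaZipWithIndex); // [(a,0), (b,1), (c,2)]
            System.out.println(viaZip);          // [(a,0), (b,1), (c,2)]

            sc.stop();
        }
    }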
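To connect zipWithIndex back to the original question about zero-padded sequential keys, a rough sketch along the following lines should work. The names locatedData and renumber are hypothetical, and the 8-digit, 1-based format simply mirrors the "00000001", "00000002" example above. Note that zipWithIndex itself triggers a Spark job when the RDD has more than one partition, because it first has to count the elements in each partition.

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.function.PairFunction;
    import scala.Tuple2;

    public class RenumberKeys {
        public static JavaPairRDD<String, String> renumber(JavaPairRDD<String, String> locatedData) {
            // Pair each (location, data) tuple with its position in the RDD ...
            return locatedData
                    .zipWithIndex()
                    // ... then drop the old location key and zero-pad the index as the new key.
                    .mapToPair(new PairFunction<Tuple2<Tuple2<String, String>, Long>, String, String>() {
                        @Override
                        public Tuple2<String, String> call(Tuple2<Tuple2<String, String>, Long> t) {
                            String newKey = String.format("%08d", t._2() + 1); // 1-based, 8 digits
                            String data = t._1()._2();                         // original _2() payload
                            return new Tuple2<String, String>(newKey, data);
                        }
                    });
        }
    }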